Avnet Boards Forums
ZedBoard Hardware Design Custom peripheral for computational acceleration

Custom peripheral for computational acceleration

jamestkennedy over 13 years ago

One of the touted uses of the Zynq is as an accelerator for software computation. When I attended X-fest in May, an entire session was devoted to this subject. However, its treatment on this forum has been sparse at best.
I have gone through ZynqGeek's tutorial for creating a custom peripheral on the AXI bus. I can successfully write and read the peripheral registers from an ARM elf file. However, I have been less successful in extending the user logic to make use of the peripheral registers to do parallel computation in custom HDL. When I attempt to include a module (and its sub-modules) that was verified in post-translate simulation in the user_logic (.v) module, it does not work. XST ends up trimming the submodule, so I have obviously not connected it properly.
The Silica tutorial that one post mentioned is outdated and only adds an output port to be fed to external pins. What I am interested in is what was discussed in the Zynq acceleration session.
Do you have any reference designs that include such a custom peripheral, in which a coprocessor for, say, a Dantzig simplex linear programming matrix solution could be implemented on the ZedBoard? I have initially chosen AXI bus slave registers as the means to supply the M by N matrix, but realize that using AXI DMA and/or the ACP may be more efficient.
This area of PS/PL collaboration for advanced embedded designs seems an appropriate step for advancing the knowledge base here. Can we expect reference designs of this nature?
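
For readers following along, the register-based transfer described above can be modelled in software. This is only a host-side sketch, not the poster's design: it assumes single-precision elements, one 32-bit slave register per element in row-major order, and a hypothetical `RegisterFile` class standing in for the memory-mapped peripheral. On the board, the writes would instead be stores from the ARM elf to the peripheral's address range.

```python
import struct


class RegisterFile:
    """Hypothetical stand-in for a bank of 32-bit AXI slave registers."""

    def __init__(self, count):
        self.regs = [0] * count
        self.writes = 0  # number of bus writes issued so far

    def write(self, index, word):
        self.regs[index] = word & 0xFFFFFFFF
        self.writes += 1

    def read(self, index):
        return self.regs[index]


def float_to_word(value):
    """Reinterpret a single-precision float as its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def word_to_float(word):
    """Reinterpret a 32-bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", word))[0]


def load_matrix(regfile, matrix):
    """Write an M by N matrix row by row, one element per register."""
    index = 0
    for row in matrix:
        for value in row:
            regfile.write(index, float_to_word(value))
            index += 1
    return index  # registers consumed


tableau = [[1.0, 2.0, 0.5], [0.25, -3.0, 4.0]]  # illustrative 2 by 3 matrix
rf = RegisterFile(count=6)
used = load_matrix(rf, tableau)
readback = [word_to_float(rf.read(i)) for i in range(used)]
print(used, rf.writes, readback)
```

In this model every element costs one bus write, so the transfer cost grows with M times N, which is one way to see why DMA or the ACP may be more efficient for larger matrices.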

  • jamestkennedy over 13 years ago

    Well, perusing my ISE synthesis, I found that I had not connected my simplex accelerator's reset port. After correcting this, the PlanAhead synthesis included the user_logic module (mine is Verilog), and PAR uses 48% of the LUTs on the Zynq 7020 (think it will work?). However, Bitgen squawks about the use of the AXI bus clock with numerous "WARNING:PhysDesignRules:372 - Gated clock" messages, but still creates the bit file. Next I will attempt to use it from an ARM elf. Any comments?

  • jamestkennedy over 13 years ago

    Regarding my user_logic module, I have found that when my computational module's output ports are driven, 62% of the slices and 48% of the LUTs are used. With this .bit, I can no longer write and read the AXI slave registers of the custom peripheral. If I leave the ports undriven, the computational module is trimmed, and I can write/read the registers again. Is this related to the clocking issues reported, or is the FPGA saturated at this level? My graduate professor in FPGA synthesis spoke of a ceiling in device utilization above which things begin to fail. Any thoughts? I think I will modularize the ports in my design and make each AXI register specifically read or write, and see if this makes any difference in functionality. My TableauSimplex module was tested in post-synthesis/translation simulation (it has too many ports to pass mapping) and found to be functional. Again, it would be nice to see reference designs that accomplish the custom peripheral linkage that I am attempting. Thanks.

  • jamestkennedy over 13 years ago

    I have a working system with the simple AXI register interface. I modified the user_logic module so that some registers are used exclusively as inputs, driving the input ports of the Simplex submodule, and others exclusively as outputs, receiving its output ports. My PL design uses 46% of the LUTs and 56% of the slices on the Zynq 7020. It includes a hierarchy of 99 Verilog modules with one-hot FSMs. At its most parallel point, 30 FSMs run concurrently, with 14 actively transitioning. Compared with execution on the PS ARM, the accelerator takes about half the time with a very conservative synchronous approach (20 us versus 38 us for ARM C). However, when data and commands are sent to and received from the user logic through AXI registers, the overhead of all the register puts and gets makes the total about a quarter longer than the ARM code (48 us versus 38 us for ARM C).
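
The figures above can be worked through as a short calculation. This sketch uses only the three timings quoted in the post and assumes, since the post does not say so explicitly, that the 48 us total includes the 20 us of accelerator computation, with the remainder being register transfer overhead.

```python
# Figures quoted in the post, in microseconds.
ARM_C_US = 38.0          # computation in C on the PS ARM
PL_COMPUTE_US = 20.0     # accelerator computation alone
PL_WITH_AXI_US = 48.0    # accelerator including AXI register puts and gets

# Speedup of the bare computation over the ARM C version.
compute_speedup = ARM_C_US / PL_COMPUTE_US

# Assumption: everything beyond the computation is register traffic.
axi_overhead_us = PL_WITH_AXI_US - PL_COMPUTE_US

# Total accelerator time relative to the ARM C version.
end_to_end_ratio = PL_WITH_AXI_US / ARM_C_US

print(f"compute-only speedup: {compute_speedup:.2f}x")
print(f"register transfer overhead: {axi_overhead_us:.0f} us")
print(f"end-to-end time vs ARM C: {end_to_end_ratio:.2f}x")
```

Under that assumption, the register traffic alone takes longer than the computation it feeds, which is why the end-to-end result is slower than the ARM code.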

    Clearly I need to explore more efficient methods of linking the PS and the PL of the accelerator. For the scale of my matrix, I think the ACP with an ARM AXI interrupt is the likely candidate. For a more mature approach to large-scale matrix RSM (Revised Simplex Method) solutions, DMA and more PL (7045 and beyond, with attached Virtex 7s) is a likely platform candidate for a macro smart grid solution.

    BUT, you've got to walk before you run, so I am very pleased to get this accelerator working with such extensive use of the ZedBoard 7020 PL.

    I anxiously await more ref designs to guide my further forays into this subject of computation acceleration.

  • jamestkennedy over 13 years ago

    Adjusting the user logic FSM, I shaved the PL cycle time to 33 us, thereby making it a true accelerator over the PS (38 us) by 15%. There is more to be had I am sure!

  • jamestkennedy over 13 years ago

    Isolating the computation from the interface ports, I arrived at 30 us; I predict I will reach 25 us with retiming... and this is where scale will eventually push the acceleration to multiple orders of magnitude with more appropriate matrices (10k?). Besides, we will be using real hardware beyond the toy ZedBoard, viz. 7040s with attached banks of Virtex 7s. But the crux of this exercise now is the use of DMA via the ACP. (The data may eventually arrive directly from Ethernet DMA into memory.) Anybody else looking forward to receiving a Parallella too?

  • Former Member over 13 years ago

    Hi James,

    Congrats on your success with PL/PS co-operation.
    It would be great if you could come up with a small tutorial and tell us how you did it.

    I can't see Xilinx coming up with any such ref designs soon... :(

    Thanks,
    Anup.

  • Former Member over 12 years ago

    Hey, incidentally I'm trying to do a similar thing: Accelerating linear algebra operations using the PL. I've already done this over the AXI GPIO stuff from the SpeedWay tutorials (not very fast) and over the AXI HP slave port (faster, but still takes forever to transmit data). Have you managed to get an ACP setup working yet?

  • jamestkennedy over 12 years ago in reply to Former Member

    The AXI slave registers are what I used to provide the dataset to my custom accelerator.
    See the thread "PS/PL BRAM share" for my progress on using the ACP.
    I used the ACP to move datasets into BRAM instantiated in XPS. But then I got bogged down when trying to access the BRAM from within my custom IP user logic. The resulting dual-bus AXI and BRAM interface module is giving me problems: the user_logic modules are trimmed in synthesis because of the way the BRAM access is coded in HDL. I needed to add the BRAM interface to my IP's .mpd. XST is somehow setting the BRAM ports to constant values and trimming the module.
    So I changed to the AXI burst mode in the CIPW and am now working with that mode of transferring the dataset. This creates inferred BRAM within the IP user_logic, which you can use to transfer datasets and results. However, I don't think burst mode is supported in AXI_Lite, so I think you need to convert your XPS design from AXI_Lite to AXI.
    I am still working on both versions, though. I can't see why you can't access the BRAM from your IP, and I am reading the 400+ pages of HDL for the Xilinx AXI_BRAM_CTRL IP to get insight into how AXI and BRAM bus interfaces can coexist.
    I am curious: what is the size of your dataset, and did you use the inferred BRAM method (AXI burst capable)?

  • Former Member over 12 years ago in reply to jamestkennedy

    Hey,

    Sorry I took so long to reply; somehow the forum did not send me a notice of any kind about your comment here.
    My dataset is around 2048 bytes per transmission (two 16x16 matrices of single-precision floats), and I did not take any special precautions for BRAM. I actually managed to simply hook up an AXI master burst peripheral created using the CIP wizard to the ACP bus, and throughput has increased a lot compared to the HP0 bus. This is mainly due to the ACP removing the need for cache flushes and invalidates, which take a lot of time.
    What's your reason for using BRAMs? Are your datasets larger?
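
The dataset size quoted above can be checked with a few lines of arithmetic. The beat and burst counts in this sketch rest on assumptions that are not stated in the thread: a 64-bit data bus and AXI3 bursts of at most 16 beats. Treat those two constants as placeholders to adjust for the actual interface.

```python
import struct

MATRICES = 2
ROWS = COLS = 16
BYTES_PER_FLOAT = struct.calcsize("<f")  # 4 bytes for single precision

payload_bytes = MATRICES * ROWS * COLS * BYTES_PER_FLOAT

# Assumptions, not stated in the thread: a 64-bit data bus and AXI3 bursts
# of at most 16 beats.
BUS_BYTES = 8
MAX_BEATS_PER_BURST = 16

beats = payload_bytes // BUS_BYTES
bursts = -(-beats // MAX_BEATS_PER_BURST)  # ceiling division

print(payload_bytes, beats, bursts)
```

The payload works out to the 2048 bytes mentioned in the post; the number of bursts depends entirely on the assumed bus width and burst length.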
