Custom peripheral for computational acceleration

One of the touted uses of the Zynq is as an accelerator of software computation. When I attended X-fest in May, an entire session was devoted to this subject. However, its treatment on this forum has been sparse at best.
I have gone through ZynqGeek's tutorial for creating a custom peripheral on the AXI bus. I can successfully write to and read from the peripheral registers from an ARM ELF file. However, I have been less successful in extending the user logic to make use of the peripheral registers to do parallel computation in custom HDL. When I attempt to include a module (and its submodules) that was verified in post-translate simulation into the user_logic (.v) module, it does not work. XST ends up trimming the submodule, obviously because I have not done it properly.
The Silica tutorial that one post mentioned is outdated and only adds an output port to be fed to external pins. What I am interested in is what was discussed in the Zynq acceleration session.
Do you have any reference designs that include such a custom peripheral, in which a coprocessor for, say, a Dantzig simplex linear programming matrix solution could be implemented on the ZedBoard? I have initially chosen AXI bus slave registers as the means to supply the M by N matrix, but realize that using AXI DMA and/or the ACP may be more efficient.
This area of PS/PL collaboration for advanced embedded designs seems an appropriate step for advancing the knowledge base here. Can we expect reference designs of this nature?
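
For readers following along, the register write and read described in this post can be sketched in C roughly as follows. This is a minimal sketch, not the tutorial's code: the register count `NUM_SLV_REGS`, the helper names, and the choice to pass the base pointer as a parameter are all assumptions, made so that the same functions can be exercised against an ordinary array on a host. On the board, the base pointer would be whatever base address your own design assigns to the peripheral.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical number of 32-bit slave registers in the custom peripheral. */
#define NUM_SLV_REGS 8u

/* Write one 32-bit value to slave register `index`. */
void periph_write(volatile uint32_t *base, size_t index, uint32_t value)
{
    assert(index < NUM_SLV_REGS);
    base[index] = value;
}

/* Read one 32-bit value back from slave register `index`. */
uint32_t periph_read(volatile uint32_t *base, size_t index)
{
    assert(index < NUM_SLV_REGS);
    return base[index];
}

/* Write a distinct test pattern to every register, then read each back.
 * Returns the number of registers whose readback did not match. */
size_t periph_selftest(volatile uint32_t *base)
{
    size_t mismatches = 0;
    for (size_t i = 0; i < NUM_SLV_REGS; ++i) {
        periph_write(base, i, 0xA5A50000u | (uint32_t)i);
    }
    for (size_t i = 0; i < NUM_SLV_REGS; ++i) {
        if (periph_read(base, i) != (0xA5A50000u | (uint32_t)i)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A mismatch count of zero only shows that the registers behave as plain read/write storage. If the user logic drives a register, a readback would be expected to return what the logic drives rather than what software wrote, so a self-test like this would need adjusting for such a design.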

0 jamestkennedy over 13 years ago

Well, perusing my ISE synthesis output, I found that I had not connected my simplex accelerator's reset port. With this corrected, the PlanAhead synthesis includes the user_logic module (mine is Verilog), and the design places and routes using 48% of the LUTs on the Zynq 7020 (think it will work?). However, BitGen squawks about the use of the AXI bus clock with numerous "WARNING:PhysDesignRules:372 - Gated clock" messages, but still creates the bit file. Next I will attempt to use it from an ARM ELF. Any comments?

0 jamestkennedy over 13 years ago

Regarding my user_logic module, I have found that when my computational module's output ports are driven, 62% of the slices and 48% of the LUTs are used. With this .bit, I can no longer write and read the AXI slave registers of the custom peripheral. If I leave the ports undriven, the computational module is trimmed, and I can write/read the registers again. Is this related to the clocking issues reported, or is the FPGA saturated at this level? My graduate professor in FPGA synthesis spoke of a ceiling in device utilization where things begin to fail. Any thoughts? I think I will modularize the ports in my design and make each AXI register specifically read or write, and see if this makes any difference in functionality. My TableauSimplex module was tested in post-synthesis/translation simulation (too many ports to pass mapping) and found to be functional. Again, it would be nice to see reference designs that accomplish the custom peripheral linkage that I am attempting. Thanks.

0 jamestkennedy over 13 years ago

I have a working system with the simple AXI register interface. I modified the user_logic module to use some registers exclusively as inputs, assigned to the input ports of the Simplex submodule, and others as output registers that receive its output ports. My PL design uses 46% of the LUTs and 56% of the slices on the Zynq 7020. It includes a hierarchy of 99 Verilog modules and one-hot FSMs. At its most parallel point, 30 FSMs are concurrent, with 14 actively transitioning. Compared with execution on the PS ARM, the accelerator takes about half the time with a very conservative synchronous approach (20 us versus 38 us for ARM C). However, when data and commands are sent to and received from the user logic through AXI registers, the overhead of all the register puts and gets makes the total about a quarter longer than the ARM code (48 us versus 38 us for ARM C).
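
The timing comparison in this post can be made explicit with a little arithmetic. The sketch below is illustrative only: the function names are made up, and it assumes the end-to-end figure is simply PL computation plus register traffic, with no overlap between the two.

```c
#include <assert.h>

/* Time spent purely on moving data and commands through the AXI registers. */
double interface_overhead_us(double total_us, double compute_us)
{
    assert(total_us >= compute_us);
    return total_us - compute_us;
}

/* Speedup of the accelerator relative to software; above 1.0 the PL wins. */
double speedup(double software_us, double accelerator_us)
{
    assert(accelerator_us > 0.0);
    return software_us / accelerator_us;
}

/* Largest interface overhead that still lets the accelerator break even. */
double break_even_overhead_us(double software_us, double compute_us)
{
    return software_us - compute_us;
}
```

With the figures quoted above (20 us of PL computation, 48 us end to end, 38 us for ARM C), `interface_overhead_us(48, 20)` gives 28 us of register traffic, while `break_even_overhead_us(38, 20)` gives a budget of only 18 us. Under the stated assumption, the interface rather than the computation is what needs to shrink.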

Absolutely, I need to explore more efficient methods of linking the PS and the PL of the accelerator. For the scale of my matrix, I think the ACP with an ARM AXI interrupt is the likely candidate. For a more mature approach to large-scale matrix RSM (Revised Simplex Method) solutions, DMA and more PL (7045 and beyond, with attached Virtex 7s) is a likely platform candidate for a macro smart grid solution.

BUT, gotta walk before you run, so I am very pleased to get this accelerator, with such extensive use of the ZedBoard 7020 PL, working.

I anxiously await more ref designs to guide my further forays into this subject of computation acceleration.

0 jamestkennedy over 13 years ago

Adjusting the user logic FSM, I shaved the PL cycle time to 33 us, thereby making it a true accelerator over the PS (38 us) by 15%. There is more to be had I am sure!

0 jamestkennedy over 13 years ago

Isolating the computation from the interface ports, I arrived at 30 us; I predict I will reach 25 us with retiming. And here is where scale will eventually set the acceleration to multiple orders of magnitude with more appropriate matrices (10k?). Besides, we will be using real hardware beyond the toy ZedBoard, viz. 7040s with attached banks of Virtex 7s. But the crux of this exercise now is the use of DMA via the ACP. (These data may eventually arrive directly from Ethernet DMA into memory.) Anybody else looking forward to receiving a Parallella too?

0 Former Member over 13 years ago

Hi James,

Congrats on your success with PL/PS co-operation.
It would be great if you could come up with a small tutorial and tell us how you did it.

I can't see Xilinx coming up with any such ref designs soon... :(

Thanks,
Anup.

0 Former Member over 13 years ago

Hey, incidentally I'm trying to do a similar thing: Accelerating linear algebra operations using the PL. I've already done this over the AXI GPIO stuff from the SpeedWay tutorials (not very fast) and over the AXI HP slave port (faster, but still takes forever to transmit data). Have you managed to get an ACP setup working yet?

0 jamestkennedy over 13 years ago in reply to Former Member

The AXI slave registers are what I used to provide the dataset to my custom accelerator.
See the thread "PS/PL BRAM share" to see my progress on using the ACP.
I used the ACP to move datasets into BRAM instantiated in XPS. But then I got bogged down when trying to access the BRAM from within my custom IP user logic. The resulting module with dual AXI and BRAM bus interfaces is giving me problems: the user_logic modules are trimmed in synthesis because of the way the BRAM access is coded in HDL. I needed to add the BRAM interface to my IP's .mpd. XST is somehow setting the BRAM ports to constant values and trimming the module.
So I changed to the AXI burst mode in the CIPW and am now working with that mode of transferring the dataset. This creates inferred BRAM within the IP user_logic, which you can use to transfer datasets and results. However, I don't think burst mode is supported in AXI_Lite, so I think you need to convert your XPS design from AXI_Lite to AXI.
I still work on both versions, though. I can't see why you can't access the BRAM from your IP, and I am reading the 400+ pages of HDL for the Xilinx AXI_BRAM_CTRL IP to gain insight into how to have AXI and BRAM bus interfaces coexist.
I am curious: what is the size of your dataset, and did you use the inferred BRAM method (AXI burst capable)?
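
From the software side, staging a dataset into a memory window such as the inferred BRAM described above might look something like the sketch below. Everything here is an assumption for illustration: the function names, the row-major layout, the use of single-precision floats, and the idea that the window is simply an array of 32-bit words. Whether the writes actually reach the peripheral as bursts depends on how the transfer is issued (CPU stores versus a DMA engine), which this sketch does not address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a rows x cols matrix of single-precision floats, stored row-major,
 * into a word-addressed window. Assumes a float is 32 bits wide.
 * Returns the number of 32-bit words written. */
size_t load_matrix(volatile uint32_t *window, size_t window_words,
                   const float *matrix, size_t rows, size_t cols)
{
    size_t words = rows * cols;
    assert(sizeof(float) == sizeof(uint32_t));
    assert(words <= window_words);
    for (size_t i = 0; i < words; ++i) {
        uint32_t bits;
        memcpy(&bits, &matrix[i], sizeof bits); /* copy the float's bit pattern */
        window[i] = bits;
    }
    return words;
}

/* Read one element back from the window and reinterpret it as a float. */
float read_element(const volatile uint32_t *window, size_t cols,
                   size_t row, size_t col)
{
    uint32_t bits = window[row * cols + col];
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Using `memcpy` for the float-to-word conversion avoids pointer-cast aliasing problems and normally compiles down to a plain move.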

0 Former Member over 13 years ago in reply to jamestkennedy

Hey,

Sorry I took so long to reply; somehow the forum did not send me a notice of any kind about your comment here.
My dataset is around 2048 bytes per transmission (two 16x16 matrices of single-precision floats) and I did not
take any special precautions for BRAM - I actually managed to simply hook up an AXI master burst peripheral
created using the CIP wizard to the ACP bus, and throughput has increased a lot compared to the HP0 bus. This
is mainly due to the ACP removing the need for cache flushes and invalidates, which take a lot of time.
What's your reason for using BRAMs? Are your datasets larger?
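
The difference described here between the HP port and the ACP can be sketched as follows. This is a hedged illustration, not the poster's code: the hook type, function names, and host-side stand-ins are all hypothetical, and it assumes the ACP transactions are actually configured to be cache coherent. On the Xilinx standalone BSP the two cache hooks would typically wrap `Xil_DCacheFlushRange` and `Xil_DCacheInvalidateRange` from `xil_cache.h`; check the signatures in your BSP version before relying on that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hook type for a cache maintenance routine over [addr, addr + len). */
typedef void (*cache_op_fn)(uintptr_t addr, size_t len);

/* Bytes in `count` square matrices of dimension `n` with 4-byte elements. */
size_t dataset_bytes(size_t count, size_t n)
{
    return count * n * n * sizeof(uint32_t);
}

/* Hand a buffer to the PL, let it work, and make results visible to the CPU.
 * `coherent` is nonzero when the PL master sits on a cache-coherent port
 * (the ACP case); no explicit cache maintenance is issued then.
 * Returns the number of cache maintenance calls made. */
int run_transfer(void *buf, size_t len, int coherent,
                 cache_op_fn flush, cache_op_fn invalidate,
                 void (*start_and_wait)(void *buf, size_t len))
{
    int ops = 0;
    assert(buf != NULL && start_and_wait != NULL);
    if (!coherent) {
        flush((uintptr_t)buf, len);       /* push CPU writes out to DDR */
        ++ops;
    }
    start_and_wait(buf, len);             /* PL reads inputs, writes results */
    if (!coherent) {
        invalidate((uintptr_t)buf, len);  /* drop stale cached copies */
        ++ops;
    }
    return ops;
}

/* Host-side stand-ins so the sketch can be exercised off-target. */
void noop_cache_op(uintptr_t addr, size_t len) { (void)addr; (void)len; }
void noop_start_and_wait(void *buf, size_t len) { (void)buf; (void)len; }
```

`dataset_bytes(2, 16)` reproduces the 2048 bytes quoted above. In the non-coherent path every transfer pays for one flush and one invalidate over the buffer, which is the cost the ACP path avoids.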
