Zedboard 512x512 matrices, % utilization problem

My objective is to read seven 512x512 float matrices from the SD card into DDR memory (this step is already done, with each matrix occupying about 1 MB), pass them from DDR to my custom IP block through an AXI DMA block, normalize them inside the custom IP block, and then write them back to DDR memory (also through the AXI DMA block).

I'm building my custom IP block in Vivado HLS, following the steps in this Xilinx application note (which should be the ideal way to do this, since it's from Xilinx): http://www.xilinx.com/support/documentation/application_notes/xapp1170-zynq-hls.pdf

It works for a 32x32 matrix.

But unfortunately, when I increase the matrix dimensions to 512x512, even if the block only multiplies each matrix element by 2.0, the BRAM_18K utilization is 365%!!

What can I do to drastically decrease the percentage of resources used? I'll need to do lots of operations on the matrices inside the custom IP block, and if a simple multiplication by 2.0 uses 365% of the BRAMs, a solution that brings this example down to 80-90% is not good enough. What I'm looking for is a solution that brings the BRAM utilization to around 5% in this example.
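A rough back-of-the-envelope check suggests where a figure like 365% could come from. The sketch below assumes (none of this is stated in the post) that each BRAM_18K is used as 512 words of 32 bits, that the Zynq-7020 provides 280 BRAM_18K, and that the design keeps one full input matrix and one full output matrix on chip. The function names are hypothetical.

```cpp
// Rough BRAM_18K estimate for full-matrix float buffers held on chip.
// Assumptions (not from the post): 512 x 32-bit words per BRAM_18K,
// 280 BRAM_18K available on a Zynq-7020.
constexpr int kWordsPerBram18k = 512;
constexpr int kBram18kOn7020 = 280;

// Number of BRAM_18K blocks needed to hold one rows x cols float matrix.
constexpr int brams_for_matrix(int rows, int cols) {
    return (rows * cols + kWordsPerBram18k - 1) / kWordsPerBram18k;
}

// Integer utilization percentage when `buffers` full matrices are stored.
constexpr int utilization_percent(int rows, int cols, int buffers) {
    return 100 * buffers * brams_for_matrix(rows, cols) / kBram18kOn7020;
}

// One input and one output buffer, as assumed above.
static_assert(brams_for_matrix(512, 512) == 512, "512 BRAM_18K per matrix");
static_assert(utilization_percent(512, 512, 2) == 365, "matches the report");
static_assert(utilization_percent(32, 32, 2) == 1, "32x32 fits easily");
```

Under these assumptions a single 512x512 float buffer already exceeds the device, so any approach that keeps whole matrices in BRAM cannot reach the 5% target; only the amount of data held on chip at once can be reduced.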

Former Member, over 10 years ago

This question is Xilinx tool related and is probably better targeted to one of the Xilinx Community forums, perhaps the HLS forum in particular: http://forums.xilinx.com/t5/High-Level-Synthesis-HLS/bd-p/hls

That said, as I am sure you know, multiplying 512x512 float matrices is a pretty hefty operation, and you may simply need to target a Zynq device with more BRAM than the 7020 device on the ZedBoard. You might want to look at a PicoZed 7030 or one of the Zynq MMP or Mini-ITX boards with a 7045 or 7100 device.

-Gary

jmales, over 10 years ago, in reply to Former Member

I also tried posting there, but that community is really slow and no one answers...

So, in your opinion, I can't even multiply 512x512 matrices on a ZedBoard? Then what is the FPGA side of the ZedBoard for? I see no use for it if it can't even do a simple operation (in the software world, as you know, operations on 512x512 matrices are considered very simple).

zedhed, over 10 years ago

Hi jmales,

Would you please post the link to your thread in the Xilinx forums so that others who come across this thread can follow it?

To be fair to those who may come across this post, I think it is important to put things into perspective. In the software world, a 512x512 matrix multiply is straightforward to write, but it is not a computationally fast task.

According to this independent source, it can take on the order of 12,805,000 µs (about 12.8 s) to complete this operation on a Pentium 4 processor:

http://csg.csail.mit.edu/pubs/memos/Memo-503/memo503.pdf

Since the computation can be parallelized, the compute time could potentially be reduced to a fraction of the software compute time within a programmable logic device (like the Zynq 7020 device on the ZedBoard). Fully unrolling the loops and pipelining the datapath, however, can require a lot of resources. You might be able to lower the utilization of the programmable logic by reusing some portions of the logic, but reducing device utilization this way comes with trade-offs in computation time.
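As one illustration of the low-resource end of that trade-off, here is a minimal sketch for the element-wise case described in the question (each element multiplied by 2.0). It is not taken from XAPP1170: the function name and the plain-pointer interface are hypothetical, and the interface directives a real AXI DMA connection would need are left out. The loop reuses a single multiplier and is pipelined rather than unrolled, and each element is written out as soon as it is computed, so no full-matrix buffer is required.

```cpp
#include <cstddef>

// Hypothetical top-level function; name and interface are illustrative only.
// A standard C++ compiler ignores the HLS pragma, so this also runs as
// ordinary software for testing.
void scale_stream(const float* in, float* out, std::size_t n, float gain) {
    for (std::size_t i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        // Read one element, scale it, and write it straight back out.
        out[i] = in[i] * gain;
    }
}
```

A normalization that depends on a global quantity (for example a maximum or a mean) would presumably need a first pass over the data to compute that quantity before a second pass like this one can apply it.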

I suggest taking a look at our "Designing Accelerators for the Zynq®-7000 All Programmable SoC" course on the training page as an introduction to this type of problem.

http://microzed.org/support/trainings-and-videos

This might help you see whether something like Vivado HLS pragmas could solve your problem with acceptable results. We ran into a (loosely) similar device utilization problem when porting those labs from the ZedBoard 7020 to a MicroZed 7010, but were able to work around it by constraining the design with "block factor" directives for array partitioning. I don't think this will get you to 5% BRAM utilization, but it will reduce the BRAM utilization dramatically. Take a look at the Vivado HLS user guide for more information on the ARRAY_PARTITION directive that we used.
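For readers unfamiliar with the directive, the following is a minimal sketch of what a block-factor partition can look like. It is not the code from the labs: the tile size, the factor of 4, and all names are assumptions chosen for illustration. A block partition with factor 4 splits the buffer into four contiguous banks, so the inner loop below touches one element from each bank per iteration, which lets the unrolled operations proceed in parallel without a complete partition of the array.

```cpp
constexpr int kTile = 64;                 // hypothetical tile size
constexpr int kBanks = 4;                 // must match "factor=4" below
constexpr int kBankLen = kTile / kBanks;  // elements per contiguous bank

// Scales one tile; a standard C++ compiler simply ignores the HLS pragmas.
void scale_tile(const float tile_in[kTile], float tile_out[kTile], float gain) {
    float buf[kTile];
#pragma HLS ARRAY_PARTITION variable=buf block factor=4 dim=1

    // Load the tile into the partitioned on-chip buffer.
    for (int i = 0; i < kTile; ++i) {
#pragma HLS PIPELINE II=1
        buf[i] = tile_in[i];
    }

    // Each outer iteration updates index i inside every bank.
    for (int i = 0; i < kBankLen; ++i) {
#pragma HLS PIPELINE II=1
        for (int b = 0; b < kBanks; ++b) {
#pragma HLS UNROLL
            buf[b * kBankLen + i] *= gain;
        }
    }

    // Write the scaled tile back out.
    for (int i = 0; i < kTile; ++i) {
#pragma HLS PIPELINE II=1
        tile_out[i] = buf[i];
    }
}
```

The factor sets how many banks are created, so a smaller factor uses fewer memories at the cost of less parallel access; the right value depends on the design and would need to be checked in the synthesis report.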

Keep in mind, as Gary suggested, that if your computation is sufficiently complex, you might simply need a larger device than the Zynq-7020 in order to have enough programmable logic resources available.

Regards,

-Kevin
