I'm following the 3-part training "using a HLS stream IP with DMA" on the PYNQ community. This blog will not repeat the steps. The goal is to document the experience.
Use the Accelerated function in Software
In part 1, we made the hardware accelerated function example(stream &in, stream &out).
In part 2, we created a Vivado hardware design with the accelerator IP included. This makes the function available for your programs.
In this part, that accelerated function is used in a Python program.
Refresher 1: What does the function do?
The example function accepts a stream of integers, adds the constant 5 to each value in that stream, and outputs the result. This was the accelerated IP made in Vitis HLS in training 1.
tmp.data = tmp.data.to_int() + 5;
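For checking the hardware results later, the same behaviour can be modelled in plain Python. This is a minimal sketch, not the HLS source: the name example_model is my own, and it ignores the stream side-channel signals and 32-bit overflow.

```python
def example_model(values):
    """Software reference for the accelerator: add 5 to every element."""
    return [int(v) + 5 for v in values]


# Element 14 of the input becomes 19 in the output.
print(example_model([0, 1, 14]))  # prints [5, 6, 19]
```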
Refresher 2: What does the resulting hardware design look like?
In training 2, the FPGA design was created to allow DMA data exchange between the function and the ARM part of the Zynq.
Next step: Load and activate the accelerated design into the FPGA
For the Zynq, the result looks the same as any other FPGA design. It's a set of IPs that are synthesized, implemented and written to a bitfile.
We're using a PYNQ board, and the way to load the design into the hardware is by using the overlay functions.
For convenience, an alias is created for the DMA parts and the accelerated IP. These are the parts that we'll interact with from the Python code.
Then, the accelerated IP is enabled.
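The alias and enable steps can be sketched as one small helper. It expects an already loaded overlay, which on the board would come from pynq.Overlay with the path to your bitfile. The block names axi_dma and example are assumptions and have to match the names in your own Vivado design. The register offset 0x00 assumes the usual HLS control register layout, where bit 0 is ap_start and bit 7 is auto_restart.

```python
def setup_accelerator(overlay, dma_name="axi_dma", ip_name="example"):
    """Return (dma_send, dma_recv, hls_ip) aliases and start the HLS IP.

    overlay is a loaded pynq.Overlay, e.g. Overlay("design.bit") on the board.
    dma_name and ip_name are assumptions: use the block names from your
    own Vivado design.
    """
    dma = getattr(overlay, dma_name)
    hls_ip = getattr(overlay, ip_name)

    # Aliases for the two DMA directions (ARM -> IP and IP -> ARM).
    dma_send = dma.sendchannel
    dma_recv = dma.recvchannel

    # Assumed HLS control register at offset 0x00:
    # bit 0 = ap_start, bit 7 = auto_restart, so 0x81 sets both.
    CONTROL_REGISTER = 0x00
    hls_ip.write(CONTROL_REGISTER, 0x81)

    return dma_send, dma_recv, hls_ip
```

Passing the overlay in, instead of loading it inside the helper, keeps the sketch independent of the bitfile location.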
Run the Accelerated Function
As with any function, you need to declare the variables that hold the input and the result.
We're using a buffer of 100 unsigned integers here, for both input and output.
Initialise the input buffer with test values
We'll send 100 different values to the function, as a test. Each position in the buffer has the value of its index. For example, element 14 in the buffer will have a value of 14.
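Filling the buffer can be sketched like this. The helper name fill_with_index is hypothetical. On the board the buffer would be a DMA-capable array from pynq.allocate; a plain list stands in here so the sketch runs anywhere.

```python
def fill_with_index(buffer):
    """Set each element to its own index, so buffer[14] == 14."""
    for i in range(len(buffer)):
        buffer[i] = i
    return buffer


# On the board the buffers would come from pynq.allocate, for example
#   input_buffer = allocate(shape=(100,), dtype=np.uint32)
# A plain list of 100 zeros is used as a stand-in.
input_buffer = fill_with_index([0] * 100)
print(input_buffer[14])  # prints 14
```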
We send the data to the IP by enabling the input DMA. The results are retrieved by enabling the output DMA.
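The two DMA steps above can be sketched as a single function. It uses the transfer() and wait() methods of the PYNQ DMA send and receive channels. The function name run_once and its parameter names are my own.

```python
def run_once(dma_send, dma_recv, input_buffer, output_buffer):
    """Push input_buffer through the IP and collect the result."""
    # Start both transfers, then block until each channel reports done.
    dma_send.transfer(input_buffer)
    dma_recv.transfer(output_buffer)
    dma_send.wait()
    dma_recv.wait()
    return output_buffer
```

On the board, dma_send and dma_recv would be the sendchannel and recvchannel of the DMA block in the overlay, and both buffers would come from pynq.allocate.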
That's it. We've now executed the hardware accelerated function one time. It returned the 100 processed elements. We're showing the first 10 for evaluation.
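Beyond eyeballing the first 10 values, the whole buffer can be checked against the expected "input plus 5" behaviour. A minimal sketch, with a hypothetical helper name check_output:

```python
def check_output(input_buffer, output_buffer, offset=5, show=10):
    """Print the first `show` results and confirm every element is input + offset."""
    for i in range(min(show, len(output_buffer))):
        print(f"{input_buffer[i]} -> {output_buffer[i]}")
    # Both buffers must have the same length and match element by element.
    return len(input_buffer) == len(output_buffer) and all(
        int(o) == int(i) + offset
        for i, o in zip(input_buffer, output_buffer)
    )
```

It returns True only when every output element equals its input element plus the offset.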
The example functionality (add 5 to a number) is intentionally kept simple. That lets us focus on the techniques.
Actual speed gain is possible for complex transformations, such as image processing.
Resizing an image from 3840x2160 to 1920x1080 using the OpenCV resize() function implemented in the FPGA on my Zynq takes 250 ms. That is 4 times faster than the same OpenCV resize() function running as software on the ARM (1 second).
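One way to time such a comparison is sketched below. This is an illustration using Python's time.perf_counter, not the measurement setup behind the numbers above; the helper names time_call and speedup are hypothetical.

```python
import time


def time_call(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(software_seconds, hardware_seconds):
    """How many times faster the hardware version is."""
    return software_seconds / hardware_seconds


# 1 second in software against 250 ms in the FPGA.
print(speedup(1.0, 0.250))  # prints 4.0
```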
What I learned by doing this tutorial is that the whole cycle has become more stable and integrated.
Vitis HLS and Vivado, version 2020.2, work well together. And PYNQ's examples with DMA now work reliably.