In the preceding posts, we had a quick look at what Zynq-7000 is (Path to Programmable Blog 1 - Getting Started), the workflow (Path to Programmable Blog 2 - Xilinx Tool Flow & Getting Started with Zynq-7000) and we configured a couple of PS peripherals and ran tests (Path to Programmable Blog 3 - PS Peripheral Configuration & TCL).
Now comes the important part: making the PL talk to the PS & DRAM, which will be used in probably every design that targets Zynq.
HW Chapter 6 video: Merging the PS & PL
There are two types of interfaces between the PS & PL:
- Functional interfaces which include AXI interconnect, EMIO, interrupts, DMA flow control, clocks, and debug interfaces. IP blocks in the PL can connect to these.
- Configuration signals which include the processor configuration access port, configuration status, single event upset & Program/Done/Init. These signals are connected to fixed logic within the PL configuration block, providing PS control.
The functional interfaces that let us transfer data are the 4 General Purpose AXI ports, the 4 High Performance AXI ports & the Accelerator Coherency Port.
- The 4 General Purpose AXI are of 2 types:
- 2x M_AXI_GPx - where the PS is the Master & PL is the Slave
- 2x S_AXI_GPx - where the PL is the Master & PS is the Slave
- This allows either the PS or the PL to be the initiator (Master), depending on the use case.
- The PL can access the PS IOP and PS Slaves using the 2 Slave GP Ports (S_AXI_GPx) & although memory access is possible, it's slow.
- The PS can access designs in the PL using the 2 Master GP Ports (M_AXI_GPx).
- The 4x S_AXI_HPx ports allow the PL to directly access the PS OCM & Memory Controller with very low latency & have FIFOs built into the interface for streaming.
- The S_AXI_ACP (Accelerator Coherency Port) can access all PS memory & peripherals, and has very low latency since it is connected to the Snoop Control Unit, which is one hop away from the L1 & L2 caches.
All of these interfaces are based on the AMBA AXI 3.0 protocol, but it seems that Xilinx IP transparently handles the conversion and exposes AXI 4.0 to the user.
The 3 types of AXI 4.0 interfaces that are available to the user are AXI4 (for high performance, memory-mapped), AXI4-Lite (simple, low throughput, e.g. control registers) & AXI4-Stream.
The interconnect is complex (and very interesting). You can find more information over here:
Lab 5 - Adding a PL Peripheral
This lab involved adding a peripheral to the PL (Block RAM) and connecting it to the PS via the AXI Interconnect.
Picking up from where we left off, we add the AXI BRAM Controller IP:
After making a couple of changes to the IP (bus width etc.), run Block Automation, which will automatically add a Block Memory Generator.
The PS doesn't have an AXI Master Port enabled yet, so edit the PS7: enable M_AXI_GP0 and enable FCLK_CLK0 (50 MHz). Run Connection Automation once more:
Vivado automatically adds the AXI Interconnect Block, the PS Reset and the Designer Assistant makes connections between the BRAM Controller, AXI Interconnect & PS7. It also wires up the Clock & Reset.
The Address Editor tab shows us the address to which the BRAM Controller has been mapped.
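To make the Address Editor entry a bit more concrete, here is a minimal C sketch of how a base address plus a range turns into an address window, and how to check that the window sits inside the part of the PS address map that is routed to M_AXI_GP0 (0x4000_0000 to 0x7FFF_FFFF on Zynq-7000). The type and function names are my own, not from Vivado or the training material.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PS address window routed to M_AXI_GP0 on Zynq-7000. */
#define GP0_WINDOW_BASE 0x40000000u
#define GP0_WINDOW_HIGH 0x7FFFFFFFu

/* One Address Editor row (hypothetical type): base address plus range in bytes. */
typedef struct {
    uint32_t base;   /* offset address */
    uint32_t range;  /* size in bytes */
} addr_segment_t;

/* Last byte address covered by the segment. */
uint32_t segment_high(addr_segment_t seg)
{
    assert(seg.range != 0);
    return seg.base + (seg.range - 1u);
}

/* True if addr falls inside the segment. */
bool segment_contains(addr_segment_t seg, uint32_t addr)
{
    return addr >= seg.base && addr <= segment_high(seg);
}

/* True if the whole segment is reachable through M_AXI_GP0. */
bool segment_in_gp0(addr_segment_t seg)
{
    uint32_t high = segment_high(seg);
    return high >= seg.base            /* reject 32-bit wraparound */
        && seg.base >= GP0_WINDOW_BASE
        && high <= GP0_WINDOW_HIGH;
}
```

As an illustration only (these are assumed values, not the ones from my design), a segment with base 0x40000000 and a 64K range would end at 0x4000FFFF and pass the `segment_in_gp0` check.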
Here's what the 'high level' schematic looks like. The BRAM Generator is 2nd from the left, followed by the AXI BRAM Controller, AXI Interconnect & Zynq7.
Since this is an implemented design, Vivado lets you look at what's in each of those blocks. The BRAM Generator has a couple of Flip-Flops & LUTs which eventually connect to a RAMB36E1, which is the primitive for a "36K-bit Configurable Synchronous Block RAM", or Block RAM. The output width is 32 bits, and since we had set the width of the BRAM Generator to 64 bits, 2 of these are connected in parallel.
I tried tracing the path of the databus from the BRAM to the AXI Interface. This involved opening up the lower levels of the components, which displays the actual primitives that the design is mapped to in hardware. It also exposes many internal datapaths, and after expanding the cone a couple of times, Vivado was already displaying over 10000 Nets. For reference, the image on the right is a zoomed in version of the right of the highlighted section on the left. Thanks to Block Automation, we do not need to wire all this up manually!
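The "2 in parallel" figure is just a ceiling division of the requested width by the width each primitive contributes. A tiny C sketch, assuming 32 data bits per RAMB36E1 as observed in this design (the function name is mine, and other primitive configurations would give different numbers):

```c
#include <assert.h>
#include <stdint.h>

/* Number of primitives placed side by side to reach total_width when
 * each contributes bits_per_primitive data bits (ceiling division). */
uint32_t primitives_in_parallel(uint32_t total_width, uint32_t bits_per_primitive)
{
    assert(bits_per_primitive != 0);
    return (total_width + bits_per_primitive - 1u) / bits_per_primitive;
}
```

With the values from this lab, `primitives_in_parallel(64, 32)` evaluates to 2.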
This is what the implemented design looks like:
We don't have much of a design in the PL (technically, only BRAM), but BRAM Controller & Interconnect use up some of the programmable logic.
HW Chapter 7 video: Zynq PS DMA Controller
Now that the BRAM in the PL is connected to the PS, it's time to consider how it'll be used. Since it's been mapped to a memory address, the simplest way would be to use pointers to copy data between that address and an array. However, this isn't the best solution when it comes to performance, since the data would need to go from the PL to the Central Interconnect via the Slave GP port, then to the On-Chip Memory and L2 Cache before making it to the DRAM. The CPU processes each transfer, so it gets held up as well. However, if you make use of DMA, not only is the CPU free to continue executing, but the data path is shorter since it bypasses the cache.
The DMA controller itself is complex: transfers are controlled by the DMA instruction execution engine, which has its own instruction set. It supports up to 8 channels, each of which has its own thread, and uses round robin arbitration to ensure that all channels have equal priority.
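For reference, the pointer-based approach boils down to something like the sketch below. This is my own minimal version, not the code from the training material. On the board, the source pointer would be the BRAM base address from the Address Editor cast to a `volatile uint32_t *`; that address is design-specific, so it is left out here and ordinary arrays can stand in for it when trying this on a PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * CPU-driven copy: the processor issues every load and store itself,
 * which is why it is "held up" for the whole transfer.
 * volatile stops the compiler from eliding or merging accesses, which
 * matters when src or dst points at a memory-mapped peripheral.
 */
void cpu_copy_words(volatile uint32_t *dst,
                    const volatile uint32_t *src,
                    size_t n_words)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n_words; i++) {
        dst[i] = src[i];   /* one read and one write per 32-bit word */
    }
}
```

Every word passes through the CPU here, which is exactly the overhead that DMA removes.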
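Round robin arbitration is easy to picture in software. The sketch below is a generic illustration of the idea in C, not a model of how the DMA controller implements it internally; the function name and the ready-mask representation are my own assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CHANNELS 8   /* the PS DMA controller supports up to 8 channels */

/*
 * Pick the next channel to service.
 * ready_mask: bit i set means channel i has work pending.
 * last:       channel serviced most recently (0..NUM_CHANNELS-1).
 * Returns the chosen channel, or -1 if no channel is ready.
 * The search starts just after 'last', so no channel can be starved.
 */
int round_robin_next(uint8_t ready_mask, int last)
{
    assert(last >= 0 && last < NUM_CHANNELS);
    for (int step = 1; step <= NUM_CHANNELS; step++) {
        int ch = (last + step) % NUM_CHANNELS;
        if (ready_mask & (1u << ch)) {
            return ch;
        }
    }
    return -1;
}
```

If channels 0, 2 and 5 are ready (mask 0x25) and channel 0 was serviced last, the next picks are 2, then 5, then back to 0.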
As usual, the Zynq-7000 TRM contains details like the instruction microcode, DMA initialization, interrupts etc.
PL based DMA controllers can also be added in the form of IP, and these can make use of the AXI_HP ports to interface directly with the DRAM Controller.
Lab 6 - Improving Data flow between PL and PS utilizing PS DMA
Continuing from Lab 5, we export the project in Vivado & launch Xilinx SDK.
Create a new BSP and note that the 'system.hdf' file lists the bram & AXI interfaces that were added to the design in Vivado.
Next, we import the 'dma_test.c' file that was provided with the training material and add it to the source of a new application in SDK, which gets built automatically.
Since we've got a design (BRAM, AXI Interconnect etc.) that needs to be mapped to the FPGA fabric (PL), we need to program the PL, which is done using the bitstream.
We can do this manually, or tick the checkbox in the debug/run configuration that does it automatically.
After this, open up a Serial terminal and click 'run'.
The 'dma_test.c' file that we imported had code that gives the user the option of executing different types of transfers (BRAM to BRAM, DRAM to DDR3 or DDR3 to DDR3) of varying sizes.
After initializing the hardware, it first performs the transfer using the CPU, and repeats the process using DMA. The clock cycles taken are logged and printed out.
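The measurement itself comes down to reading a counter before and after each transfer and comparing the two deltas. Here is a minimal C sketch of that arithmetic, assuming a free-running 32-bit up-counter. The helper names are mine, and how 'dma_test.c' actually reads its timer is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Elapsed ticks between two reads of a free-running 32-bit up-counter.
 * Unsigned subtraction is modulo 2^32, so the result is still correct
 * if the counter wrapped once between the two reads.
 */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/*
 * Speedup of the DMA transfer over the CPU copy, in tenths
 * (e.g. 35 means 3.5x), using integer math only.
 */
uint32_t speedup_x10(uint32_t cpu_ticks, uint32_t dma_ticks)
{
    assert(dma_ticks != 0);
    return (uint32_t)(((uint64_t)cpu_ticks * 10u) / dma_ticks);
}
```

The tick counts you feed in would come from whatever timer the application uses; the numbers from my run are in the results that follow.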
Here's the part of the code that initializes the CPU & DMA transfers and this is what the results look like:
Unsurprisingly, DMA is a lot quicker.
There's no doubt that Zynq peripherals are complex & a little complicated to understand and work with at first, but Xilinx provides drivers to make things easier.