Blog #5: CNN HW Accelerator for Handwriting Recognition - Integrating the HW Accelerator as a PYNQ Overlay

17 Nov 2024

The PYNQ (Python Productivity for Zynq) Framework

PYNQ overlays use a combination of .bit, .hwh, and optionally .tcl files to reconfigure the programmable logic (PL) on a Xilinx Zynq SoC and provide a seamless interface for Python-based interaction with the hardware. Here's a detailed breakdown of how each file plays a role:

1. `.bit` File: The Bitstream

Purpose:
The .bit file contains the binary configuration data for programming the FPGA (PL). It is generated during the hardware design flow in Vivado and defines the placement and routing of logic elements, interconnects, and other resources in the PL.
Role in PYNQ:
When an overlay is loaded in PYNQ, the .bit file is used to program the hardware design into the PL:

from pynq import Overlay
overlay = Overlay("/path/to/your/overlay.bit")
- This step reconfigures the PL to match the desired hardware architecture.
- The .bit file defines the physical behavior of the hardware but does not expose its structure to Python directly.

2. `.hwh` File: Hardware Handoff

Purpose:
The .hwh file is an XML metadata file generated alongside the .bit file by Vivado. It describes the hardware design's internal structure, including:
- AXI interfaces.
- Address maps.
- Clock connections.
- Configuration registers of hardware IPs.
Role in PYNQ:
- PYNQ uses the .hwh file to parse the hardware's architecture and automatically map the hardware IPs to Python objects.
- For example, when you access overlay.axi_dma, PYNQ knows how to interact with the AXI DMA IP block because the .hwh file provides its address map and interface details.
- Without the .hwh file, you would need to manually provide these details, making the design less user-friendly.
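
To see what PYNQ extracted from the .hwh file, you can inspect the overlay's ip_dict, which maps each IP name to its metadata. The helper below is only a minimal sketch: summarize_ip_dict is a hypothetical name, and the sample dictionary (IP names, addresses, type strings) is made up for illustration so the code runs without a board.

```python
def summarize_ip_dict(ip_dict):
    """Return one line per IP: name, base address, address range, type.

    Expects the mapping PYNQ exposes as overlay.ip_dict, where each entry
    carries 'phys_addr', 'addr_range' and 'type' keys.
    """
    lines = []
    for name in sorted(ip_dict):
        info = ip_dict[name]
        lines.append(
            f"{name}: base=0x{info['phys_addr']:08X} "
            f"range=0x{info['addr_range']:X} type={info.get('type', '?')}"
        )
    return lines


# Illustrative entries only; real values come from your own block design.
sample_ip_dict = {
    "axi_dma_0": {"phys_addr": 0x40400000, "addr_range": 0x10000,
                  "type": "xilinx.com:ip:axi_dma:7.1"},
    "cnn_0": {"phys_addr": 0x43C00000, "addr_range": 0x10000,
              "type": "user:hls:cnn:1.0"},
}

for line in summarize_ip_dict(sample_ip_dict):
    print(line)
```

On the board you would call summarize_ip_dict(overlay.ip_dict) after loading the overlay.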

3. `.tcl` File: Tcl Script (Optional)

Purpose:
The .tcl file is a Vivado script that describes how the hardware design was constructed. It includes:
- The IP block configurations.
- Connections and parameters.
- Design constraints.
Role in PYNQ:
- The .tcl file is optional and not directly used by PYNQ at runtime. However, it is helpful during the development or debugging process for regenerating or modifying the Vivado project.
- It can also be used for advanced workflows, such as modifying an overlay dynamically or automating FPGA design customization.

How These Files Work Together:

Development in Vivado:
- You design the hardware, integrate IP cores, configure interfaces, and generate the .bit and .hwh files.
- Optionally, you generate the .tcl file for reproducibility or further customization.
Overlay Creation:
- The .bit and .hwh files are packaged together as an overlay for use in PYNQ.
Loading the Overlay:
- The .bit file configures the PL with the desired design.
- The .hwh file allows PYNQ to understand the design, automatically map hardware IPs, and provide a Python API for interacting with them.
Python Integration:
- PYNQ abstracts the hardware details using the .hwh metadata. Users can control the hardware from Python without needing low-level details:
  
  dma = overlay.axi_dma  # Access the AXI DMA block
  dma.sendchannel.transfer(input_buffer)  # Send a buffer allocated with pynq.allocate through the DMA
- The .hwh file enables these high-level Python APIs to interact seamlessly with the hardware.

Why is This Powerful?

Dynamic Reconfiguration:
Overlays can be swapped dynamically by loading different .bit files during runtime, making PYNQ versatile for multiple applications without rebooting.
Ease of Use:
The .hwh file removes the need for manual address mapping and register configuration, providing Python libraries that abstract complex FPGA interactions.
Rapid Prototyping:
By combining these files with Python, PYNQ enables hardware acceleration for complex applications with minimal overhead in software and hardware integration.

This approach streamlines FPGA development, making it accessible even for software developers unfamiliar with low-level FPGA programming.

The bitstream, the hardware handoff file, and the Tcl file for the block design are placed in a newly created directory inside the overlays directory.

How is the overlay programmed?

Communication with FPGA:
- The Zynq SoC includes an ARM processor (PS) that communicates with the FPGA fabric (PL) via a programming interface.
- The PYNQ framework uses the Xilinx FPGA Manager (or a similar driver) to send the .bit file from the Linux filesystem to the FPGA's configuration memory.
Configuration Memory Programming:
- The .bit file is streamed to the FPGA configuration memory through the Processor Configuration Access Port (PCAP).
- This process resets the PL and configures it with the new hardware design.
Verification and Completion:
- After programming, the FPGA reports a "done" signal to confirm successful configuration.
- If an .hwh file is present, it is parsed to map the hardware design's interfaces to Python objects.

The process is dynamic, meaning overlays can be swapped at runtime without rebooting the system.
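
PYNQ takes care of all of this internally, but if you are curious you can peek at the state reported by the Linux FPGA Manager through sysfs. This is a small sketch, assuming the manager is registered as fpga0 (the index may differ on your image); read_fpga_state is a hypothetical helper name.

```python
from pathlib import Path

# Standard Linux FPGA Manager sysfs node; the fpga0 index is an assumption.
FPGA_STATE_PATH = "/sys/class/fpga_manager/fpga0/state"


def read_fpga_state(path=FPGA_STATE_PATH):
    """Return the FPGA Manager state string, or None if the node is absent."""
    node = Path(path)
    if not node.exists():
        return None  # not running on a board, or the driver is not loaded
    return node.read_text().strip()
```

After a successful configuration the state typically reads "operating"; on a machine without an FPGA Manager the helper simply returns None.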

Given below is the snippet that programs the custom overlay into the PL of the Zynq.

1. Loading the FPGA Bitstream

from pynq import Overlay

overlay = Overlay("cnn.bit")
print("Overlay successfully loaded!")

Overlay("cnn.bit"): Loads the specified bitstream file (cnn.bit) onto the FPGA. This file contains the hardware design for your CNN accelerator.
overlay: Represents the loaded bitstream and provides access to its components (DMA engines, MMIO registers, etc.).
The print statement confirms that the overlay was loaded successfully.

2. Checking Overlay Status

if overlay.is_loaded():
    print("Overlay is active.")
else:
    print("Overlay failed to load.")

overlay.is_loaded(): Checks whether this overlay's bitstream is the one currently loaded on the FPGA. Note that is_loaded is a method and must be called; the bare attribute overlay.is_loaded is always truthy.
If the bitstream is loaded and active, it prints confirmation. Otherwise, it indicates failure.

3. Error Handling

except FileNotFoundError:
    print("Bitstream file not found.")
except Exception as e:
    print(f"Error loading overlay: {e}")

FileNotFoundError: Triggers if the specified bitstream file (cnn.bit) is not found in the working directory or specified path.
Exception: Catches any other errors that may occur during the bitstream loading process and prints the error details.
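
Put together, the three pieces above could look like the function below. This is a sketch rather than the exact script used here: load_overlay and its overlay_cls parameter are hypothetical, and the parameter exists only so the function can be tried off-board with a stand-in class.

```python
def load_overlay(bitfile, overlay_cls=None):
    """Load a bitstream and return the overlay, or None on failure."""
    if overlay_cls is None:
        # Imported lazily so the sketch can also be exercised without PYNQ.
        from pynq import Overlay
        overlay_cls = Overlay
    try:
        overlay = overlay_cls(bitfile)
    except FileNotFoundError:
        print(f"Bitstream file not found: {bitfile}")
        return None
    except Exception as e:
        print(f"Error loading overlay: {e}")
        return None
    print("Overlay successfully loaded!")
    return overlay
```

On the board, overlay = load_overlay("cnn.bit") is enough; the caller then only has to check for None.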

Next, the same array used in the Vitis IDE/Xilinx SDK to test the CNN HW accelerator is added as a Python list.

Using DMA to transfer the data into our custom IP

1. Input Buffer Allocation

import numpy as np
from pynq import allocate

input_buffer = allocate(shape=sampleImage.shape, dtype=np.uint8)
np.copyto(input_buffer, sampleImage)

allocate: Allocates a physically contiguous buffer in memory to be used for DMA transfer. This matters because the DMA engine works on physical addresses and, in simple (non-scatter-gather) mode, needs the buffer to be physically contiguous.
np.copyto: Copies the data from the preprocessed image (sampleImage) into the allocated buffer.
The buffer's shape and data type (uint8) match the input data.

Note: sampleImage should already be resized and preprocessed (e.g., the 28x28 MNIST image).

2. DMA Transfer to the Accelerator

dma = overlay.axi_dma_0
dma.sendchannel.transfer(input_buffer)
dma.sendchannel.wait()
print("DMA transfer complete.")

overlay.axi_dma_0: Refers to the DMA instance configured in the FPGA bitstream (cnn.bit).
dma.sendchannel.transfer(input_buffer): Initiates the transfer of data from the allocated input buffer to the hardware accelerator via the DMA engine.
dma.sendchannel.wait(): Waits for the DMA transfer to complete before proceeding.

3. Freeing the Input Buffer

input_buffer.freebuffer()

After the transfer is complete, the input buffer is freed to release the allocated memory.
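
One way to make sure the buffer is always released, even if the transfer raises an error, is to wrap steps 2 and 3 in a small helper. This is a sketch under that assumption; send_to_accelerator is a hypothetical name and not part of PYNQ.

```python
def send_to_accelerator(dma, buffer):
    """Stream one buffer through the DMA send channel, then release it."""
    try:
        dma.sendchannel.transfer(buffer)  # start the memory-to-stream transfer
        dma.sendchannel.wait()            # block until the DMA reports completion
    finally:
        buffer.freebuffer()               # release the contiguous memory even on error


# On the board (not executed here):
#   send_to_accelerator(overlay.axi_dma_0, input_buffer)
```

The helper only relies on the sendchannel.transfer, sendchannel.wait, and freebuffer calls already used above, so it can be dropped in without changing the rest of the flow.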

4. Output Data Retrieval

import time

time.sleep(1)  # Small delay to ensure processing completion
offset = 0x8
value = mmio.read(offset)
print("Recognized Digit:", value)

time.sleep(1): Introduces a delay to allow the accelerator sufficient time to process the input data.
offset = 0x8: Specifies the register offset for reading the output of the accelerator. This value depends on your hardware's register map.
mmio.read(offset): Reads the result (recognized digit) from the specified register. Here mmio is a pynq.MMIO object created as described in the next section.
value: Contains the recognized digit output by the accelerator.
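
A fixed one-second sleep works, but if your IP exposes a status register with a "done" flag, polling it is usually faster and more robust. The design in this blog does not document such a register, so everything specific here is an assumption: wait_for_bit is a hypothetical helper, and the status offset and mask in the usage comment are placeholders you would replace with values from your own register map.

```python
import time


def wait_for_bit(read_reg, offset, mask, timeout_s=1.0, poll_s=0.001):
    """Poll a register until (value & mask) is non-zero or the timeout expires.

    read_reg is any callable taking a byte offset, e.g. mmio.read.
    Returns True if the bit was seen, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if read_reg(offset) & mask:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# On the board (placeholder offset and mask, not from the actual design):
#   if wait_for_bit(mmio.read, 0x0, 0x1):
#       print("Recognized Digit:", mmio.read(0x8))
```

Using time.monotonic for the deadline keeps the timeout correct even if the system clock is adjusted while waiting.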

Reading AXI Memory-Mapped Registers (MMRs) in PYNQ

To read an MMIO (Memory-Mapped I/O) register in PYNQ, you can use the pynq.MMIO class. Here's a step-by-step guide:

1. Import the Required Module

from pynq import MMIO

2. Identify the MMIO Address and Size

You need the base address and the size of the MMIO region. These are typically provided in the hardware specification (e.g., from a block design in Vivado).

3. Create an MMIO Instance

Initialize an MMIO object with the base address and size:

mmio = MMIO(base_addr, size)

Replace base_addr with the starting physical address of the MMIO region.
Replace size with the size of the MMIO region in bytes.

4. Read from an MMIO Register

To read a value from a specific offset within the MMIO region:

value = mmio.read(offset)

Replace offset with the register's offset (in bytes) from the base address.
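
Instead of copying the base address and size from Vivado by hand, you can take them from the metadata PYNQ already parsed. The following is a sketch only: open_ip_mmio and its mmio_cls parameter are hypothetical (the parameter just allows trying the function without a board), and the IP name "cnn_0" in the usage comment is a placeholder for whatever your block design calls the accelerator.

```python
def open_ip_mmio(ip_dict, ip_name, mmio_cls=None):
    """Create an MMIO window for one IP using the addresses PYNQ parsed."""
    if mmio_cls is None:
        # Imported lazily so the sketch can also be exercised without PYNQ.
        from pynq import MMIO
        mmio_cls = MMIO
    info = ip_dict[ip_name]  # raises KeyError if the name is not in the design
    return mmio_cls(info["phys_addr"], info["addr_range"])


# On the board (placeholder IP name):
#   mmio = open_ip_mmio(overlay.ip_dict, "cnn_0")
#   value = mmio.read(0x8)
```

This keeps the Python side in sync with the hardware if the address map changes in a later build.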

Give the CNN accelerator some time to complete processing before reading the result.

Output

This output matches the expected result (see the previous blog, where the same operation was performed in C).