I had someone email me with this question:
As I experiment with reconfigurable RISC-V cores, is there tensor acceleration for LLM and diffusion-network AI models on the FPGA fabric?
Hi,
Although it's technically possible for someone to implement this, I don't think it would be very efficient to do it in the FPGA cells. There are better options. One would be to use an AMD Zynq-type part (i.e., one with a processing subsystem) and attach an accelerator module, using, say, AMD PCIe IP: https://www.xilinx.com/products/technology/pci-express.html . An example is Google's accelerator module, though actually obtaining one could be a challenge (due to the semiconductor shortage). I've been waiting four months just to get hold of a small quantity (3) of the Google accelerators, with no promised date. Maybe it's easier to obtain if you need large volumes, but I don't know :(
I'll divide my answer into two parts: one covering the Avnet solution, and one covering another manufacturer.
First, the other manufacturer. I assume you have an existing RISC-V core that you want to implement in the FPGA as a soft-core processor.
For real tensor acceleration you can try NVIDIA, whose data-center GPUs offer up to 80 GB of HBM memory with over 2 TB/s of memory bandwidth. Not only that, they have a dedicated Transformer Engine, and the XLA ML compiler improves performance and allows larger batch sizes. NVIDIA GPUs are well supported by frameworks such as JAX for LLM workloads.
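To see why that memory bandwidth matters so much for LLMs, here is a rough back-of-envelope sketch. During autoregressive decoding, every generated token has to stream essentially all the model weights from memory, so bandwidth sets a hard ceiling on tokens per second. The 2 TB/s and 7B-parameter figures below are illustrative assumptions, not measurements of any particular GPU:

```python
# Rough estimate: LLM decoding is usually memory-bandwidth bound, because
# each generated token streams all model weights from memory.
# All figures below are illustrative assumptions.

bandwidth_bytes_per_s = 2.0e12      # assumed ~2 TB/s HBM bandwidth
params = 7e9                        # assumed 7B-parameter model
bytes_per_param = 2                 # fp16/bf16 weights

model_bytes = params * bytes_per_param                      # ~14 GB of weights
tokens_per_s_upper_bound = bandwidth_bytes_per_s / model_bytes

print(f"Model size: {model_bytes / 1e9:.0f} GB")
print(f"Bandwidth-bound decode ceiling: ~{tokens_per_s_upper_bound:.0f} tokens/s")
```

The same arithmetic explains why FPGA-fabric tensor units struggle for big LLMs: without HBM-class bandwidth, the ceiling drops by orders of magnitude regardless of how many MACs you instantiate.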
Now, coming to the Avnet side: boards based on the AMD Zynq UltraScale+ MPSoC.
The FPGA fabric has nothing like NVIDIA's Hopper architecture, but Zynq UltraScale+ MPSoC devices do include a capable GPU. For example, I use the Ultra96-V2, whose GPU is based on the Arm Mali-400 MP2 and has three processors: one geometry processor (GP) and two pixel processors (PP). It supports OpenGL ES 1.1 and 2.0 and OpenVG 1.1, and has a SIMD engine with 4-way 32-bit simultaneous instruction execution and a vertex-loader DMA unit.
The Zynq UltraScale+ MPSoC can also host a Deep Learning Processing Unit (DPU) in the FPGA fabric! You have to instantiate it as an IP block.
You can use Vivado to build the programmable logic (PL) design, and PetaLinux to build the software that drives the DPU in the FPGA fabric. The DPU works in conjunction with the Zynq UltraScale+ MPSoC as a co-processor: it sits as a slave in the PL and is controlled by the Arm Cortex-A53 cores in the Processing System (PS). The Ultra96-V2 has a quad-core Cortex-A53 (XCZU3EG device), but I would advise something more powerful, such as the ZCU102 or ZCU104, another Zynq UltraScale+ MPSoC board, or even a Zynq-7000 SoC board like the ZedBoard, instead of the Ultra96-V2.
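To make the "slave co-processor controlled by the PS" idea concrete, here is a toy Python model of the PS/PL handshake: the A53-side driver writes input and a start bit into memory-mapped registers, polls a done bit, and reads back the result. The register offsets, bit meanings, and the stand-in computation are all invented for illustration; the real DPU register interface is different and is normally hidden behind the vendor runtime:

```python
# Toy model of a PS/PL handshake: the Cortex-A53 (master) controls a
# slave accelerator in the PL through memory-mapped registers.
# Offsets and semantics below are invented for illustration only.

class SlaveCoprocessor:
    """Pretend PL accelerator exposing CTRL/STATUS/DATA registers over AXI."""
    CTRL, STATUS, DATA_IN, DATA_OUT = 0x00, 0x04, 0x08, 0x0C

    def __init__(self):
        self.regs = {self.CTRL: 0, self.STATUS: 0,
                     self.DATA_IN: 0, self.DATA_OUT: 0}

    def write(self, offset, value):
        self.regs[offset] = value
        if offset == self.CTRL and value & 1:       # start bit set
            # "Accelerator" does its work; a doubling is a stand-in here.
            self.regs[self.DATA_OUT] = self.regs[self.DATA_IN] * 2
            self.regs[self.STATUS] = 1              # raise the done bit

    def read(self, offset):
        return self.regs[offset]

def run_job(dpu, value):
    """What a PS-side driver does: write input, start, poll done, read result."""
    dpu.write(dpu.DATA_IN, value)
    dpu.write(dpu.CTRL, 1)                          # kick off the job
    while dpu.read(dpu.STATUS) != 1:                # poll the done bit
        pass
    return dpu.read(dpu.DATA_OUT)

dpu = SlaveCoprocessor()
print(run_job(dpu, 21))   # prints 42
```

In a real design the polling loop would usually be replaced by an interrupt from the PL to the PS, but the master/slave division of labor is the same.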
With this Zynq UltraScale+ MPSoC setup, the DPU connects to the PS over the AXI interconnect, and you can run deep-learning tasks like image classification, object detection, and semantic segmentation.
The DPU toolchain for the Zynq UltraScale+ MPSoC includes the Deep Compression Tool (DECENT), the Deep Neural Network Compiler (DNNC), the Neural Network Runtime (N2Cube), and the DPU Profiler.
You will have to check whether you actually need something like a dedicated Transformer Engine, and decide what you want to achieve and with which application.
With the Zynq UltraScale+ MPSoC you can do projects like autonomous driving (e.g., steering-wheel control) and cancer research. In cancer research, detection with Gaussian filtering has been done; I don't know for certain, but you may be able to do diffusion with Gaussian, Poisson, and speckle noise at various variances; double precision; the edge-detection methods of Sobel, Prewitt, and Roberts; dehazing algorithms; pixel intensities; vector quantization; k-means clustering; fuzzy logic; and morphological segmentation.
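To give a flavor of the image-processing kernels listed above, here is a minimal pure-Python Sobel edge detector. On a Zynq you would implement this in the PL or run it through the DPU toolchain; this sketch just shows the math on a tiny synthetic image with one vertical edge:

```python
# Minimal Sobel edge detector in plain Python, illustrating the kind of
# kernel mentioned above. A real Zynq design would run this in the PL;
# this sketch just shows the arithmetic.

GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal-gradient kernel
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical-gradient kernel

def sobel_magnitude(img):
    """Gradient magnitude at the interior pixels of a 2D grayscale image."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(GX[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(GY[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

# 5x5 test image with a vertical edge between columns 1 and 2.
img = [[0, 0, 1, 1, 1] for _ in range(5)]
mag = sobel_magnitude(img)
print(mag[2])   # strong response on the edge columns, zero in flat regions
```

Prewitt and Roberts operators are the same pattern with different (smaller) kernels, so the hardware pipeline you'd build for one largely serves the others.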
Abhishek Bansal