The New DSPFP32 Primitive in Versal FPGAs

19 Dec 2023

The DSP primitive in the latest Versal FPGA family is called DSP58. It already brings a number of improvements over the latest DSP48 flavors, mainly an increase from a 27x18 signed multiplier and a 48-bit post-adder to 27x24 and 58 bits. On top of that, the DSP58 has two more operating modes, called DSPCPLX and DSPFP32. The latter, a hardened floating point adder and multiplier, is the subject of this post.

The DSPFP32 includes a single precision floating point adder and multiplier. They can be used either independently or combined as a multiply-accumulate operation.

The following diagram shows the internal architecture of the DSPFP32:

The DSPFP32 is somewhat similar to the DSP58. The real differences, apart from using single precision floating point instead of fixed point, are that there are now two outputs, FPA and FPM, instead of just the post-adder P port, and that there is no pre-adder. This diagram shows the FP32 adder and multiplier used independently, and the color highlighting indicates the minimum amount of pipelining required to achieve the maximum possible speed of 805MHz. You basically get a latency 2 FP32 adder and a latency 3 multiplier in every DSP58.

The signs of both adder input operands can optionally be inverted, and there is a wide selection of sources for these operands: the ZERO, C, D and PCIN inputs, as well as the FPA output itself, which can be used to build accumulators. The PCIN/PCOUT cascade chain lets you cascade multiple DSPFP32 adders and build sums of more than two terms. If you connect the FPA output externally to the B input using fabric routing, you can compute something like FPM=A*(C+D) with a latency of 5 clocks (2 for the adder plus 3 for the multiplier).
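To make the arithmetic concrete, here is a minimal behavioral sketch in Python of the independent adder and multiplier and of the FPM=A*(C+D) arrangement. It is only a numerical model, not HDL and not the primitive's real interface: the function names (`f32`, `fp_add`, `fp_mul`, `mul_of_sum`) and the `neg_x`/`neg_y` flags are hypothetical, and round-to-nearest-even with no special handling of overflow or denormals is an assumption.

```python
import struct

ADD_LATENCY = 2  # FP32 adder latency quoted in the post
MUL_LATENCY = 3  # FP32 multiplier latency quoted in the post


def f32(x: float) -> float:
    """Round a Python float (double) to the nearest single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fp_add(x: float, y: float, neg_x: bool = False, neg_y: bool = False) -> float:
    """Model of the FP32 adder with optional sign inversion of each operand."""
    x = -f32(x) if neg_x else f32(x)
    y = -f32(y) if neg_y else f32(y)
    return f32(x + y)


def fp_mul(a: float, b: float) -> float:
    """Model of the FP32 multiplier."""
    return f32(f32(a) * f32(b))


def mul_of_sum(a: float, c: float, d: float):
    """FPM = A*(C+D): adder output fed to the multiplier through the fabric.

    Returns the result and the total latency in clocks.
    """
    s = fp_add(c, d)
    return fp_mul(a, s), ADD_LATENCY + MUL_LATENCY


result, latency = mul_of_sum(2.0, 1.5, 2.5)
print(result, latency)
```

The sign inversion flags are what turn the same adder into a subtractor, for example `fp_add(1.0, 0.25, neg_y=True)` models 1.0 - 0.25.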

The second image shows the FP32 multiplier and adder connected internally as a MAC, so FPA=C+A*B or FPA=FPA+A*B can be computed with a latency of 4 clocks. The optional extra pipeline registers in the C and FPOPMODE input paths can be used to compensate for the extra latency of the multiplier path so that the entire MAC has a total latency of 4 clocks for all its data inputs.
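The accumulate form FPA=FPA+A*B can be sketched numerically as follows. This is a behavioral model in Python, not the primitive itself: `mac_accumulate` is a hypothetical name, and the model assumes the product is rounded to single precision before it reaches the adder, which the post does not state.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (double) to the nearest single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mac_accumulate(a_values, b_values, c: float = 0.0) -> float:
    """Model FPA = C + A*B on the first term, then FPA = FPA + A*B."""
    acc = f32(c)
    for a, b in zip(a_values, b_values):
        # Assumption: the multiplier output is rounded to FP32 before the add.
        product = f32(f32(a) * f32(b))
        acc = f32(acc + product)
    return acc


# Dot product of two short vectors, accumulated one term per clock.
print(mac_accumulate([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))

# Single precision has a 24-bit significand, so adding 1.0 to 2**24 is lost.
print(mac_accumulate([1.0], [1.0], c=16777216.0))
```

The second call is a reminder that a long running FP32 accumulator loses small terms once the sum grows large, which is a property of single precision rather than of this particular primitive.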

Although not shown in these diagrams, both FPA and FPM can be routed to the PCOUT port. By using the P cascade to borrow one multiplier from a neighboring DSP, you can also compute FPA=C+A1*B1+A2*B2 with four clocks of latency, so a full complex multiplier plus a complex adder can be built with 4 DSPFP32s and no other fabric resources.
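The count of four follows from the arithmetic: the real and imaginary parts of A*B+C are each a sum of two products plus one addend, so each part takes two multipliers. A minimal numerical sketch in Python is shown below. The names `sum_of_two_products` and `complex_mul_add` are hypothetical, the summation order (C + A1*B1) + A2*B2 is an assumption about how the cascade combines terms, and the subtraction in the real part is modeled with the adder's operand sign inversion.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (double) to the nearest single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sum_of_two_products(c, a1, b1, a2, b2, negate_second=False) -> float:
    """Model FPA = C + A1*B1 +/- A2*B2 using two FP32 multipliers."""
    p1 = f32(f32(a1) * f32(b1))
    p2 = f32(f32(a2) * f32(b2))
    if negate_second:
        p2 = -p2  # adder operand sign inversion
    # Assumed order: (C + A1*B1) first, then the cascaded second product.
    return f32(f32(f32(c) + p1) + p2)


def complex_mul_add(a: complex, b: complex, c: complex) -> complex:
    """A*B + C with four FP32 multipliers, two per output component."""
    re = sum_of_two_products(c.real, a.real, b.real, a.imag, b.imag,
                             negate_second=True)
    im = sum_of_two_products(c.imag, a.real, b.imag, a.imag, b.real)
    return complex(re, im)


print(complex_mul_add(1 + 2j, 3 + 4j, 0.5 - 1j))
```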

Floating point designs were always possible in earlier FPGA families, since Xilinx has provided fabric-based soft floating point IP for years. The hardened DSPFP32 now offers that option using a single DSP58 primitive and virtually no fabric resources, with much lower latency (3-4 clocks instead of 8-11), lower power consumption and clock speeds up to 805MHz in the fastest two speed grades.

In the third and final post of this short series on the new DSPFP32 I will discuss how this primitive can be instantiated and used efficiently in HDL designs.

javagoza 11 months ago

This post just increased my interest in the Versal family. Let's see if I have the opportunity to go deeper in a practical way in the future. Thank you, I follow your series of posts very carefully.
michaelkellett 11 months ago in reply to javagoza

Wish you luck. Cheapest available part on Digikey is £6600 for one. It has 1596 pins, so it's going to need quite a PCB to work.

Out of my range (by a long way).

MK
javagoza 11 months ago in reply to michaelkellett

I go through life as a naïve person. In 2021 I participated in the Adaptive Computing Challenge 2021 on Hackster.io. I did not get a Versal, but I did get a Xilinx Kria KV260 Vision AI Starter Kit. There was the possibility of getting a VCK5000 Versal Development Card (xilinx.com), and I could have competed for one in a field that I have mastered, AI applications for fraud control in financial transactions, since I work for a bank. But I prefer to do more fun things in my free time, so I applied for the Kria.
fpgaguru 11 months ago in reply to michaelkellett

I do not work in sales, so I am definitely not qualified to discuss prices, but historically speaking, all new FPGA families start with very high prices, and as they become mainstream the prices become more competitive.

Another point worth making: there are multiple sub-families within Versal, ranging from very large parts in the Prime, Premium and HBM series to smaller, lower cost parts in the AI RF, AI Core and AI Edge series. In particular, the AI Edge series targets low cost edge applications and contains really small devices like the VE2002, which I expect would be much more accessible.

The main idea is that starting with Versal, all FPGAs in all sub-families now have single precision floating point capabilities hardened in every DSP58.
michaelkellett 11 months ago in reply to fpgaguru

I can't find the VE2002 listed by a distributor, but the V2102 is (by Mouser) at £336 but no stock.

It's the second smallest in the AI Edge series (I think).

It has quite a lot of FP blocks (176).

But I think we have rather different perspectives on "low cost edge" applications. My current design project has a <£10 FPGA (with 0 FP blocks).

MK