The New DSPFP32 Primitive in Versal FPGAs
The DSP primitive in the latest Versal FPGA family is called DSP58. It already has a number of improvements over the latest DSP48 flavors, mainly a larger signed multiplier (27x24 instead of 27x18) and a wider post-adder (58 bits instead of 48). On top of that, the DSP58 has two more operating modes, called DSPCPLX and DSPFP32. The latter, a hardened floating point adder and multiplier, is the subject of this post.
The DSPFP32 includes a single precision floating point adder and multiplier. They can be used either independently or combined as a multiply-accumulate operation.
The following diagram shows the internal architecture of the DSPFP32:
The DSPFP32 is somewhat similar to the DSP58. The real differences, apart from using single precision floating point instead of fixed point, are that there are now two outputs, FPA and FPM, instead of just the post-adder P port, and that there is no pre-adder. This diagram shows the FP32 adder and multiplier used independently, and the color highlighting indicates the minimum amount of pipelining required to achieve the maximum possible speed of 805 MHz. You basically get a latency 2 FP32 adder and a latency 3 FP32 multiplier in every DSP58. The signs of both adder input operands can optionally be inverted, and there is a wide selection of sources for these operands: ZERO, the C, D and PCIN inputs, and the FPA output itself, which can be used to build accumulators. The PCIN/PCOUT cascade chain lets you cascade multiple DSPFP32 adders to build sums of more than two terms. If you connect the FPA output externally to the B input using fabric routing, you can compute something like FPM=A*(C+D) with a latency of 5 clocks (2 for the adder plus 3 for the multiplier).
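To make the latency arithmetic concrete, here is a minimal behavioral sketch in Python of the FPM=A*(C+D) arrangement. It is not HDL and not a model of the real primitive: the Delay class, the simulate function and the extra delay that keeps A aligned with the sum are all hypothetical names and modeling choices of this sketch. Only the latencies of 2 and 3 clocks come from the text above.

```python
import struct
from collections import deque

ADD_LATENCY = 2  # FP32 adder latency in clocks, as stated in the text
MUL_LATENCY = 3  # FP32 multiplier latency in clocks, as stated in the text

def f32(x):
    """Round a Python float to the nearest IEEE 754 single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

class Delay:
    """Fixed-latency pipeline: a value clocked in emerges `latency` clocks later."""
    def __init__(self, latency):
        self.regs = deque([None] * latency, maxlen=latency)

    def clock(self, value):
        out = self.regs[0]        # oldest entry leaves the pipeline
        self.regs.append(value)   # newest entry enters, oldest is dropped
        return out

def simulate(samples):
    """Feed one (a, c, d) sample per clock, return (clock, fpm) for each result."""
    adder = Delay(ADD_LATENCY)
    a_align = Delay(ADD_LATENCY)  # modeling assumption: A is delayed to match C+D
    mult = Delay(MUL_LATENCY)
    results = []
    # Append idle clocks so the last sample can drain out of the pipeline.
    stream = list(samples) + [None] * (ADD_LATENCY + MUL_LATENCY)
    for clk, s in enumerate(stream):
        total = None if s is None else f32(f32(s[1]) + f32(s[2]))
        fpa = adder.clock(total)
        a = a_align.clock(None if s is None else f32(s[0]))
        fpm = mult.clock(None if fpa is None else f32(a * fpa))
        if fpm is not None:
            results.append((clk, fpm))
    return results
```

Calling simulate([(1.5, 2.0, 0.25), (2.0, 1.0, 1.0)]) returns [(5, 3.375), (6, 4.0)]: the sample presented at clock 0 produces its result at clock 5, and after that one result comes out per clock.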
The second image shows the FP32 multiplier and adder connected internally as a MAC, so that FPA=C+A*B or FPA=FPA+A*B can be computed with a latency of 4 clocks. The optional extra pipeline registers in the C and FPOPMODE input paths can be used to compensate for the extra latency of the multiplier path, so that the entire MAC has the same total latency of 4 clocks for all its data inputs.
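The arithmetic of the accumulating form FPA=FPA+A*B can be sketched in a few lines of Python. This models only the numerical behavior, not the pipeline timing, and it assumes that the product and the sum are each rounded to single precision separately, which is a modeling assumption rather than something stated above. The names f32 and mac_accumulate are hypothetical.

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE 754 single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def mac_accumulate(a_values, b_values, start=0.0):
    """Accumulate FPA = FPA + A*B over two equal-length sequences.

    Assumption: the product and the sum are rounded to FP32 independently.
    """
    fpa = f32(start)
    for a, b in zip(a_values, b_values):
        product = f32(f32(a) * f32(b))  # multiplier result, rounded to FP32
        fpa = f32(fpa + product)        # adder result, rounded to FP32
    return fpa
```

For example, mac_accumulate([1.5, 2.0, -0.5], [2.0, 0.25, 4.0]) gives 1.5. Single precision has only 24 bits of mantissa, so small terms added to a large running sum can be lost entirely: mac_accumulate([16777216.0, 1.0], [1.0, 1.0]) stays at 16777216.0. Keep this in mind when building long accumulators.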
Although not shown in these diagrams, both FPA and FPM can be routed to the PCOUT port. By using the P cascade to borrow one multiplier from a neighboring DSP, you can also compute FPA=C+A1*B1+A2*B2 with four clocks of latency, so a full complex multiplier plus a complex adder can be built with 4 DSPFP32s and no other fabric resources.
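Here is why 4 DSPFP32s are enough. The real and imaginary parts of (ar+j*ai)*(br+j*bi)+(cr+j*ci) each have the form C+A1*B1+A2*B2, and each such expression uses two multipliers, hence two DSPFP32s. The Python sketch below models just the arithmetic. The function names are hypothetical, and the rounding after every operation and the order of the two additions are assumptions of this sketch, not a description of the hardware.

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE 754 single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot2_add(c, a1, b1, a2, b2):
    """Model of FPA = C + A1*B1 + A2*B2 (two multipliers, so two DSPFP32s).

    Assumptions: every operation rounds to FP32, and A1*B1 is added first.
    """
    p1 = f32(f32(a1) * f32(b1))
    p2 = f32(f32(a2) * f32(b2))
    return f32(f32(f32(c) + p1) + p2)

def complex_mac(a, b, c):
    """Compute a*b + c for complex values given as (real, imag) tuples."""
    ar, ai = a
    br, bi = b
    cr, ci = c
    # Real part: cr + ar*br - ai*bi. The subtraction is modeled by negating
    # one multiplier operand.
    re = dot2_add(cr, ar, br, -ai, bi)
    # Imaginary part: ci + ar*bi + ai*br.
    im = dot2_add(ci, ar, bi, ai, br)
    return re, im
```

As a quick check, complex_mac((1.0, 2.0), (3.0, 4.0), (0.5, -1.0)) returns (-4.5, 9.0), which matches (1+2j)*(3+4j)+(0.5-1j) computed by hand.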
Floating point designs were always possible in earlier FPGA families, and Xilinx has provided fabric-based soft floating point IP for years. The hardened DSPFP32 now offers that option using a single DSP58 primitive and virtually no fabric resources, with much lower latency (3-4 clocks instead of 8-11), lower power consumption, and clock speeds up to 805 MHz in the two fastest speed grades.
In the third and final post of this short series on the new DSPFP32 I will discuss how this primitive can be instantiated and used efficiently in HDL designs.