The Art of FPGA Design - Post 29

5 Jun 2019

The DSP48 Primitive - Complex multipliers

Traditionally a complex multiplication can be decomposed into four real multiplications and two additions:

x+i·y=(a+i·b)(c+i·s)=(a·c−b·s)+i·(a·s+b·c) where i=√−1

This maps well into four DSP48s, including the two additions, which can use the post-adders and the DSP48 P cascades. The latency of a fully pipelined DSP48 implementation is four clocks. For intensive signal processing applications, where the DSP48s are a scarce resource, it is important to use them as efficiently as possible so the ability to compute a complex multiply with less than four DSP48s is significant.

There is an algorithm, initially developed for high precision real multiplications by Anatoly Karatsuba in the early 1960s, but which also works for complex multiplications, which only requires three real multiplications. The basic idea is to compute an intermediate product term, which at the cost of extra additions reduces the number of real multiplications by one.

The original Karatsuba algorithm first computes three partial products:

u=a·c

v=b·s

w=(a+b)·(c+s)

The final result then is:

x=u−v

y=w−u−v

The Karatsuba algorithm achieves a reduction in the number of real multiplications at the cost of an increase in the number of adders, from two to five. Unfortunately this original form of the Karatsuba algorithm is not well suited for a DSP48 implementation, it would require two pre-adders in the same DSP48 for computing w and also two post adders for y - neither one is an option. A slightly modified version however leads to a more efficient implementation:

w=b·(c+s)

x=(a+b)·c−w

y=(a−b)·s+w

Now all the three extra additions can be mapped into the DSP48s pre-adders, so with the exception of a few fabric based pipeline registers no other fabric resources are required. The real cost of a Karatsuba style implementation is a total latency of five clocks instead of four and a loss of precision due to the use of pre-adders - while the classic four DSP48 implementation can work with operands up to 27x18 bits, the Karatsuba one is limited to 26x17 bits.

However, at the cost of a few fabric adders it is possible to restore the full 27x18 precision. The trick is to feed into the pre-adders operands shifted right by one bit. Based on the values of the two LSBs, a correction term must be added to the post adder, which can be done through the DSP48 C input port.

The Karatsuba algorithm is not the only way a reduction in the number of real multiplications required to implement a complex multiplication can be achieved. In many cases, the second complex operand is a complex exponential, where c and s are the cosine and sine of some angle, usually taken from a precomputed table - the complex multiplication then becomes a rotation in the complex plane. Good examples for this case are complex mixers and complex multiplications in FFT implementations. For these applications a different technique, called lifting can be used:

u=a−(1+s)c ·b

x=b+c·u

y=(1+s)c ·x−u

It is easy to verify that this indeed computes a complex plane rotation:

x=b+c·(a−(1+s)c ·b)=b+c·a−(1+s)·b=c·a−s·b

y=(1+s)c ·(c·a−s·b)−a+(1+s)c ·b=(1+s)·a−a−s·(1+s)c ·b+(1+s)c ·b=s·a+(1−s)·(1+s)c ·b=s·a+c·b , where we used the fact that c²+s²=1 .

This complex multiplier architecture, which only works when c²+s²=1 , uses the least number of operations, three multipliers and three adders. The DSP48 pre-adders are not being used so there is no loss of operand size due to that. The c and s values are normally held in Sin/Cos look-up tables, but for the lifting complex multiplier precomputed values for c and (1+s)/c need to be stored instead.

We will conclude the series of posts with a look at the new Versal FPGA family DSP58 version less of the DSP48 primitive.

Back to the top: The Art of FPGA Design The Art of FPGA Design

hcommin 4 months ago

The "lifting" technique sounds very interesting, but I can't imagine storing (1+s)/c in a LUT (directly) because that's going to explode towards +/- infinity around pi/2. There might be some clever trick to avoid needing to store the huge values, but it's not immediately obvious to me how to do it.

I also wonder about the memory requirement. Assuming the full sine period is needed (like in a mixer, but not an FFT), sine and cosine can be easily generated from just 1/4 of a sine period (or, less elegantly, from 1/8 of a sine and 1/8 of a cosine) stored in a LUT. Do similar symmetries exist in the "lifting" method, which could be exploited to reduce the memory requirement?

And, of course, I wonder about the numerical accuracy. How does the fixed-point error of "lifting" compare with an ordinary sine LUT (with the same word width)?

My gut feeling is that there is a heavy price to pay in order to save a couple of adders. I really hope I'm wrong because it's a fascinating idea.
- Cancel
- Vote Up 0 Vote Down
- Sign in to reply
- More
- Cancel