The DSP48 Primitive - Behavioral Symmetric FIR Inference
The DSP48 primitive has an optional pre-adder, which lets it compute expressions such as PCOUT=PCIN+(A+D)*B. When used to implement symmetric or anti-symmetric FIRs, this cuts the number of multipliers in half.
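To see where the factor of two comes from, here is a short derivation for the even case. It assumes a filter with 2*N taps whose impulse response satisfies h[k]=h[2N-1-k]; the symbols h, x and y are notation introduced for this note, not names from the code.

```latex
\begin{aligned}
y[n] &= \sum_{k=0}^{2N-1} h[k]\,x[n-k] \\
     &= \sum_{k=0}^{N-1} h[k]\,\bigl(x[n-k] + x[n-(2N-1-k)]\bigr)
\end{aligned}
```

Each term of the second sum has the form (A+D)*B, where A and D are the two data samples that share a coefficient and B is that coefficient, so N pre-adder and multiplier pairs do the work of 2*N multipliers. In the anti-symmetric case, h[k]=-h[2N-1-k], the sum inside the parentheses becomes a difference.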
The following diagram shows how such a symmetric FIR is built, using N=4 (a symmetric FIR with 8 taps) as an example:
The forward data delay line is identical to the one for the non-symmetric FIR, but now we have a second, backward data delay line. While the pipelining of the P adder cascade chain forced us to increase the number of forward delays per tap from one to two, the backward delay line needs the number of delays per tap decreased from one to zero: the D inputs of all the DSP48s are driven by a single version of the input data delayed by 2*N. This is not scalable, in the sense that as the number of taps increases this high-fanout net will become a critical timing bottleneck. However, it is always possible to replicate the single 2*N delay line in a way that still makes timing closure at full speed possible.
Both symmetric and anti-symmetric FIRs are easy to implement, since the pre-adder can also subtract.
Another interesting variation is symmetric FIRs with an even or an odd number of taps. The odd case with 2*N-1 taps can easily be reduced to the even 2*N case by increasing the forward delay in the first DSP48 in the chain from one to two clocks and dividing the last coefficient by 2. The two cases are therefore not fundamentally different and can both be implemented with the same code, which would look like this:
library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD.all;
use work.types_pkg.all; -- VHDL93 version of package providing SFIXED type support

entity SYMMETRIC_SYSTOLIC_FIR is
  generic(N:INTEGER;
          ODD:BOOLEAN:=FALSE; -- if ODD the FIR has 2*N-1 taps, else it has 2*N
          ANTISYMMETRIC:BOOLEAN:=FALSE;
          BEHAVIORAL:BOOLEAN:=TRUE);
  port(CLK:in STD_LOGIC;
       -- set of N symmetric coefficients, the filter has 2*N taps if even or 2*N-1 if odd;
       -- in the odd case set the middle coefficient to half the desired value
       CI:in SFIXED_VECTOR;
       I:in SFIXED; -- forward data input
       O:out SFIXED); -- filter output
end SYMMETRIC_SYSTOLIC_FIR;

architecture TEST of SYMMETRIC_SYSTOLIC_FIR is
  signal ID:SFIXED(I'range);
begin
  assert I'length<28 report "Input Data width must be 27 bits or less" severity warning;
  assert CI'length/N<19 report "Coefficient width must be 18 bits or less" severity warning;

  -- single backward delay line shared by the D inputs of all taps
  sd:entity work.SDELAY generic map(SIZE=>2*N-1)
                        port map(CLK=>CLK,
                                 I=>I,
                                 O=>ID);

  ib:if BEHAVIORAL generate
    type TAC is array(0 to N) of SFIXED(I'range);
    signal AC:TAC;
    type TPC is array(0 to N) of SFIXED(I'high+(CI'high+1)/N+LOG2(N) downto I'low+CI'low/N);
    signal PC:TPC;
  begin
    AC(AC'low)<=I;
    PC(PC'low)<=(others=>'0');
    lk:for K in 0 to N-1 generate
      signal A1,A2,D,AD:SFIXED(I'range):=(others=>'0');
      signal B:SFIXED((CI'high+1)/N-1 downto CI'low/N):=(others=>'0');
      signal M:SFIXED(A2'high+B'high+1 downto A2'low+B'low):=(others=>'0');
      signal P:SFIXED(PC(K+1)'range):=(others=>'0');
    begin
      process(CLK)
      begin
        if rising_edge(CLK) then
          D<=ID; -- D input register, fed by the backward delay line
          if not ODD and K=0 then -- remove one A delay for the first tap if filter is even symmetric
            A2<=AC(K);
          else
            A1<=AC(K);
            A2<=A1;
          end if;
          -- pre-adder output register
          if ANTISYMMETRIC then
            AD<=RESIZE(D-A2,AD);
          else
            AD<=RESIZE(D+A2,AD);
          end if;
          B<=ELEMENT(CI,K,N); -- register for the coefficient inputs
          M<=B*AD; -- multiplier internal register
          P<=RESIZE(M+PC(K),PC(K+1)); -- post-adder output register
        end if;
      end process;
      AC(K+1)<=A2; -- A cascade output
      PC(K+1)<=P; -- P cascade output
    end generate;
    O<=RESIZE(PC(PC'high),O'high,O'low); -- truncate the final sum to match the O output port range
  end generate;
end TEST;
The top level instantiation module is identical to the one we used for the non-symmetric FIR case in the previous post; we just have two more generics to select between even/odd and symmetric/anti-symmetric FIR structures.
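As a sanity check on those two generics, the following sketch writes out what the behavioral code computes. It ignores the constant pipeline latency and assumes that SDELAY delays its input by SIZE clocks (its source is not shown here). The symbols c_k (element k of CI), x (the I input) and y (the final sum before truncation to O) are notation introduced for this note.

```latex
\begin{aligned}
\text{ODD = FALSE:}\quad y[n] &= \sum_{k=0}^{N-1} c_k\,\bigl(x[n-(2N-1-k)] \pm x[n-k]\bigr) \\
\text{ODD = TRUE:}\quad  y[n] &= \sum_{k=0}^{N-1} c_k\,\bigl(x[n-(2N-2-k)] \pm x[n-k]\bigr)
\end{aligned}
```

The sign is + by default and - when ANTISYMMETRIC is set, matching D+A2 and D-A2 in the code. In the ODD case the k=N-1 term pairs x[n-(N-1)] with itself, so the middle tap is effectively weighted by 2*c_{N-1}, which is why that coefficient has to be supplied as half the desired value. With both ODD and ANTISYMMETRIC set, the middle term cancels.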
Unfortunately, this symmetric FIR example is also where we reach the limits of behavioral inference. While the synthesis result is functionally correct, it is far from optimal. The symmetric FIR filter implementation should use just 4 DSP48s and a few LUTRAMs for the SDELAY, but the pre-adders are not mapped into the DSP48s and use fabric carry chains instead. The ideal synthesis result should consist only of the five blocks highlighted, four DSP48E2s and the SDELAY module:
DSP48 inference has its limits, and the same rule we used before applies: if we get the expected results, this is by far the best coding style. It is the most compact, the easiest to understand and maintain, and to a certain extent even portable. If we do not get the result we want, then primitive instantiations are the way to go. That design flow has drawbacks of its own, of course, and the DSP48 primitive with its 50 generics and 50 ports is an extreme example of that.
In the next post I will introduce a generic wrapper for the DSP48E2 primitive in an attempt to make instantiation easier and combine the advantages of the two design flows.