The Art of FPGA Design - Post 22

4 Dec 2018

The DSP48 Primitive - Behavioral Symmetric FIR Inference

The DSP48 primitive has an optional pre-adder, which can be used to compute expressions like PCOUT=PCIN+(A+D)*B. In a symmetric or anti-symmetric FIR each coefficient is shared by two taps, so adding (or subtracting) the two data samples before the multiplication halves the number of multipliers required.
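As a minimal sketch of the PCOUT=PCIN+(A+D)*B operation, the single tap below uses plain numeric_std SIGNED types rather than the SFIXED package used later in this post. The entity name PREADD_MAC_TAP, the 26-bit data inputs and the register arrangement are assumptions made for illustration only; the 27-bit pre-adder result, 18-bit coefficient and 48-bit cascade widths follow the DSP48E2.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD.all;

-- Hypothetical single tap, not part of the original filter code
entity PREADD_MAC_TAP is
  port(CLK:in STD_LOGIC;
       A,D:in SIGNED(25 downto 0);       -- 26 bits each so that A+D fits in 27 bits
       B:in SIGNED(17 downto 0);         -- coefficient
       PCIN:in SIGNED(47 downto 0);      -- partial sum from the previous tap
       PCOUT:out SIGNED(47 downto 0));   -- partial sum to the next tap
end PREADD_MAC_TAP;

architecture SKETCH of PREADD_MAC_TAP is
  signal AD:SIGNED(26 downto 0):=(others=>'0');
  signal BR:SIGNED(17 downto 0):=(others=>'0');
  signal M:SIGNED(44 downto 0):=(others=>'0');
  signal P:SIGNED(47 downto 0):=(others=>'0');
begin
  process(CLK)
  begin
    if rising_edge(CLK) then
      AD<=RESIZE(A,27)+RESIZE(D,27); -- pre-adder register, one extra bit avoids overflow
      BR<=B;                         -- coefficient register
      M<=AD*BR;                      -- 27x18 product is 45 bits wide
      P<=PCIN+RESIZE(M,48);          -- post-adder register, sign-extended product
    end if;
  end process;
  PCOUT<=P;
end SKETCH;
```

The sketch only describes the arithmetic and its pipeline registers; it makes no claim about how a particular synthesis tool maps the pre-adder.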

The following diagram shows how such a symmetric FIR is built using the case N=4, a symmetric FIR with 8 taps as an example:

The forward data delay line is identical to the one for the non-symmetric FIR, but now we have a second, backward data delay line. While pipelining the P adder cascade chain forced us to increase the number of forward delays per tap from one to two, the backward delay line needs the opposite change: the number of delays per tap must decrease from one to zero, so the D inputs of all the DSP48s are driven by a single copy of the input data delayed by 2*N. This does not scale, in the sense that as the number of taps increases this high-fanout net will become a critical timing bottleneck, but it is always possible to replicate the single 2*N delay line so that timing closure at full speed is still possible.

Anti-symmetric FIRs are just as easy to implement as symmetric ones, since the pre-adder can also subtract. Another interesting variation is filters with an even or odd number of taps. The odd case with 2*N-1 taps can be reduced to the even 2*N case by increasing the forward delay in the first DSP48 in the chain from one to two clocks and dividing the last (middle) coefficient by 2, so the two cases are not fundamentally different and can both be implemented with the same code, which would look like this:


library IEEE; 
use IEEE.STD_LOGIC_1164.all; 
use IEEE.NUMERIC_STD.all;  

use work.types_pkg.all; -- VHDL93 version of package providing SFIXED type support

entity SYMMETRIC_SYSTOLIC_FIR is
  generic(N:INTEGER; ODD:BOOLEAN:=FALSE; -- if ODD the FIR has 2*N-1 taps, else it has 2*N
          ANTISYMMETRIC:BOOLEAN:=FALSE;
          BEHAVIORAL:BOOLEAN:=TRUE);
  port(CLK:in STD_LOGIC;
       CI:in SFIXED_VECTOR; -- set of N symmetric coefficients; the filter has 2*N taps if even or 2*N-1 if odd, in which case set the middle coefficient to half the desired value
       I:in SFIXED;         -- forward data input 
       O:out SFIXED);       -- filter output 
end SYMMETRIC_SYSTOLIC_FIR;  

architecture TEST of SYMMETRIC_SYSTOLIC_FIR is
  signal ID:SFIXED(I'range); 
begin
  assert I'length<28 report "Input Data width must be 27 bits or less" severity warning;
  assert CI'length/N<19 report "Coefficient width must be 18 bits or less" severity warning;

  sd:entity work.SDELAY generic map(SIZE=>2*N-1)
                        port map(CLK=>CLK,
                                 I=>I,
                                 O=>ID);

  ib:if BEHAVIORAL generate
       type TAC is array(0 to N) of SFIXED(I'range);
       signal AC:TAC;
       type TPC is array(0 to N) of SFIXED(I'high+(CI'high+1)/N+LOG2(N) downto I'low+CI'low/N);
       signal PC:TPC;
     begin 
       AC(AC'low)<=I;
       PC(PC'low)<=(others=>'0');
       lk:for K in 0 to N-1 generate
            signal A1,A2,D,AD:SFIXED(I'range):=(others=>'0');
            signal B:SFIXED((CI'high+1)/N-1 downto CI'low/N):=(others=>'0');
            signal M:SFIXED(A2'high+B'high+1 downto A2'low+B'low):=(others=>'0');
            signal P:SFIXED(PC(K+1)'range):=(others=>'0');
          begin
            process(CLK)
            begin
              if rising_edge(CLK) then 
                D<=ID;
                if not ODD and K=0 then -- remove one A delay for the first tap if filter is even symmetric 
                  A2<=AC(K);
                else 
                  A1<=AC(K);
                  A2<=A1;
                end if;
                if ANTISYMMETRIC then 
                  AD<=RESIZE(D-A2,AD);
                else 
                  AD<=RESIZE(D+A2,AD);
                end if;
                B<=ELEMENT(CI,K,N); -- register for the coefficient inputs 
                M<=B*AD;            -- multiplier internal register 
                P<=RESIZE(M+PC(K),PC(K+1)); -- post-adder output register
              end if;
            end process;
            AC(K+1)<=A2; -- A cascade output 
            PC(K+1)<=P;  -- P cascade output
          end generate;
       O<=RESIZE(PC(PC'high),O'high,O'low); -- truncate the final sum to match the O output port range
     end generate; 
end TEST;
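
The delay bookkeeping behind the listing can be checked on paper. The derivation below is a sketch that assumes SDELAY delays its input by exactly SIZE clocks (its code is not shown here) and that the coefficients are static. The symbols $x[n]$ (input I), $y[n]$ (output O), $c_k$ (coefficient of tap K), $o$ and $L$ are notation introduced for this check, not names from the code.

```latex
\begin{align*}
\text{delay of A2 in tap } k &= 2k + 1 + o,
  \qquad o = \begin{cases} 0 & \text{ODD = FALSE} \\ 1 & \text{ODD = TRUE} \end{cases} \\
\text{delay of D in every tap} &= (2N-1) + 1 = 2N \\
\text{delay from AD input of tap } k \text{ to PC}(N) &= \underbrace{3}_{\text{AD, M, P}} + (N-1-k) = N + 2 - k \\[4pt]
\text{forward path total} &= (2k + 1 + o) + (N + 2 - k) = L + k,
  \qquad L = N + 3 + o \\
\text{backward path total} &= 2N + (N + 2 - k) = L + (2N - 1 - o - k) \\[4pt]
y[n] &= \sum_{k=0}^{N-1} c_k \left( x[n-k] + x[n-(2N-1-o-k)] \right)
\end{align*}
```

For $o=0$ the two indices cover $0 \ldots 2N-1$, which is the even case with 2*N taps. For $o=1$ the last tap $k=N-1$ adds the sample $x[n-(N-1)]$ to itself, which is why the middle coefficient has to be set to half the desired value. With ANTISYMMETRIC set the listing computes D-A2, so the bracket becomes $x[n-(2N-1-o-k)] - x[n-k]$. Under the same assumptions, $L$ is the latency from I to O in clocks.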

The top-level instantiation module is identical to the one we used for the non-symmetric FIR case in the previous post; we just have two more generics to select between even/odd and symmetric/anti-symmetric FIR structures.

Unfortunately, this symmetric FIR example is also where we reach the limits of behavioral inference. While the synthesis result is functionally correct, it is far from optimal. The symmetric FIR filter implementation should use just four DSP48s and a few LUTRAMs for the SDELAY, but the pre-adders are not mapped into the DSP48s and use fabric carry chains instead. The ideal synthesis result should consist only of the five blocks highlighted, four DSP48E2s and the SDELAY module:

DSP48 inference has its limits, and the same rule we used before applies: if we get the expected results, this is by far the best coding style. It is the most compact, the easiest to understand and maintain, and to a certain extent even portable. If we do not get the result we want, then primitive instantiations are the way to go. That design flow has drawbacks of its own, of course, and the DSP48 primitive with its 50 generics and 50 ports is an extreme example.

In the next post I will introduce a generic wrapper for the DSP48E2 primitive in an attempt to make instantiation easier and combine the advantages of the two design flows.


Top Comments

DAB over 6 years ago +1

Nice post. It would be really nice if you would run the filter on a waveform and walk us through how the filter works and clarify why this instantiation is better than an analog alternative. DAB

fpgaguru over 6 years ago in reply to DAB

Why DSP48 primitive instantiation is better, and when it should be used, will be the subject of the next few blog posts, so keep reading ;-).

As for general Digital Signal Processing theory and how these filters work, that could be the subject of a whole other blog. Maybe I will do that after I complete this one, but I plan on at least 50 weekly posts on The Art of FPGA Design, so it will be a while. In the meantime, here is a list of free DSP books that could be used to get familiar with the field:

Entry level, DSP with Python:
http://greenteapress.com/thinkdsp/thinkdsp.pdf

More advanced:
http://ptgmedia.pearsoncmg.com/images/9780137027415/samplepages/0137027419.pdf

Big, engineering oriented:
https://users.dimi.uniud.it/~antonio.dangelo/MMS/materials/Guide_to_Digital_Signal_Process.pdf

Rick Lyons, the author of the Understanding Digital Signal Processing books https://www.amazon.com/Richard-G.-Lyons/e/B00IZ4NJGK/ref=dp_byline_cont_book_1 maintains a list of free DSP resources here (it's 10 years old but still relevant):
https://www.dsprelated.com/showarticle/56.php