The Art of FPGA Design - Post 24

30 Apr 2019

The DSP48 Primitive - FIR with DSP48 Primitive Instantiations

After a break I am resuming my weekly posts on The Art of FPGA Design. In the last few posts we started looking at the DSP48 primitive, essentially a signed 27x18 multiplier that also includes a 27-bit pre-adder and a 3-input 48-bit post-adder. In older FPGA families like the 7-series the multiplier is 25x18, the pre-adder is 25 bits wide and the 48-bit post-adder has only two inputs. The DSP48 primitive also includes many pipeline registers, dedicated input and output cascade ports and dedicated routing resources, which make the implementation of sum-of-products operations such as FIRs extremely efficient. In particular, FIRs of any order can be built from just DSP48s, with few or no fabric resources, and run at the maximum data sheet clock frequency, which depending on FPGA family and speed grade can be as high as 891 MHz. Today's FPGAs can have from hundreds to over ten thousand DSP48s, so a lot of real-time digital signal processing can be done, and FPGAs tend to outperform equivalent implementations on CPUs, software DSPs or GPUs, especially when cost and power dissipation are also considered.

In Post 20 I introduced the DSP48 primitive, and in Post 21 and Post 22 we saw examples of non-symmetric and symmetric FIR inference from behavioral code, respectively. Post 23 discussed the pros and cons of behavioral inference versus primitive instantiation and introduced an instantiation wrapper for DSP48E2 primitives called DSP48E2GW, which we will use extensively in the next few posts.

In this post we will see how to build a non-symmetric FIR by instantiating DSP48 primitives - you may want to review Post 21 and Post 23 before moving forward.
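
As a reminder, and not something taken from the earlier posts, an N-tap non-symmetric FIR computes the sum of products below. The symbols are notation introduced here for illustration: x is the sample stream on port I, y is the filter result, and c_k is the coefficient of tap k, assuming ELEMENT(CI,K,N) returns the coefficient for tap K.

```latex
y[n] = \sum_{k=0}^{N-1} c_k \, x[n-k]
```

Each DSP48 in the chain contributes one term of this sum, and the P cascade adds the terms up.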

In Post 21 we saw a generic behavioral, or inferred, non-symmetric FIR called SYSTOLIC_FIR. It is generic in the sense that the filter order N is a generic parameter and can have any value. The input and output ports are unconstrained SFIXED and the coefficient port is an unconstrained SFIXED_VECTOR, so they can be arbitrary precision fixed point signals of any size and with any binary point position (as long as they stay within the maximum size allowed by the actual DSP48E2 ports). In particular, the input and output ports are not necessarily of the same type, and the design handles binary point alignment and resizing automatically. The design also had a generic called BEHAVIORAL, with a default value of TRUE. We will extend the Post 21 example with an alternative implementation, which instantiates DSP48s using the DSP48E2GW generic wrapper introduced in Post 23. We can then use the BEHAVIORAL generic to select between DSP48 inference and instantiation within the same design:

library IEEE; 
use IEEE.STD_LOGIC_1164.all; 
use IEEE.NUMERIC_STD.all;  
use work.types_pkg.all; -- VHDL93 version of package providing SFIXED type support  

entity SYSTOLIC_FIR is
  generic(N:INTEGER; 
          BEHAVIORAL:BOOLEAN:=TRUE);
  port(CLK:in STD_LOGIC; 
       CI:in SFIXED_VECTOR; -- set of N coefficients 
       I:in SFIXED;         -- forward data input 
       O:out SFIXED);       -- filter output 
end SYSTOLIC_FIR;  

architecture TEST of SYSTOLIC_FIR is 
begin
  assert I'length<28 report "Input Data width must be 27 bits or less" severity warning;
  assert CI'length/N<19 report "Coefficient width must be 18 bits or less" severity warning;

  ib:if BEHAVIORAL generate
       type TAC is array(0 to N) of SFIXED(I'range);
       signal AC:TAC;
       type TPC is array(0 to N) of SFIXED(I'high+(CI'high+1)/N+LOG2(N) downto I'low+CI'low/N);
       signal PC:TPC;
     begin 
       AC(AC'low)<=I; PC(PC'low)<=(others=>'0');
       lk:for K in 0 to N-1 generate
            signal A1,A2:SFIXED(I'range):=(others=>'0');
            signal B:SFIXED((CI'high+1)/N-1 downto CI'low/N):=(others=>'0');
            signal M:SFIXED(A2'high+B'high+1 downto A2'low+B'low):=(others=>'0');
            signal P:SFIXED(PC(K+1)'range):=(others=>'0');
          begin
            process(CLK)
            begin
              if rising_edge(CLK) then
                if K=0 then -- for the first tap the A cascade delay is one clock 
                  A2<=AC(K);
                else        -- for all the other taps the A cascade delay is two clocks 
                  A1<=AC(K);
                  A2<=A1;
                end if;
                B<=ELEMENT(CI,K,N);    -- register for the coefficient inputs 
                M<=B*A2;               -- multiplier internal register 
                P<=RESIZE(M+PC(K),PC(K+1)); -- post-adder output register
              end if;
            end process;
            AC(K+1)<=A2; -- A cascade output 
            PC(K+1)<=P;  -- P cascade output
          end generate;
          O<=RESIZE(PC(PC'high),O'high,O'low); -- truncate the final sum to match the O output port range
     end generate;

  ip:if not BEHAVIORAL generate
       type TAC is array(0 to N) of STD_LOGIC_VECTOR(29 downto 0);
       signal AC:TAC; -- A cascade
       type TPC is array(0 to N) of STD_LOGIC_VECTOR(47 downto 0);
       signal PC:TPC; -- P cascade
     begin 
       AC(AC'low)<=(others=>'0');
       PC(PC'low)<=(others=>'0');
       lk:for K in 0 to N-1 generate
            signal D:SFIXED(I'range):=(others=>'0');
            signal C:SFIXED(I'high+(CI'high+1)/N+LOG2(N) downto I'low+CI'low/N):=(others=>'0'); -- zero, same range as P
            signal P:SFIXED(I'high+(CI'high+1)/N+LOG2(N) downto I'low+CI'low/N);
            signal INMODE:STD_LOGIC_VECTOR(4 downto 0);
            signal OPMODE:STD_LOGIC_VECTOR(8 downto 0);

            function A_INPUT(K:INTEGER) return STRING is
            begin
              if K=0 then
                return "DIRECT";
              else
                return "CASCADE";
              end if;
            end;

            function AREG(K:INTEGER) return INTEGER is
            begin
              if K=0 then
                return 1; -- for the first tap the A cascade delay is one clock
              else
                return 2; -- for all the other taps the A cascade delay is two clocks
              end if;
            end;
          begin 
            OPMODE<="110000101" when K=0 else "110010101"; -- P=C+A*B when K=0 else P=C+PCIN+A*B 
            ds:entity work.DSP48E2GW generic map(A_INPUT=>A_INPUT(K),
                                                 AREG=>AREG(K))
                                     port map(CLK=>CLK,
                                              A=>I,
                                              B=>ELEMENT(CI,K,N),
                                              C=>C, -- zero 
                                              D=>D, -- zero 
                                              ACIN=>AC(K),
                                              PCIN=>PC(K),
                                              OPMODE=>OPMODE,
                                              ACOUT=>AC(K+1),
                                              PCOUT=>PC(K+1),
                                              P=>P);
            i0:if K=N-1 generate 
                 O<=RESIZE(P,O'high,O'low);
               end generate;
          end generate;
     end generate; 
end TEST;

The first generate, when BEHAVIORAL=TRUE, is just a repeat of the code in Post 21. The second generate, when BEHAVIORAL=FALSE, is the new structural version with DSP48E2 primitive instantiations. The code is less readable than the inference version, since you really need to understand what all the DSP48E2 ports and generics do, meaning you need to actually open UG579 and read it, but it is not much larger. The DSP48E2GW generic wrapper drastically reduces the number of generics and ports that need to be explicitly defined and connected, and the A, B, C, D and P ports are the same unconstrained SFIXED types as those used in the behavioral version. The two functions A_INPUT and AREG show how to make DSP48 generics depend on the position of the primitive in the chain: the configuration of the first one (when K=0) is slightly different from all the others.
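
The OPMODE literals in the structural branch are easier to read once they are split into their multiplexer fields. The sketch below assumes the DSP48E2 field layout, with W in bits 8:7, Z in bits 6:4, Y in bits 3:2 and X in bits 1:0. The package name OPMODE_PKG, the constants and the function FIR_OPMODE are hypothetical names introduced here for illustration, not part of the original design or of the DSP48E2GW wrapper.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.all;

-- Hypothetical helper package: names are illustrative only
package OPMODE_PKG is
  -- OPMODE(8 downto 7): W multiplexer
  constant W_C:STD_LOGIC_VECTOR(1 downto 0):="11";     -- W=C
  -- OPMODE(6 downto 4): Z multiplexer
  constant Z_ZERO:STD_LOGIC_VECTOR(2 downto 0):="000"; -- Z=0
  constant Z_PCIN:STD_LOGIC_VECTOR(2 downto 0):="001"; -- Z=PCIN
  -- OPMODE(3 downto 2) and OPMODE(1 downto 0): Y and X multiplexers,
  -- both must select the multiplier output when the multiplier is used
  constant Y_M:STD_LOGIC_VECTOR(1 downto 0):="01";     -- Y=M
  constant X_M:STD_LOGIC_VECTOR(1 downto 0):="01";     -- X=M

  -- OPMODE for tap K of the FIR chain
  function FIR_OPMODE(K:INTEGER) return STD_LOGIC_VECTOR;
end OPMODE_PKG;

package body OPMODE_PKG is
  function FIR_OPMODE(K:INTEGER) return STD_LOGIC_VECTOR is
  begin
    if K=0 then
      return W_C&Z_ZERO&Y_M&X_M; -- "110000101", P=C+A*B
    else
      return W_C&Z_PCIN&Y_M&X_M; -- "110010101", P=C+PCIN+A*B
    end if;
  end;
end OPMODE_PKG;
```

With use work.OPMODE_PKG.all added to the context clause, the conditional assignment in the generate loop could be written as OPMODE<=FIR_OPMODE(K); which keeps the position-dependent configuration in one place, in the same spirit as the A_INPUT and AREG functions.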

The top level design used to test this implementation is the same TEST_SYSTOLIC_FIR from Post 21. We just have to change the default value of the BEHAVIORAL generic in the TEST_SYSTOLIC_FIR entity definition from TRUE to FALSE to get the second version. The final result should be the same in both cases.
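
One way to see what both versions should compute is to trace the registers of the behavioral branch. The sketch below assumes constant coefficients and uses notation introduced here only: x[n] is the sample on I at clock n, c_K is the coefficient of tap K, a_K is the A2 register and P_K is the P register of tap K.

```latex
\begin{aligned}
a_0[n] &= x[n-1], \qquad a_K[n] = a_{K-1}[n-2] \;\Rightarrow\; a_K[n] = x[n-1-2K] \\
P_K[n] &= c_K \, a_K[n-2] + P_{K-1}[n-1], \qquad P_{-1}[n] = 0 \\
P_{N-1}[n] &= \sum_{k=0}^{N-1} c_k \, x[n-(N+2)-k]
\end{aligned}
```

If this trace is right, the last P register holds the direct-form FIR sum delayed by N+2 clocks. The structural branch, with AREG set to 1 for the first tap and 2 for all the others, is meant to reproduce the same register pattern, assuming the DSP48E2GW defaults enable the B, M and P registers.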

For this particular design, where the inferred version actually does what is expected, the structural version does not provide any significant advantage, but it is a good exercise to illustrate the two coding techniques side by side. When inference from behavioral code does not produce optimal results, the structural coding style with DSP48E2 primitive instantiations is the better approach.

In the next post I will provide the structural version for the symmetric FIR case and then we will start looking at more examples of DSP48 use.
