The Art of FPGA Design - Post 27

21 May 2019

The DSP48 Primitive - Small Multiplications - Two For the Price of One

The DSP48E2 primitive contains a signed 27x18 multiplier, so any signed multiplication up to this size can be implemented with just one such primitive. Unsigned multiplications are of course possible if you add a zero MSB to the operands and then treat them as signed, but the largest unsigned multiplication that can be done that way with one DSP48E2 is 26x17.
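As a minimal sketch of the zero-extension trick, the entity below multiplies two unsigned operands by prepending a '0' to each and using a signed multiplier. The entity name UMUL26x17, the port names and the three-register pipeline are assumptions made for illustration, and it uses plain NUMERIC_STD types rather than the types_pkg used elsewhere in this series. Whether a particular synthesis tool maps this onto exactly one DSP48E2 is not something the sketch guarantees.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD.all;

-- Hypothetical example: unsigned 26x17 multiply built on a signed 27x18 one
entity UMUL26x17 is
  port(CLK:in STD_LOGIC;
       A:in UNSIGNED(25 downto 0);
       B:in UNSIGNED(16 downto 0);
       P:out UNSIGNED(42 downto 0));
end UMUL26x17;

architecture RTL of UMUL26x17 is
  signal SA:SIGNED(26 downto 0):=(others=>'0');
  signal SB:SIGNED(17 downto 0):=(others=>'0');
  signal SP:SIGNED(44 downto 0):=(others=>'0');
begin
  process(CLK)
  begin
    if rising_edge(CLK) then
      SA<=SIGNED('0'&A);               -- zero MSB, value is never negative
      SB<=SIGNED('0'&B);
      SP<=SA*SB;                       -- signed 27x18 product, 45 bits
      P<=UNSIGNED(SP(42 downto 0));    -- the top two bits are always '0'
    end if;
  end process;
end RTL;
```

The extra MSB on each operand is what reduces the usable unsigned size from 27x18 to 26x17.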

When the operands are much smaller it becomes possible to implement two signed multiplications with a single DSP48E2. The largest signed operand size for which this is possible is 9x9, and a further restriction is that one operand must be shared between the two multipliers. So we can do OA=IA*IC and OB=IB*IC at the same time with one DSP48E2, where IA, IB and IC are signed 9-bit numbers.

The basic idea is to concatenate IB and IA, separated by 9 zero bits, as the 27-bit DSP48E2 A input and to sign extend IC as the 18-bit DSP48E2 B input. The two 18-bit signed results OA and OB can then be extracted from the 48-bit DSP48E2 output. If the signed inputs IA and IC are positive, or if we interpret the 9-bit IA, IB and IC operands as unsigned, this actually works really well. However, when IA or IC is negative, OB and even OA will be incorrect. This can be corrected using the DSP48E2 C input port: the situations that generate incorrect results are detected and a non-zero correction value is applied to the C port, which, added to the A*B product, produces the expected output.
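One way to see where the correction comes from is to write out the packed product. The derivation below is a sketch based on my reading of the packing described above; the bracket notation [x<0] (1 when x is negative, 0 otherwise) and the symbol C_corr are introduced here for illustration only.

```latex
\begin{aligned}
A &= 2^{18}\,IB + \bigl(IA + 2^{9}\,[IA<0]\bigr)
   && \text{(the low 9 bits of A hold IA read as unsigned)}\\
A\cdot B &= 2^{18}\,IB\cdot IC + IA\cdot IC + 2^{9}\,IC\,[IA<0]\\
P_{\text{wanted}} &= 2^{18}\,IB\cdot IC + \bigl(IA\cdot IC + 2^{18}\,[IA\cdot IC<0]\bigr)
   && \text{(low field is } IA\cdot IC \bmod 2^{18}\text{)}\\
C_{\text{corr}} &= P_{\text{wanted}} - A\cdot B
   = 2^{18}\,[IA\cdot IC<0] \;-\; 2^{9}\,IC\,[IA<0]
\end{aligned}
```

The first term cancels the borrow that a negative IA*IC takes from the OB field, and the second removes the extra 2^9*IC caused by IA being zero-padded rather than sign-extended. Evaluating the expression case by case gives 2^18 - 2^9*IC for IA<0 and IC>0, -2^9*IC for IA<0 and IC<=0, 2^18 for IA>0 and IC<0, and 0 otherwise.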

As was the case with earlier examples, a USE_LUTs generic is used to select between a behavioral, fabric based implementation and the DSP48E2 based one:

library IEEE; 
use IEEE.STD_LOGIC_1164.all; 
use IEEE.NUMERIC_STD.all;  

use work.types_pkg.all;  

entity DUAL_MUL9x9 is
  generic(USE_LUTs:BOOLEAN:=FALSE);
  port(CLK:in STD_LOGIC;
       IA,IB,IC:in SFIXED(8 downto 0);
       OA,OB:out SFIXED(17 downto 0)); 
end DUAL_MUL9x9;  

architecture TEST of DUAL_MUL9x9 is 
begin 
  i0:if USE_LUTs generate
       signal A1:SFIXED(IA'range):=(others=>'0');
       signal B1:SFIXED(IB'range):=(others=>'0');
       signal C1:SFIXED(IC'range):=(others=>'0');
       signal MA:SFIXED(IA'high+IC'high+1 downto IA'low+IC'low):=(others=>'0');
       signal MB:SFIXED(IB'high+IC'high+1 downto IB'low+IC'low):=(others=>'0');
     begin
       process(CLK)
       begin
         if rising_edge(CLK) then 
           A1<=IA;
           B1<=IB;
           C1<=IC;
           MA<=A1*C1;
           MB<=B1*C1;
           OA<=RESIZE(MA,OA);
           OB<=RESIZE(MB,OB);
         end if;
       end process;
     end;
     else i1: generate
       signal A:SFIXED(26 downto 0);
       signal B:SFIXED(17 downto 0);
       signal C:SFIXED(47 downto 0):=(others=>'0');
       signal D:SFIXED(26 downto 0);
       signal P:SFIXED(47 downto 0);
     begin 
       A<=IB&"000000000"&IA;
       B<=RESIZE(IC,B);
       D<=TO_SFIXED(0.0,D); 
-- correction term applied to the DSP48E2 C input port
       process(CLK)
       begin
         if rising_edge(CLK) then
           if IA(IA'high)='1' then 
             C<=RESIZE(-SHIFT_LEFT(IC,9),C);
             C(47 downto 18)<=30x"000000000";
           elsif IC(IC'high)='1' then
             if IA=TO_SFIXED(0.0,IA) then 
               C<=48x"000000000000";
             else 
               C<=48x"000000040000";
             end if;
           else 
             C<=48x"000000000000";
           end if;
         end if;
       end process;

       dx:entity work.DSP48E2GW port map(CLK=>CLK,
                                         INMODE=>"10001",     -- use A1 and B1 
                                         OPMODE=>"000110101", -- P<=C+A*B 
                                         A=>A,
                                         B=>B,
                                         C=>C,
                                         D=>D,
                                         P=>P);
       OA<=P(17 downto 0);
       OB<=P(35 downto 18);
     end;
     end generate;
end TEST;

At the cost of 8 LUTs and 10 FFs we were able to implement two 9x9 signed multiplications with a shared input in a single DSP48E2. The fabric based version uses 198 LUTs and 141 FFs, and the two versions are functionally equivalent. Both can run at the maximum clock speed permitted by the data sheet for a particular FPGA family and speed grade. If the 9-bit operands are unsigned then no C correction factor is necessary.

One final point to make here: the design is not trivial, it is not obvious what exact correction values are required for various signed operand values, and it is easy to misconfigure the DSP48E2 primitive. The most insidious problems are rare corner cases, when the design seems to work well but still fails for some very particular input value combinations. To make sure the design is correct I created a testbench that does an exhaustive comparison with a golden model for all the 2**27 possible input values. This testbench is a good example of how to do this kind of testing, so here it is:

library IEEE; 
use IEEE.STD_LOGIC_1164.ALL;  

library STD; 
use STD.textio.all; 
use STD.env.all;  

use work.types_pkg.all;  

entity TB_DUAL_MUL9x9 is
  generic(USE_LUTs:BOOLEAN:=FALSE); 
end TB_DUAL_MUL9x9;  

architecture TB of TB_DUAL_MUL9x9 is
  signal CLK:STD_LOGIC:='1';
  signal IA,IB,IC:SFIXED(8 downto 0):=(others=>'0');
  signal OA,OB:SFIXED(17 downto 0);
  signal OA1,OB1,OA2,OB2,OA3,OB3:SFIXED(17 downto 0):=(others=>'0'); 
begin 
  CLK<=not CLK after 1.0 ns; 
-- input stimulus
  process
  begin
    wait until rising_edge(CLK);
    for AI in -256 to 255 loop 
      IA<=TO_SFIXED(REAL(AI),IA);
      for BI in -256 to 255 loop 
        IB<=TO_SFIXED(REAL(BI),IB);
        for CI in -256 to 255 loop 
          IC<=TO_SFIXED(REAL(CI),IC);
          wait until rising_edge(CLK);
        end loop;
      end loop;
    end loop;
    for K in 1 to 3 loop
      wait until rising_edge(CLK);
    end loop;
    stop(0);
  end process; 
-- unit under test 
  uut:entity work.DUAL_MUL9x9 generic map(USE_LUTs=>USE_LUTs)
                              port map(CLK=>CLK,
                                       IA=>IA,
                                       IB=>IB,
                                       IC=>IC,
                                       OA=>OA,
                                       OB=>OB); 
-- golden reference model with the same latency as the DSP48E2, 3 clocks
  process(CLK)
  begin
    if rising_edge(CLK) then 
      OA1<=IA*IC;
      OB1<=IB*IC;
      OA2<=OA1;
      OB2<=OB1;
      OA3<=OA2;
      OB3<=OB2;
    end if;
  end process; 
-- compare the two outputs and report any mismatches
  process
  begin
    wait until rising_edge(CLK);
    assert OA=OA3 and OB=OB3 report "Data Mismatch!" severity error;
  end process; 
end TB;

Exhaustive testing like this is not always possible, but in this case it works because the operands are small, and it ensures that the design functionality is correct.
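The same exhaustive approach can also be applied to the arithmetic alone, before any DSP48E2 configuration is involved. The following simulation-only sketch is not part of the original design: the entity name CHECK_PACKED_MUL is hypothetical, it uses NUMERIC_STD SIGNED instead of the SFIXED type from types_pkg, and it writes the correction as a sum of two terms rather than the if/elsif form used in DUAL_MUL9x9, on the assumption that the two forms produce the same values. The intent is that it reports zero mismatches.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD.all;

-- Hypothetical arithmetic-only check of P = A*B + C for all 2**27 inputs
entity CHECK_PACKED_MUL is
end CHECK_PACKED_MUL;

architecture SIM of CHECK_PACKED_MUL is
begin
  process
    variable IA,IB,IC:SIGNED(8 downto 0);
    variable A:SIGNED(26 downto 0);
    variable B:SIGNED(17 downto 0);
    variable C,P:SIGNED(47 downto 0);
    variable ERRORS:NATURAL:=0;
  begin
    for AI in -256 to 255 loop
      IA:=TO_SIGNED(AI,9);
      for BI in -256 to 255 loop
        IB:=TO_SIGNED(BI,9);
        for CI in -256 to 255 loop
          IC:=TO_SIGNED(CI,9);
          A:=IB&"000000000"&IA;        -- same packing as the DSP48E2 A input
          B:=RESIZE(IC,18);            -- sign extended shared operand
          C:=(others=>'0');
          if AI<0 then                 -- undo the zero padding of a negative IA
            C:=C-SHIFT_LEFT(RESIZE(IC,48),9);
          end if;
          if (AI<0 and CI>0) or (AI>0 and CI<0) then
            C:=C+SHIFT_LEFT(TO_SIGNED(1,48),18); -- IA*IC<0 borrows from OB
          end if;
          P:=RESIZE(A*B,48)+C;
          if P(17 downto 0)/=IA*IC or P(35 downto 18)/=IB*IC then
            ERRORS:=ERRORS+1;
          end if;
        end loop;
      end loop;
    end loop;
    report "mismatches: "&INTEGER'image(ERRORS);
    wait;                              -- stop after a single pass
  end process;
end SIM;
```

Because all iterations run in zero simulation time with bit-vector arithmetic, the full sweep can take a while in some simulators; narrowing the BI loop range is a reasonable shortcut while experimenting.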

What about multiplications larger than 27x18? In the next post I will look at how to implement multiple precision multipliers with more than one DSP48E2.
