The Art of FPGA Design - Post 9

4 Sep 2018

The Universal DELAY Building Block Part 2, the one with the cake

In the last post I introduced a universal delay block: a behavioral implementation of a reusable module that delays a signal by an arbitrary but fixed number of clock cycles. The delay size is a generic and the signal width is unconstrained, which is what makes the design reusable. While the behavioral implementation is compact and elegant, the synthesis result is not always what we need, because it is left to the synthesis tool to decide how to map our behavioral code onto the available FPGA primitives.

This is a double-edged sword. On one hand, it's nice to describe the design at a higher level and not have to worry about the implementation details. If the synthesis results are good, or at least not critical to the design performance in terms of size and speed, this is actually the correct approach. On the other hand, if the performance of your design depends heavily on the actual implementation, for example if you have a lot of wide and deep delays or you need to achieve a very high clock speed, the implementation does matter and the best approach is to instantiate primitives and build exactly what you need. This is of course a lower level coding style and usually not desirable; you only do it if you have no other choice. Both methods have their advantages and disadvantages - this post is about how to have your cake and eat it too.

Today's challenge is to modify the SDELAY block I created in the last post so that the implementation is controlled either by a user-provided generic or automatically, as a function of the SIZE generic parameter.

Let's first review the implementation options we have for delay lines. We can use a chain of FFs, the dedicated shift register primitives SRL16 and SRL32, or a RAM and a counter. There are three kinds of RAM resources in Xilinx FPGAs that could be used for the latter: distributed RAM, Block RAM and Ultra RAM. As a general rule, for SIZE=2 or less FFs should be used; for SIZE from 3 to 17 SRL16s are the best choice (two of them can be mapped per LUT6 if they have the same size); from 18 to 33 SRL32s are better (one per LUT6). For SIZE greater than 33 a RAM+counter implementation is usually the best choice. The boundary between distributed RAM, BRAM and URAM is a gray area, but as a general rule SIZE=256 or 512 is a reasonable threshold between distributed RAM and BRAM, and SIZE=4096 between BRAM and URAM (not all Xilinx FPGAs have URAM resources).
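As a rough sketch, the rule of thumb above can be written down as a small VHDL function. The package name delay_rules_pkg, the DELAY_KIND type and the PICK_DELAY_KIND function are hypothetical names introduced only for illustration, and the thresholds are simply the ones quoted above, so they may need tuning for a particular device family:

```vhdl
-- Hypothetical helper package, not part of the original SDELAY code
package delay_rules_pkg is
  type DELAY_KIND is (USE_FF,USE_SRL16,USE_SRL32,USE_RAM);
  function PICK_DELAY_KIND(SIZE:NATURAL) return DELAY_KIND;
end package;

package body delay_rules_pkg is
  function PICK_DELAY_KIND(SIZE:NATURAL) return DELAY_KIND is
  begin
    if SIZE<=2 then
      return USE_FF;    -- very shallow delays: plain flip-flops
    elsif SIZE<=17 then
      return USE_SRL16; -- SRL16 plus an output register, two SRL16s can share a LUT6
    elsif SIZE<=33 then
      return USE_SRL32; -- SRL32 plus an output register, one per LUT6
    else
      return USE_RAM;   -- RAM+counter; distributed RAM, BRAM or URAM depending on depth
    end if;
  end function;
end package body;
```

The choice between the three RAM flavors is deliberately left out of the sketch, since that boundary is the gray area mentioned above.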

The easiest way to control how the delay is mapped into primitives is to keep a behavioral coding style and use synthesis attributes to direct the synthesis tool towards a particular implementation. There are two relevant attributes, SHREG_EXTRACT and SRL_STYLE. If SHREG_EXTRACT is set to "no", no shift registers will be inferred and you will get a ton of FFs instead. Unless you have very shallow delays with SIZE=2 or 3 and you want to save SRL resources, this attribute should be set to "yes", which is also the default setting. The second attribute is more interesting. If SHREG_EXTRACT="yes" then SRL_STYLE can be set to "register", "srl", "srl_reg", "reg_srl", "reg_srl_reg" or "block". They do exactly what they say, and "reg_srl_reg" seems to be the default value, which explains the synthesis result we saw last time. A synthesis attribute that controls how the shift register is implemented can be used in the VHDL code like this:

library IEEE; 
use IEEE.STD_LOGIC_1164.all;  
use work.types_pkg.all;  

entity SDELAY is
  generic(SIZE:NATURAL:=1);  -- SIZE has a default value of 1 and cannot be negative, this would require traveling back in time
  port(CLK:in STD_LOGIC:='0'; -- an input port with a default value can be left unconnected, this could be useful when SIZE=0
       I:in SFIXED;
       O:out SFIXED); 
end SDELAY;  

architecture TEST of SDELAY is
  signal D:SFIXED_VECTOR(0 to SIZE)(I'range):=(others=>TO_SFIXED(0.0,I)); -- delay line signal is SIZE+1 in length
  attribute srl_style:STRING;
  attribute srl_style of D:signal is "srl_reg"; -- acceptable values are "register", "srl", "srl_reg", "reg_srl", "reg_srl_reg" or "block"
begin 
  D(D'low)<=I;
  process(CLK)
  begin
    if rising_edge(CLK) then
      for K in 1 to SIZE loop 
        D(K)<=D(K-1); -- when SIZE=0 this is never executed and we get no registers, just a wire between I and O
      end loop;
    end if;
  end process;
  O<=D(D'high); 
end TEST;

The problem with this code is that to change the implementation you would have to change the SDELAY code itself, and then the module is no longer reusable. It is much better to add a new STRING generic and control the implementation from outside the SDELAY module:

library IEEE; 
use IEEE.STD_LOGIC_1164.all;  
use work.types_pkg.all;  

entity SDELAY is
  generic(SIZE:NATURAL:=1;    -- SIZE has a default value of 1 and cannot be negative, this would require traveling back in time
          STYLE:STRING:="srl_reg");  -- acceptable values are "register", "srl", "srl_reg", "reg_srl", "reg_srl_reg" or "block"
  port(CLK:in STD_LOGIC:='0'; -- an input port with a default value can be left unconnected, this could be useful when SIZE=0
       I:in SFIXED; O:out SFIXED); 
end SDELAY;  

architecture TEST of SDELAY is
  signal D:SFIXED_VECTOR(0 to SIZE)(I'range):=(others=>TO_SFIXED(0.0,I)); -- delay line signal is SIZE+1 in length
  attribute srl_style:STRING; attribute srl_style of D:signal is STYLE; 
begin 
  D(D'low)<=I;
  process(CLK)
  begin
    if rising_edge(CLK) then
      for K in 1 to SIZE loop 
        D(K)<=D(K-1); -- when SIZE=0 this is never executed and we get no registers, just a wire between I and O
      end loop;
    end if;
  end process;
  O<=D(D'high); 
end TEST;

Now we can not only instantiate an SDELAY module that will delay an SFIXED signal of any range by an arbitrary SIZE amount, but we can also control the way the delay is implemented (as FFs, SRLs or BRAMs) using the STYLE generic.
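A minimal sketch of what such instantiations could look like is shown here. The SDELAY_DEMO entity, its port names and the SFIXED(7 downto -8) range are hypothetical and only serve the illustration; the sketch also assumes that SFIXED from types_pkg is indexed by integers, with negative indices for the fractional bits:

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.all;
use work.types_pkg.all; -- the author's package providing SFIXED

entity SDELAY_DEMO is -- hypothetical wrapper, for illustration only
  port(CLK:in STD_LOGIC;
       I:in SFIXED(7 downto -8);                  -- assumed range
       O_FF,O_SRL,O_BRAM:out SFIXED(7 downto -8));
end SDELAY_DEMO;

architecture TEST of SDELAY_DEMO is
begin
  -- shallow delay, ask for plain flip-flops
  u_ff:entity work.SDELAY generic map(SIZE=>2,STYLE=>"register")
                          port map(CLK=>CLK,I=>I,O=>O_FF);
  -- medium delay, SRL followed by an output register
  u_srl:entity work.SDELAY generic map(SIZE=>20,STYLE=>"srl_reg")
                           port map(CLK=>CLK,I=>I,O=>O_SRL);
  -- deep delay, ask for a Block RAM based implementation
  u_bram:entity work.SDELAY generic map(SIZE=>1000,STYLE=>"block")
                            port map(CLK=>CLK,I=>I,O=>O_BRAM);
end TEST;
```

The three instances share the same SDELAY source and differ only in their generic maps, which is the point of moving the style selection out of the module.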

This looks like the best solution, but there are still a number of issues. For example, for SIZE 17 or less, when SRL16s are being used, we may want to have two of them mapped into a single LUT6; otherwise we will use twice as many memory-capable LUT6es, and those can be a limited resource. For SIZE between 33 and 65 it is possible to use distributed RAM and map up to 64 delays into a single LUT6, which is also twice as efficient as using SRL32s. When we select the "block" style we get a BRAM implementation even when a distributed RAM or a URAM could be a better choice, and so on.

To address any of these issues we have to use low level coding and instantiate primitives, or use more explicit behavioral code. We start with the existing SDELAY implementation but we treat the cases where 2<SIZE<18 and 33<SIZE<258 as exceptions (by the way, this is VHDL-2008 if/generate syntax and it will give you an error if you compile it as VHDL-93). Now we can map two SRL16s per LUT6 and use distributed RAM instead of SRL32s where it makes sense. In both cases we reduce the LUT6 count by a factor of 2 compared with what the synthesis tool infers from behavioral code, which could become very significant if we use a lot of such SDELAYs:

library IEEE; 
use IEEE.STD_LOGIC_1164.all; 
use IEEE.NUMERIC_STD.ALL;  
use work.TYPES_PKG.all;  

library UNISIM; 
use UNISIM.VComponents.all;  

entity SDELAY is
  generic(SIZE:NATURAL:=1;    -- SIZE has a default value of 1 and cannot be negative, this would require traveling back in time
          STYLE:STRING:="srl_reg");  -- acceptable values are "register", "srl", "srl_reg", "reg_srl", "reg_srl_reg" or "block"
  port(CLK:in STD_LOGIC;
       I:in SFIXED;
       O:out SFIXED); 
end SDELAY;  

architecture TEST of SDELAY is
  attribute rloc:STRING; 
begin 
  l17:if SIZE>=3 and SIZE<18 generate -- pack two SRL16s per LUT6 by forcing 8 or 16 of them to be placed in the same slice
        signal iO:SFIXED(I'range):=(others=>'0');
      begin 
        lk:for K in 0 to I'length-1 generate
             signal A:UNSIGNED(3 downto 0);
             signal Q:STD_LOGIC;
             signal RQ:STD_LOGIC:='0';
             attribute rloc of sr:label is "X0Y"&INTEGER'image(K/16); -- use K/8 for 7-series or earlier families, K/16 for UltraScale and UltraScale+
           begin 
             A<=TO_UNSIGNED(SIZE-2,A'length);
             sr:SRL16E port map(CLK=>CLK,
                                CE=>'1',
                                A0=>A(0),
                                A1=>A(1),
                                A2=>A(2),
                                A3=>A(3),
                                D=>I(I'low+K),
                                Q=>Q);
             process(CLK)
             begin
               if rising_edge(CLK) then 
                 RQ<=Q;
               end if;
             end process;
             iO(iO'low+K)<=RQ;
           end generate;
         O<=RESIZE(iO,O);
      end;
  elsif l257: SIZE>=34 and SIZE<258 generate -- infer a distributed RAM implementation
        signal MEM:SFIXED_VECTOR(0 to SIZE-2)(I'range):=(others=>(others=>'0'));
        signal A:UNSIGNED(LOG2(SIZE-1)-1 downto 0):=(others=>'0');
        signal iO:SFIXED(I'range):=(others=>'0');
        attribute ram_style:STRING;
        attribute ram_style of MEM:signal is "distributed";
      begin
        process(CLK)
        begin
          if rising_edge(CLK) then
            if A=SIZE-2 then
              A<=(others=>'0');
            else 
              A<=A+1;
            end if;
            MEM(TO_INTEGER(A))<=I;
            iO<=MEM(TO_INTEGER(A));
          end if;
        end process;
        O<=RESIZE(iO,O);
      end;
   else generate -- otherwise use the behavioral implementation and the STYLE generic
        signal D:SFIXED_VECTOR(0 to SIZE)(I'range):=(others=>TO_SFIXED(0.0,I)); -- delay line signal is SIZE+1 in length
        attribute srl_style:STRING; attribute srl_style of D:signal is STYLE;
      begin 
        D(D'low)<=I;
        process(CLK)
        begin
          if rising_edge(CLK) then
            for K in 1 to SIZE loop 
              D(K)<=D(K-1); -- when SIZE=0 this is never executed and we get no registers, just a wire between I and O
            end loop;
          end if;
        end process;
        O<=D(D'high);
      end;
    end generate; 
end TEST;

We have achieved the mapping of 2 SRL16s per LUT6 by forcing 8 of them (or 16 for the UltraScale and UltraScale+ FPGA families) to be placed in the same slice, which contains 4 or 8 LUT6es respectively. Similarly, for the 33<SIZE<258 case we told the synthesis tool to use distributed RAM instead of SRL32s or BRAMs, but we had to explicitly code the actual counter+RAM implementation because distributed RAM is not a valid option of the SRL_STYLE attribute and inference would not work in this particular case.

These lower level implementations for the two special cases are more complicated than the much cleaner behavioral version but we can hide all the gory details inside the SDELAY module and its use remains virtually the same as the fully behavioral case.
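One way to gain some confidence that every branch really delays by SIZE clocks is a small impulse-response testbench. The following is only a sketch under stated assumptions: SDELAY_TB, DONE, the default SIZE value and the SFIXED(7 downto -8) range are hypothetical, SIZE must be at least 1 for this check, and the UNISIM simulation library has to be available for the SRL16E branch:

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.all;
use work.types_pkg.all; -- the author's package providing SFIXED

entity SDELAY_TB is -- hypothetical testbench, for illustration only
  generic(SIZE:POSITIVE:=40); -- try for example 2, 10, 40 and 300 to exercise each branch
end SDELAY_TB;

architecture TEST of SDELAY_TB is
  signal CLK:STD_LOGIC:='0';
  signal DONE:BOOLEAN:=FALSE;
  signal I,O:SFIXED(7 downto -8):=(others=>'0'); -- assumed range
begin
  CLK<=not CLK after 5 ns when not DONE else '0'; -- clock stops once the test is done

  uut:entity work.SDELAY generic map(SIZE=>SIZE) -- STYLE keeps its default value
                         port map(CLK=>CLK,I=>I,O=>O);

  process
  begin
    wait until rising_edge(CLK);
    I(0)<='1';                   -- one clock wide impulse on bit 0
    wait until rising_edge(CLK); -- SDELAY samples the impulse on this edge
    I(0)<='0';
    for K in 1 to SIZE-1 loop    -- the output must stay low for the next SIZE-1 edges
      wait until rising_edge(CLK);
      assert O(0)='0' report "impulse arrived too early" severity failure;
    end loop;
    wait until rising_edge(CLK); -- here we see the output produced SIZE edges after sampling
    assert O(0)='1' report "impulse did not arrive after SIZE clocks" severity failure;
    DONE<=TRUE;
    wait;
  end process;
end TEST;
```

The same testbench can be run against the fully behavioral version, since it only uses the SIZE generic and the CLK, I and O ports.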

There are two important lessons here. First, leverage the synthesis tool's behavioral inference mechanism as much as possible, but when you do not get the expected or desired result do not fight with the tools; that's a game you cannot win. If gentle hints through attributes and coding style do not work, stop pushing and handle the synthesis, and if needed the logic mapping and even the placement, yourself. Second, if you have to go this route, try to still keep the code as generic and reusable as possible by hiding the implementation complexities in lower level modules. This is how cake should be eaten.
