The Universal MUX Building Block Part 2
So the question is now what is the most efficient implementation for arbitrary size multiplexers one should expect? If the result the synthesis tools infers from behavioral code is equal or very close to this there is no need for a specialized MUX Building Block. If the difference is significant then there will be a definite need for such a block, especially for designs with large muxes and/or many of them.
To simplify the analysis we will focus on multiplexers with an input size that is a power 2, this is usually the case in most designs. For the odd cases when the number of mux inputs is not a power of two the usual approach is to create the smallest power of 2 input mux that is larger than what we need, connect the unused inputs to logic zero and let the synthesis tool optimize out the unused logic. We will also assume for simplicity that the data we are muxing (the size of the output port) is one bit, anything wider can be constructed from multiple parallel instantiations of a single bit mux.
Every fabric slice contains 8 LUT6es (or 4 in 7-series FPGAs) and each LUT6 can implement two 2:1 muxes with a shared SEL input, which maps exactly inside a LUT6_2, or one 4:1 mux, a 6-input logic function which maps neatly into a LUT6. So for the cases of 2 and 4 inputs the answer is simple, the optimal utilization is 1/2 LUT6es respectively 1 LUT6 per data bit.
If we were to implement an 8-input mux using just LUT6es, the only way to do that is to decompose it into smaller 2:1 and 4:1 ones. So it looks like we need 2.5 LUT6es/bit, two 4:1 muxes followed by a 2:1 mux. The other possible decomposition, four 2:1 muxes followed by a 4:1 one is worse, requiring 3 LUT6es/bit. A 16-input mux would consist of two stages of 4:1 muxes, four in the first stage and one in the second stage, so 5 LUT6es/bit and so on.
However, the XIlinx FPGA slices have a dedicated resource called FxMUX (F7MUX, F8MUX and F9MUX) which is essentially free to use and can be used for more efficient implementations of larger muxes, as well as arbitrary logic functions of more than 6 inputs. The following diagrams taken from the UG574 UltraScale Architecture Configurable Logic Block User Guide - a must read for anyone trying to understand the FPGA architecture for the purpose of creating very efficient designs - shows 16 and 32 input muxes implemented with a combination of LUT6es and FxMUXes:
An optimal 8-input mux implementation would need 2 LUT6es and F7MUX, so 2 LUT^es/bit since the FxMUXes are essentially free. A 16-input mux would similarly require 4 LUT6es and a 32-input mux 8 LUT6es. So using the FxMUX resources will lead to both smaller (less LUT6es) and faster (fewer logic levels) designs. For example, a 1024-input mux that does not use the FxMUX resource would have 5 stages of 4:1 muxes, with 256, 64, 16, 4 and 1 LUT6es in each stage, for a total of 341 LUT6es and 5 logic levels. The optimal implementation would consist of only two stages of 32:1 muxes with 32 of them in the first stage and one in the second stage. The total LUT6 count in this case is 32*8+8=264 LUTs and only two logic levels. The optimal implementation is 1.29x smaller and 2.5x faster, 3.2x better than the naive implementation which is quite significant. The optimal numbers for various 1-bit muxes with a power of 2 number of inputs is summarized in this table:
N | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|
LUT6es | 0.5 | 1 | 2 | 4 | 8 | 16.5 | 33 | 66 | 132 | 264 |
Logic Levels | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 |
We need to find out now if Vivado synthesis will infer these optimal mux structures from the behavioral code we introduced in the previous post. For smaller muxes you get indeed the expected result, however, for larger muxes the result is not optimal. The code I used for this test is a slightly modified version of the earlier one, with a generic parameter SIZE and input and output registers to make an estimation of the design speed possible:
library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD.all;
entity MUX is generic(SIZE:INTEGER:=6); – the mux has N=2**SIZE inputs
port(CLK:in STD_LOGIC; I:in STD_LOGIC_VECTOR(2**SIZE-1 downto 0); – unconstrained vector of STD_LOGIC
SEL:in UNSIGNED(SIZE-1 downto 0); – unconstrained size selection port
O:out STD_LOGIC);
end MUX;
architecture TEST of MUX is
signal RI:STD_LOGIC_VECTOR(I'range):=(others=>'0');
signal RSEL:UNSIGNED(SEL'range):=(others=>'0');
begin
assert (I'length<=2**I'length) and (I'length<2**(SEL'length+1)) report "Ports I and SEL have inconsistent sizes!" severity warning; – severity level can be note, warning or error
process(CLK)
begin
if rising_edge(CLK) then
RI<=I; RSEL<=SEL; O<=RI(TO_INTEGER(RSEL)+I'low); – I is not necessarily 0 based so we add I'low to SEL
end if;
end process;
end TEST;
The 1024-input mux for example (SIZE=10, N=1024) uses 273 LUT6es, 136 F7 muxes and 68 F8 muxes instead of the expected 264 LUT6es, 132 F7 muxes, 66 F8 muxes and 33 F9 muxes (33 full 32:1 muxes, each one using 8 LUT6es, 4 F7 muxes, 2 F8 muxes and 1 F9 mux). The suboptimal result is due to Vivado synthesis inferring F7MUX and F8MUX primitives but no F9MUXes. In a -2 speed grade Kintex UltraScale+ XCKU5P this design closes timing at 720MHz. The critical path has 3 logic levels instead of the expected 2.
In conclusion, for muxes with 16 inputs or less behavioral inference is good enough, it produces optimal results and you can describe such a mux in one line of HDL code, there is no need to create design hierarchy or instantiate primitives. For larger muxes however, especially if total utilization (many instances of large muxes) and speed (minimizing the number of logic levels) matters than the development of a Universal MUX Building Block that produces optimal results is justified. While the path we took earlier with the Universal Delay of instantiating LUT6 and other primitives would be an option, the mux block is a good opportunity for introducing a new and powerful design concept, recursive component instantiation in VHDL.
One final point that needs to be made here about the FxMUXes not related to multiplexing is that they can be used to implement arbitrary logic functions of more than 6 inputs. Logic functions of 6 or less inputs will always fit inside a LUT6 and in some cases two smaller logic functions can fit in a single LUT6_2 as long as the total number of distinct inputs, not counting the shared ones twice is 5 or less. Logic functions of 7 inputs can be implemented with 2 LUT6es and one F7MUX, 8 inputs with 4 LUT6es, 2 F7MUXes and one F8MUX and 9 inputs with 8 LUT6es, 4 F7MUXes, 2 F8MUXes and one F9MUX. All these structures are small and fast, with only one logic level. You can also think of them as 64x1, 128x1, 256x1 and 512x1 ROMs done with 1, 2, 4 respectively 8 LUT6es. When the LUT6es are configured as SRL162/SRL32s or distributed RAM these same FxMUXes are used to build larger shift registers or memories out of the smaller LUT6 block in a similar way.
In the next post I will conclude the series dedicated to the Universal MUX Building Block by presenting a generic and reusable mux implementation producing optimal results for any size, which will use this new idea of recursive component instantiation. While a generic mux is the simplest basic function for which recursion makes sense, there are a number of design situations where this concept can be used to create higher level HDL code.
Back to the top: The Art of FPGA Design