The Universal MUX Building Block Part 3, the one with the Dutch Cocoa Box and the Ouroboros
We have seen in the previous post that Vivado Synthesis is able to optimally infer a mux form behavioral code for multiplexers with up to 16 inputs, but beyond that not so much. The synthesis results are not bad but for high performance designs where every LUT and especially every logic level counts not bad is not good enough.
So in this post I will present a solution to this problem, that is both optimal, generic and reusable and very elegant due to the use of recursion. Functions that recursively call themselves are pretty common for software developers, but in hardware land components that instantiate themselves are a rara avis, "a bird as rare upon the earth as a black swan".
So here is the plan, first of all we will modify the generic MUX from last post so that it also has a generic WIDTH, not just a generic number N of inputs. The N inputs each one of WIDTH bits are concatenated together into an STD_LOGIC_VECTOR input port I of length N*WIDTH while the O port is now an STD_LOGIC_VECTOR of length WIDTH. This module is even more generic than the previous single bit version, which simply becomes the case WIDTH=1.
Secondly, we will resize the input port I so that the number of inputs is increased to the closest power of 2 that is greater or equal to N. We do this by zero padding the MSBs of the extended I port - the synthesis tool will remove the unused logic and we do not waste any resources by doing that. The advantage is that now we only have to handle the cases where N is a power of 2 and do not have to worry about muxes with 37 inputs for example. If N is already a power of two then this step has no effect.
If N is 16 or less we will use the one line behavioral implementation since the synthesis tool already gives us the optimal result, no need to micro-manage what already works well. If N is 32 we divide the mux into a stage with two 16:1 muxes, followed by a MUXF9 primitive instantiation. We need to do this because the synthesis tool does not infer MUXF9 primitives. We create the first stage muxes by recursively instantiating the MUX component itself, the very thing we are describing in this code!
For the N equal or greater than 64 case we divide the larger mux into two smaller stages and recursively instantiate MUX itslef two times. So our generic and recursive MUX module now looks like this:
library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD.all;
library UNISIM;
use UNISIM.vcomponents.all;
entity MUX is
port(I:in STD_LOGIC_VECTOR; -- unconstrained size input port, I'length must be a multiple of O'length
SEL:in UNSIGNED; -- unconstrained size selection port
O:out STD_LOGIC_VECTOR); -- unconstrained size output port
end MUX;
architecture RECURSIVE of MUX is
constant WIDTH:INTEGER:=O'length; -- this is the data width
constant N:INTEGER:=2**SEL'length; -- number of inputs extended to the next power of 2
signal II:STD_LOGIC_VECTOR(N*WIDTH-1+I'low downto I'low);
begin
assert I'length mod O'length=0 report "I'length must be a multiple of O'length!" severity warning;
assert (2**SEL'length<=I'length/O'length) and (I'length/O'length<2**(SEL'length+1)) report "Ports I, O and SEL have inconsistent sizes!" severity warning; – severity level can be note, warning or error
II<=STD_LOGIC_VECTOR(RESIZE(UNSIGNED(I),II'length)); -- pad I with zeroes on the MSP side
-- use behavioral code for sizes for which the synthesis tool infers the right result, 2:1 to 16:1 in this case
l2:if SEL'length<5 generate
O<=II(WIDTH*TO_INTEGER(SEL)+WIDTH-1+II'low downto WIDTH*TO_INTEGER(SEL)+II'low); -- I and O are not necessarily 0 based so we have to compensate for that
end generate;
-- instantiate an F9MUX explicitly for the 32:1 case
l5:if SEL'length=5 generate
signal M:STD_LOGIC_VECTOR(2*WIDTH-1 downto 0);
begin
ml:entity work.MUX port map(I=>II,SEL=>SEL(SEL'high downto SEL'low+1),O=>M);
lk:for K in 0 to WIDTH-1 generate
f9:MUXF9 port map (I0=>M(K),I1=>M(WIDTH+K),S=>SEL(SEL'low),O=>O(K+O'low));
end generate;
end generate;
-- for sizes 64:1 and larger use recursive instantiation
ln:if SEL'length>5 generate
constant MID:INTEGER:=(SEL'length-1) mod 5+1;
signal M:STD_LOGIC_VECTOR((2**MID)*WIDTH-1 downto 0);
begin
ml:entity work.MUX port map(I=>II,SEL=>SEL(SEL'high downto SEL'low+MID),O=>M);
mh:entity work.MUX port map(I=>M,SEL=>SEL(SEL'low+MID-1 downto SEL'low),O=>O);
end generate;
end RECURSIVE;
So now all ports of the MUX module are unconstrained and the generic parameters are implicit. the data width is WIDTH=O'length, the number of mux inputs is N=2**SEL'length (always a power of 2), the actual number of inputs is I'length/WIDTH. II is the zero padded version of I.
This code looks deceptively simple, if you ignore the l5 case, which is only required to fix a synthesis tool limitation and do not count all the generate statements there are only three lines of VHDL code that actually generate hardware, the behavioral mux description in l2 and the two recursive MUX instantiations in ln. With these three lines of code we can describe a multiplexer with any number of inputs N, any data WIDTH and which is guaranteed to be optimal both in terms of area (LUT6 count) and speed (number of logic levels).
As an example of how this generic MUX module could be used, here is an example of a 1024:1 mux for 16-bit data words:
library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD.all;
entity MUX_TEST is
generic(SIZE:INTEGER:=10; -- the mux has N=2**SIZE inputs
WIDTH:INTEGER:=16); -- the mux has a WIDTH-bit output
port(CLK:in STD_LOGIC;
I:in STD_LOGIC_VECTOR((2**SIZE)*WIDTH-1 downto 0);
SEL:in UNSIGNED(SIZE-1 downto 0);
O:out STD_LOGIC_VECTOR(WIDTH-1 downto 0));
end MUX_TEST;
architecture TEST of MUX_TEST is
signal RI:STD_LOGIC_VECTOR(I'range):=(others=>'0');
signal RSEL:UNSIGNED(SEL'range):=(others=>'0');
signal RO:STD_LOGIC_VECTOR(O'range):=(others=>'0');
begin
-- input registers
process(CLK)
begin
if rising_edge(CLK) then
RI<=I; RSEL<=SEL;
end if;
end process;
-- this is the actual generic mux instantiation, any number of inputs, any data width
mx:entity work.MUX port map(I=>RI, SEL=>RSEL, O=>RO);
-- output registers
process(CLK)
begin
if rising_edge(CLK) then
O<=RO;
end if;
end process;
end TEST;
The synthesis result uses the expected 4224 LUTs and since there are only two logic levels it runs easily at speeds of over 600MHz. By changing the SIZE and WIDTH generics you can easily test multiplexers of any size and width. The input and output registers are there just to measure the fMAX of the otherwise combinatorial MUX block.
Do not let the apparent simplicity of this design deceive you, this is not trivial, recursion never is. The best way to understand it and wrap your mind around it is to actually synthesize and implement this code with various SIZE and DEPTH generic values and then use the excellent Vivado schematic viewer to inspect the results.
This generic MUX building block is a very useful component that can be reused in many places, beyond the obvious operation of selecting on input out of many. Barrel shifters are just muxes, a lot of them. For example a 32-bit barrel shifter is just 32 32:1 muxes. Cross-bar switches are another example of more complex blocks that can be reduced to a lot of muxes. Having a generic and reusable MUX building block that always produces optimal synthesis results can help a lot in the design of more complicated things.
This MUX block is also the simplest example that I can find of a case where recursive component instantiation makes sense. But the list of hardware designs where recursion can be used effectively is long: priority encoders, adder trees, FFTs, parallel sorting networks, etc.
As for the title of this post and its relevance to Art, FPGAs and Design, please see these links:
Back to the top: The Art of FPGA Design