Sorting Networks - The VHDL implementation of Batcher's sorting algorithm
OK, enough with the preliminaries, it's time now for the real deal, how do we implement in an FPGA this parallel sorting network thing. The main design goals are creating a generic, reusable and efficient implementation. We want to be able to implement a sorting network of any size N with a single piece of code, we want to be able to sort all kinds of data, not just integers and we want something close in size to the best known optimal sorting networks. Of course, we also want to run as fast as possible, at least at 500MHz if not faster, sorting one set of N elements every clock. It looks like a pretty tall order so let's try to do to it.
We will start with a VHDL package, which defines the basic data type,as well as an unconstrained array of such abstract type and a function to compare two such elements and decide which one is "smaller". This is VHDL-2008 syntax, these definitions will not work in a VHDL-93 environment. We will use unconstrained UNSIGNED as the basic type and the "<" or "less than" comparison function as an example but in theory we are free to use a wide range of basic types and comparison functions here. To sort different kinds of items other than the UNSIGNED(7 downto 0) I am using as an example you would have to make changes here. This package also includes some helper functions to calculate the latencies of various blocks, which will come handy later on. They will be used to equalize the latencies of slightly different parallel paths of data when they will need to be merged together:
use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD.all;
package SORTER_PKG is
subtype T is UNSIGNED(7 downto 0); -- define your own base type T here
constant T_ZERO:T:=(others=>'0'); -- depends on actual type T
function "<"(L,R:T) return BOOLEAN; -- define compare function for type T
type T_VECTOR is array(INTEGER range <>) of T; -- unconstrained vector of abstract base type T
function MERGE_LATENCY(N,CE_LATENCY:INTEGER) return INTEGER; -- the latency of a sorting network of size N, given the latency of the COMP_EXCH block
function SORT_LATENCY(N,CE_LATENCY:INTEGER) return INTEGER; -- the latency of a merge operation of size N, given the latency of the COMP_EXCH block
end SORTER_PKG;
package body SORTER_PKG is
function "<"(L,R:T) return BOOLEAN is
begin
return IEEE.NUMERIC_STD."<"(L,R); -- to avoid infinite recursion loop when "<" is already defined for T
end;
function SORT_LATENCY(N,CE_LATENCY:INTEGER) return INTEGER is
begin
case N is
when 1=>return 0;
when 2=>return CE_LATENCY;
when others=>return SORT_LATENCY((N+1)/2,CE_LATENCY)+MERGE_LATENCY(N,CE_LATENCY);
end case;
end;
function MERGE_LATENCY(N,CE_LATENCY:INTEGER) return INTEGER is
begin
case N is
when 1=>return 0;
when 2=>return CE_LATENCY;
when others=>return MERGE_LATENCY((N+1)/2-1,CE_LATENCY)+CE_LATENCY;
end case;
end;
end SORTER_PKG;
The next step up the bottom-up design ladder is a "compare and exchange" component. COMP_EXCH has two inputs and two outputs and either swaps or passes the two inputs to the outputs so that they are always in ascending order. Another way to look at COMP_EXCH is as a MIN(I0,I1) function driving the first output and a MAX(I0,I1) driving the second output but the best hardware implementation uses a single comparator and two 2:1 muxes. There is also a generic to control the latency of this building block, we can get a combinatorial implementation if we want or introduce one or more register levels for pipelining to increase throughput. Generic DELAY and BDELAY blocks are used for this purpose, of the kind we discussed in posts 8 and 9. You can think of this COMP_EXCH block as the solution for our sorting network problem for the very particular and easy case of N=2:
use IEEE.STD_LOGIC_1164.ALL;
use work.SORTER_PKG.all;
entity COMP_EXCH is
generic(LATENCY:INTEGER:=0);
port(CLK:in STD_LOGIC:='0';
I0,I1:in T;
VI:in BOOLEAN:=TRUE;
O:out T_VECTOR(0 to 1);
VO:out BOOLEAN);
end COMP_EXCH;
architecture TEST of COMP_EXCH is
signal I:T_VECTOR(O'range);
begin
I<=(I1,I0) when I1<I0 else (I0,I1);
dl:entity work.DELAY generic map(SIZE=>LATENCY)
port map(CLK=>CLK,
I=>I,
O=>O);
bd:entity work.BDELAY generic map(SIZE=>LATENCY)
port map(CLK=>CLK,
I=>VI,
O=>VO);
end TEST;
The next step is a generic MERGE block. This takes an input of an arbitrary size N, with the lower and upper halves already sorted and merges them into a fully sorted output. This is a recursive component definition, defined in terms of smaller instances of itself, with the recursion stopping at N=1, when MERGE simply passes its one input to the output and N=2 when MERGE reduces to one COMP_EXCH block:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use work.SORTER_PKG.all;
-- MERGE takes two sorted arrays concatenated as an argument and returns a sorted array as a result
-- if N is even the two sorted arrays are of equal lengths
-- if N is odd the first sorted array is one element longer than the second one
entity MERGE is
generic(CE_LATENCY:INTEGER:=0);
port(CLK:in STD_LOGIC:='0';
I:in T_VECTOR;
VI:in BOOLEAN:=TRUE;
O:out T_VECTOR;
VO:out BOOLEAN);
end MERGE;
architecture TEST of MERGE is
constant N:NATURAL:=I'length;
constant M:INTEGER:=N/2-(N-1) mod 4/3; -- N/2-1 when N mod 4=0 else N/2
-- EVEN returns an array composed of the even elements of the argument (the first element of I is always considered even)
function EVEN(I:T_VECTOR) return T_VECTOR is
constant N:NATURAL:=I'length;
variable Result:T_VECTOR(I'low to I'low+(N+1)/2-1);
begin
for K in Result'range loop
Result(K):=I(I'low+2*(K-Result'low));
end loop;
return Result;
end;
-- ODD returns an array composed of the odd elements of the argument (the first element of I is always considered even)
function ODD(I:T_VECTOR) return T_VECTOR is
constant N:NATURAL:=I'length;
variable Result:T_VECTOR(I'low to I'low+N/2-1);
begin
for K in Result'range loop
Result(K):=I(I'low+2*(K-Result'low)+1);
end loop;
return Result;
end;
begin
assert I'length=O'length report "Ports I and O must have the same lenght!" severity error;
i1:if N=1 generate
O<=I;
VO<=VI;
end generate;
i2:if N=2 generate
cx:entity work.COMP_EXCH generic map(LATENCY=>CE_LATENCY)
port map(CLK=>CLK,
I0=>I(I'low),
I1=>I(I'high),
VI=>VI,
O=>O,
VO=>VO);
end generate;
i3:if N>2 generate
signal IL,ML:T_VECTOR(I'low to I'low+(N+1)/2-1+(N+1) mod 4/3);
signal IH,MH,MHD:T_VECTOR(I'low+(N+1)/2+(N+1) mod 4/3 to I'high);
signal V:T_VECTOR(I'range);
signal VM:BOOLEAN;
begin
IL<=EVEN(I(I'low to I'low+(N+1)/2-1))&EVEN(I(I'low+(N+1)/2 to I'high));
lm:entity work.MERGE generic map(CE_LATENCY=>CE_LATENCY)
port map(CLK=>CLK,
I=>IL,
VI=>VI,
O=>ML,
VO=>VM);
IH<=ODD(I(I'low to I'low+(N+1)/2-1))&ODD(I(I'low+(N+1)/2 to I'high));
hm:entity work.MERGE generic map(CE_LATENCY=>CE_LATENCY)
port map(CLK=>CLK,
I=>IH,
VI=>VI,
O=>MH,
VO=>open);
dl:entity work.DELAY generic map(SIZE=>MERGE_LATENCY(ML'length,CE_LATENCY)-MERGE_LATENCY(MH'length,CE_LATENCY))
port map(CLK=>CLK,
I=>MH,
O=>MHD);
V<=ML&MHD;
d0:entity work.DELAY generic map(SIZE=>CE_LATENCY)
port map(CLK=>CLK,
I=>V(V'low to V'low),
O=>O(O'low to O'low));
lk:for K in 1 to (N+1)/2-1 generate
cx:entity work.COMP_EXCH generic map(LATENCY=>CE_LATENCY)
port map(CLK=>CLK,
--CE=>CE,
I0=>V(V'low+K),
I1=>V(V'low+K+M),
O=>O(O'low+2*K-1 to O'low+2*K));
end generate;
i0:if N mod 4=0 generate
dl:entity work.DELAY generic map(SIZE=>CE_LATENCY)
port map(CLK=>CLK,
I=>V(V'high to V'high),
O=>O(O'high to O'high));
end generate;
i2:if N mod 4=2 generate
dl:entity work.DELAY generic map(SIZE=>CE_LATENCY)
port map(CLK=>CLK,
I=>V(V'low+M to V'low+M),
O=>O(O'high to O'high));
end generate;
v0:entity work.BDELAY generic map(SIZE=>CE_LATENCY)
port map(CLK=>CLK,
I=>VM,
O=>VO);
end generate;
end TEST;
Now we can see some light at the end of the tunnel, we can introduce the the final top level MERGE_SORT block which turns out to be the simplest of all. It is also recursively defined in terms two smaller instances of itself plus the MERGE block we saw earlier. Here again the recursion stops at N=1 where MERGE_SORT just forwards the input to the output and N=2 where MERGE_SORT is one COMP_EXCH instance, as we know already that COMP_EXCH is a sorting network of size 2:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use work.SORTER_PKG.all;
-- MERGE_SORT takes an array as argument and returns the sorted array as a result
-- if N is even the array is divided into two equal parts
-- if N is odd the first part is one element longer than the second part
entity MERGE_SORT is
generic(CE_LATENCY:INTEGER:=0);
port(CLK:in STD_LOGIC:='0';
I:in T_VECTOR;
VI:in BOOLEAN:=TRUE;
O:out T_VECTOR;
VO:out BOOLEAN);
end MERGE_SORT;
architecture TEST of MERGE_SORT is
constant N:NATURAL:=I'length;
begin
assert I'length=O'length report "Ports I and O must have the same lenght!" severity error;
i1:if N=1 generate
O<=I;
VO<=VI;
end generate;
i2:if N=2 generate
cx:entity work.COMP_EXCH generic map(LATENCY=>CE_LATENCY)
port map(CLK=>CLK,
I0=>I(I'low),
I1=>I(I'high),
VI=>VI,
O=>O,
VO=>VO);
end generate;
i3:if N>2 generate
signal SL:T_VECTOR(I'low to I'low+(N+1)/2-1);
signal SH,SHD:T_VECTOR(I'low+(N+1)/2 to I'high);
signal S:T_VECTOR(I'range);
signal VS:BOOLEAN;
begin
ml:entity work.MERGE_SORT generic map(CE_LATENCY=>CE_LATENCY,
SIZE=>(N+1)/2)
port map(CLK=>CLK,
I=>I(I'low to I'low+(N+1)/2-1),
VI=>VI,
O=>SL,
VO=>VS);
mh:entity work.MERGE_SORT generic map(CE_LATENCY=>CE_LATENCY,
SIZE=>I'length-(N+1)/2)
port map(CLK=>CLK,
I=>I(I'low+(N+1)/2 to I'high),
VI=>VI,
O=>SH,
VO=>open);
dl:entity work.DELAY generic map(SIZE=>SORT_LATENCY(SL'length,CE_LATENCY)-SORT_LATENCY(SH'length,CE_LATENCY))
port map(CLK=>CLK,
I=>SH,
O=>SHD);
S<=SL&SHD;
me:entity work.MERGE generic map(CE_LATENCY=>CE_LATENCY)
port map(CLK=>CLK,
I=>S,
VI=>VS,
O=>O,
VO=>VO);
end generate;
end TEST;
So to recapitulate, we have created a parallel sorting network based on the Batcher or odd-even mergesort sorting algorithm, which divides the input set into two halves, sorts them separately (and recursively) and then merges them into a single sorted output. While not optimal in the sense of the least number of COMP_EXCH steps, mergesort is pretty close and more than good enough for the practical sizes one will encounter in hardware implementations. For this small price we paid in inefficiency we get a parameterizable implementation that can implement a sorting network of arbitrary size N (remember that we do not even know what the optimal sorting networks for sizes greater than N=10 are and there are no known algorithms to generate them programmatically). We also kept the code as generic as possible, with an arbitrary base type and a user defined comparison function, as well as a generic to control the latency of the entire thing, from a purely combinatorial implementation to a fully pipelined, very high speed one.
You may think that we have finally put this matter of sorting networks to rest but there a few things left. Since this post got quite long, next week we will look at how to instantiate this generic sorting network module and what kind of performance in terms of size and speed we can expect. Maybe we can improve on the design a bit too. Then the week after that we will try to answer a very difficult question - how do we know that this design actually works, especially for any arbitrary size N? Maybe it works for some smaller sizes but fails at larger ones, or maybe it sorts almost all possible input combinations but fails at some particular corner case, like already sorted or reverse sorted inputs. This is not a trivial question and it is a problem as interesting as the sorting network itself. We cannot move on to other subjects before we look into this design verification problem for sorting networks in more detail.
Back to the top: The Art of FPGA Design
Top Comments