The Art of FPGA Design - Post 34

7 Jul 2019

Sorting Networks - The VHDL implementation of Batcher's sorting algorithm

OK, enough with the preliminaries, it's time now for the real deal, how do we implement in an FPGA this parallel sorting network thing. The main design goals are creating a generic, reusable and efficient implementation. We want to be able to implement a sorting network of any size N with a single piece of code, we want to be able to sort all kinds of data, not just integers and we want something close in size to the best known optimal sorting networks. Of course, we also want to run as fast as possible, at least at 500MHz if not faster, sorting one set of N elements every clock. It looks like a pretty tall order so let's try to do to it.

We will start with a VHDL package, which defines the basic data type,as well as an unconstrained array of such abstract type and a function to compare two such elements and decide which one is "smaller". This is VHDL-2008 syntax, these definitions will not work in a VHDL-93 environment. We will use unconstrained UNSIGNED as the basic type and the "<" or "less than" comparison function as an example but in theory we are free to use a wide range of basic types and comparison functions here. To sort different kinds of items other than the UNSIGNED(7 downto 0) I am using as an example you would have to make changes here. This package also includes some helper functions to calculate the latencies of various blocks, which will come handy later on. They will be used to equalize the latencies of slightly different parallel paths of data when they will need to be merged together:

library IEEE;

use IEEE.STD_LOGIC_1164.all;

use IEEE.NUMERIC_STD.all;

package SORTER_PKG is

subtype T is UNSIGNED(7 downto 0); -- define your own base type T here

constant T_ZERO:T:=(others=>'0'); -- depends on actual type T

function "<"(L,R:T) return BOOLEAN; -- define compare function for type T

type T_VECTOR is array(INTEGER range <>) of T; -- unconstrained vector of abstract base type T

function MERGE_LATENCY(N,CE_LATENCY:INTEGER) return INTEGER; -- the latency of a sorting network of size N, given the latency of the COMP_EXCH block

function SORT_LATENCY(N,CE_LATENCY:INTEGER) return INTEGER; -- the latency of a merge operation of size N, given the latency of the COMP_EXCH block

end SORTER_PKG;

package body SORTER_PKG is

function "<"(L,R:T) return BOOLEAN is

begin

return IEEE.NUMERIC_STD."<"(L,R); -- to avoid infinite recursion loop when "<" is already defined for T

end;

function SORT_LATENCY(N,CE_LATENCY:INTEGER) return INTEGER is

begin

case N is

when 1=>return 0;

when 2=>return CE_LATENCY;

when others=>return SORT_LATENCY((N+1)/2,CE_LATENCY)+MERGE_LATENCY(N,CE_LATENCY);

end case;

end;

function MERGE_LATENCY(N,CE_LATENCY:INTEGER) return INTEGER is

begin

case N is

when 1=>return 0;

when 2=>return CE_LATENCY;

when others=>return MERGE_LATENCY((N+1)/2-1,CE_LATENCY)+CE_LATENCY;

end case;

end;

end SORTER_PKG;

The next step up the bottom-up design ladder is a "compare and exchange" component. COMP_EXCH has two inputs and two outputs and either swaps or passes the two inputs to the outputs so that they are always in ascending order. Another way to look at COMP_EXCH is as a MIN(I0,I1) function driving the first output and a MAX(I0,I1) driving the second output but the best hardware implementation uses a single comparator and two 2:1 muxes. There is also a generic to control the latency of this building block, we can get a combinatorial implementation if we want or introduce one or more register levels for pipelining to increase throughput. Generic DELAY and BDELAY blocks are used for this purpose, of the kind we discussed in posts 8 and 9. You can think of this COMP_EXCH block as the solution for our sorting network problem for the very particular and easy case of N=2:

library IEEE;

use IEEE.STD_LOGIC_1164.ALL;

use work.SORTER_PKG.all;

entity COMP_EXCH is

generic(LATENCY:INTEGER:=0);

port(CLK:in STD_LOGIC:='0';

I0,I1:in T;

VI:in BOOLEAN:=TRUE;

O:out T_VECTOR(0 to 1);

VO:out BOOLEAN);

end COMP_EXCH;

architecture TEST of COMP_EXCH is

signal I:T_VECTOR(O'range);

begin

I<=(I1,I0) when I1<I0 else (I0,I1);

dl:entity work.DELAY generic map(SIZE=>LATENCY)

port map(CLK=>CLK,

I=>I,

O=>O);

bd:entity work.BDELAY generic map(SIZE=>LATENCY)

port map(CLK=>CLK,

I=>VI,

O=>VO);

end TEST;

The next step is a generic MERGE block. This takes an input of an arbitrary size N, with the lower and upper halves already sorted and merges them into a fully sorted output. This is a recursive component definition, defined in terms of smaller instances of itself, with the recursion stopping at N=1, when MERGE simply passes its one input to the output and N=2 when MERGE reduces to one COMP_EXCH block:

library IEEE;

use IEEE.STD_LOGIC_1164.ALL;

use work.SORTER_PKG.all;

-- MERGE takes two sorted arrays concatenated as an argument and returns a sorted array as a result

-- if N is even the two sorted arrays are of equal lengths

-- if N is odd the first sorted array is one element longer than the second one

entity MERGE is

generic(CE_LATENCY:INTEGER:=0);

port(CLK:in STD_LOGIC:='0';

I:in T_VECTOR;

VI:in BOOLEAN:=TRUE;

O:out T_VECTOR;

VO:out BOOLEAN);

end MERGE;

architecture TEST of MERGE is

constant N:NATURAL:=I'length;

constant M:INTEGER:=N/2-(N-1) mod 4/3; -- N/2-1 when N mod 4=0 else N/2

-- EVEN returns an array composed of the even elements of the argument (the first element of I is always considered even)

function EVEN(I:T_VECTOR) return T_VECTOR is

constant N:NATURAL:=I'length;

variable Result:T_VECTOR(I'low to I'low+(N+1)/2-1);

begin

for K in Result'range loop

Result(K):=I(I'low+2*(K-Result'low));

end loop;

return Result;

end;

-- ODD returns an array composed of the odd elements of the argument (the first element of I is always considered even)

function ODD(I:T_VECTOR) return T_VECTOR is

constant N:NATURAL:=I'length;

variable Result:T_VECTOR(I'low to I'low+N/2-1);

begin

for K in Result'range loop

Result(K):=I(I'low+2*(K-Result'low)+1);

end loop;

return Result;

end;

begin

assert I'length=O'length report "Ports I and O must have the same lenght!" severity error;

i1:if N=1 generate

O<=I;

VO<=VI;

end generate;

i2:if N=2 generate

cx:entity work.COMP_EXCH generic map(LATENCY=>CE_LATENCY)

port map(CLK=>CLK,

I0=>I(I'low),

I1=>I(I'high),

VI=>VI,

O=>O,

VO=>VO);

end generate;

i3:if N>2 generate

signal IL,ML:T_VECTOR(I'low to I'low+(N+1)/2-1+(N+1) mod 4/3);

signal IH,MH,MHD:T_VECTOR(I'low+(N+1)/2+(N+1) mod 4/3 to I'high);

signal V:T_VECTOR(I'range);

signal VM:BOOLEAN;

begin

IL<=EVEN(I(I'low to I'low+(N+1)/2-1))&EVEN(I(I'low+(N+1)/2 to I'high));

lm:entity work.MERGE generic map(CE_LATENCY=>CE_LATENCY)

port map(CLK=>CLK,

I=>IL,

VI=>VI,

O=>ML,

VO=>VM);

IH<=ODD(I(I'low to I'low+(N+1)/2-1))&ODD(I(I'low+(N+1)/2 to I'high));

hm:entity work.MERGE generic map(CE_LATENCY=>CE_LATENCY)

port map(CLK=>CLK,

I=>IH,

VI=>VI,

O=>MH,

VO=>open);

dl:entity work.DELAY generic map(SIZE=>MERGE_LATENCY(ML'length,CE_LATENCY)-MERGE_LATENCY(MH'length,CE_LATENCY))

port map(CLK=>CLK,

I=>MH,

O=>MHD);

V<=ML&MHD;

d0:entity work.DELAY generic map(SIZE=>CE_LATENCY)

port map(CLK=>CLK,

I=>V(V'low to V'low),

O=>O(O'low to O'low));

lk:for K in 1 to (N+1)/2-1 generate

cx:entity work.COMP_EXCH generic map(LATENCY=>CE_LATENCY)

port map(CLK=>CLK,

--CE=>CE,

I0=>V(V'low+K),

I1=>V(V'low+K+M),

O=>O(O'low+2*K-1 to O'low+2*K));

end generate;

i0:if N mod 4=0 generate

dl:entity work.DELAY generic map(SIZE=>CE_LATENCY)

port map(CLK=>CLK,

I=>V(V'high to V'high),

O=>O(O'high to O'high));

end generate;

i2:if N mod 4=2 generate

dl:entity work.DELAY generic map(SIZE=>CE_LATENCY)

port map(CLK=>CLK,

I=>V(V'low+M to V'low+M),

O=>O(O'high to O'high));

end generate;

v0:entity work.BDELAY generic map(SIZE=>CE_LATENCY)

port map(CLK=>CLK,

I=>VM,

O=>VO);

end generate;

end TEST;

Now we can see some light at the end of the tunnel, we can introduce the the final top level MERGE_SORT block which turns out to be the simplest of all. It is also recursively defined in terms two smaller instances of itself plus the MERGE block we saw earlier. Here again the recursion stops at N=1 where MERGE_SORT just forwards the input to the output and N=2 where MERGE_SORT is one COMP_EXCH instance, as we know already that COMP_EXCH is a sorting network of size 2:

library IEEE;

use IEEE.STD_LOGIC_1164.ALL;

use work.SORTER_PKG.all;

-- MERGE_SORT takes an array as argument and returns the sorted array as a result

-- if N is even the array is divided into two equal parts

-- if N is odd the first part is one element longer than the second part

entity MERGE_SORT is

generic(CE_LATENCY:INTEGER:=0);

port(CLK:in STD_LOGIC:='0';

I:in T_VECTOR;

VI:in BOOLEAN:=TRUE;

O:out T_VECTOR;

VO:out BOOLEAN);

end MERGE_SORT;

architecture TEST of MERGE_SORT is

constant N:NATURAL:=I'length;

begin

assert I'length=O'length report "Ports I and O must have the same lenght!" severity error;

i1:if N=1 generate

O<=I;

VO<=VI;

end generate;

i2:if N=2 generate

cx:entity work.COMP_EXCH generic map(LATENCY=>CE_LATENCY)

port map(CLK=>CLK,

I0=>I(I'low),

I1=>I(I'high),

VI=>VI,

O=>O,

VO=>VO);

end generate;

i3:if N>2 generate

signal SL:T_VECTOR(I'low to I'low+(N+1)/2-1);

signal SH,SHD:T_VECTOR(I'low+(N+1)/2 to I'high);

signal S:T_VECTOR(I'range);

signal VS:BOOLEAN;

begin

ml:entity work.MERGE_SORT generic map(CE_LATENCY=>CE_LATENCY,

SIZE=>(N+1)/2)

port map(CLK=>CLK,

I=>I(I'low to I'low+(N+1)/2-1),

VI=>VI,

O=>SL,

VO=>VS);

mh:entity work.MERGE_SORT generic map(CE_LATENCY=>CE_LATENCY,

SIZE=>I'length-(N+1)/2)

port map(CLK=>CLK,

I=>I(I'low+(N+1)/2 to I'high),

VI=>VI,

O=>SH,

VO=>open);

dl:entity work.DELAY generic map(SIZE=>SORT_LATENCY(SL'length,CE_LATENCY)-SORT_LATENCY(SH'length,CE_LATENCY))

port map(CLK=>CLK,

I=>SH,

O=>SHD);

S<=SL&SHD;

me:entity work.MERGE generic map(CE_LATENCY=>CE_LATENCY)

port map(CLK=>CLK,

I=>S,

VI=>VS,

O=>O,

VO=>VO);

end generate;

end TEST;

So to recapitulate, we have created a parallel sorting network based on the Batcher or odd-even mergesort sorting algorithm, which divides the input set into two halves, sorts them separately (and recursively) and then merges them into a single sorted output. While not optimal in the sense of the least number of COMP_EXCH steps, mergesort is pretty close and more than good enough for the practical sizes one will encounter in hardware implementations. For this small price we paid in inefficiency we get a parameterizable implementation that can implement a sorting network of arbitrary size N (remember that we do not even know what the optimal sorting networks for sizes greater than N=10 are and there are no known algorithms to generate them programmatically). We also kept the code as generic as possible, with an arbitrary base type and a user defined comparison function, as well as a generic to control the latency of the entire thing, from a purely combinatorial implementation to a fully pipelined, very high speed one.

You may think that we have finally put this matter of sorting networks to rest but there a few things left. Since this post got quite long, next week we will look at how to instantiate this generic sorting network module and what kind of performance in terms of size and speed we can expect. Maybe we can improve on the design a bit too. Then the week after that we will try to answer a very difficult question - how do we know that this design actually works, especially for any arbitrary size N? Maybe it works for some smaller sizes but fails at larger ones, or maybe it sorts almost all possible input combinations but fails at some particular corner case, like already sorted or reverse sorted inputs. This is not a trivial question and it is a problem as interesting as the sorting network itself. We cannot move on to other subjects before we look into this design verification problem for sorting networks in more detail.

Back to the top: The Art of FPGA Design

Top Comments

fpgaguru over 5 years ago in reply to DAB

Data is not sorted one item at a time, but all N items at once in parallel. Every clock a new sorting operation starts and the throughput is one N-element list sorted every clock. If the latency for a given N is L clocks, then at any one time L completely independent sorting operations propagate through the sorting network in various stages of completion. Every clock a new set of N items is fed into the sorter and L clocks later it comes out sorted on the other end. This is a parallel implementation of a sorting algorithm and it would be impossible to describe it in a timing diagram form. It helps to think in terms of throughput and latency here - throughput is 1, meaning one new entire sorting operation every clock and latency is given by the SORT_LATENCY function in the SORTING_PKG package for any given N.
- Cancel
- Vote Up +1 Vote Down
- Sign in to reply
- More
- Cancel
DAB over 5 years ago in reply to fpgaguru

I am curious on how the data cycles through the FPGA components as you add a new object into the algorithm to be sorted.

When I look at hardware implementations I find it useful to understand the data propagation through the gates.

I am assuming that the tools provide something similar to look at FPGA throughput effects.

DAB
- Cancel
- Vote Up +1 Vote Down
- Sign in to reply
- More
- Cancel
fpgaguru over 5 years ago in reply to DAB

There is not much to show in a timing diagram. A new set of unsorted input data is applied at the input every clock and comes out sorted at the other end after some number of clocks of latency which depends on the sorting network size N. The SORT_LATENCY function in the SORTER_PKG package gives you this latency based on N. You can sort new sets of inputs back to back every clock. What exactly would you want to see in a timing diagram?
- Cancel
- Vote Up +1 Vote Down
- Sign in to reply
- More
- Cancel
DAB over 5 years ago

Nice post.

Are you going to work through a timing diagram for this algorithm?

DAB
- Cancel
- Vote Up +1 Vote Down
- Sign in to reply
- More
- Cancel