Sorting Networks - The Batcher or odd-even mergesort sorting algorithm
Now that we have defined the FPGA design engineering problem - parallel sorting of a set of N items in one clock, that is one new sorting operation starting every clock - we need to chose a sorting algorithm.While ideal in terms of performance (number of compare-exchange operations respectively latency) optimal sorting networks are irregular and more importantly they cannot be generated programmatically for an arbitrary size N. Classic sequential O(N·log2N) sorting algorithms like QuickSort are not a good match for parallel hardware implementations.
A good choice turns out to be an O(N·(log2N)2) sorting algorithm called Batcher or odd-even merge sort https://en.wikipedia.org/wiki/Batcher_odd%E2%80%93even_mergesort. It may not be the most efficient for sorting very large sets of data but for the sizes we are talking about here where N is usually less than 100 it is quite good and for sizes less than 9 it produces results as good as the optimal sorting networks. More importantly, it can be easily defined recursively in terms of itself and so it lends very well to programmatic generation of sorting networks of arbitrary size.
Sorting algorithms can get pretty complicated so I will not provide here a formal definition or a proof of correctness, just a simple recursive definition of the algorithm so that we can later on get into the actual HDL coding and implementation. I will talk however in a later post about ways to check that the implementation works, including exhaustive testing testbenches for sorting networks.
The basic idea for the Batcher sorting algorithm is that when given an unsorted input list of N elements you divide it into two halves, you recursively sort the two halves using the Batcher algorithm, you concatenate the two results and then apply a final merge stage, which produces a sorted list of N items. If we call the actual sorting operation mergesort(I(1 to N)) then the sorted output will be:
O(1 to N)=mergesort(I(1 to N))=mergesort(merge(mergesort(1 to N/2)&mergesort(N/2+1 to N)))
Now all we need is a definition for merge(1 to N) and then we have a complete definition for mergesort(1 to N). To raise recursion to a new level, merge can also be recursively defined in terms of itself. Merge also splits its list of N inputs into two halves but along even and odd lines rather than first and second half like mergesort does. The same merge function is then applied to both even and odd halves and their outputs are concatenated back into an N-item list and fed into a stage of N/2-1 compare and exchange blocks, with the first and last item bypassing this comp_exch stage. Each one of the N/2-1 compare-exchanges takes two consecutive items in the list and swaps them if they are not in order, the result is a merged list of N items:
O(1 to N)=merge(1 to N)=comp_exch(merge(even(1 to N))&merge(odd(1 to N))) where every comp_exch block operates on two consecutive even and odd items in the list.
This is pretty deep, we have a function mergesort that is recursively defined in terms of itself plus another function merge, which in turn is recursively defined in terms of itself and comp_exch operations. In the end the entire mergesort function resolves only to comp_exchange operations so it is a sorting network. All recursive definitions need a way to terminate the recursion and when N=1 both mergesort(I(1 to 1) and merge(I(1 to 1)) are trivial operations, just passing in the one input data to the output. The recursion can also be terminated earlier, for example both mergesort(I(1 to 2)) and merge(I(1 to 2)) actually reduce to single comp_exch block and mergesort can also be terminated at larger N values if an optimal sorting network structure is used instead.
The Wikipedia page provides an example of a Batcher sorting network for N=8:
The left half of the diagram is made up of the two mergesort(N/2) calls and the right half is the recursive merge(N) call. The two smaller mergesorts are each one decomposed into two mergesort(N/4) followed by a merge(N/2). The merge(N) itself is made out of two merge(N/2) interleaved, followed by N/2-1 comp_exch blocks and the two merge(N/2) are also recursively defined in terms of two merge(N/4) and one comp_exch block each.
If we denote S(N) and M(N) as the number of comp_exch blocks used by a mergesort respectively a merge operating on a list of size N then the following double recursion formulas will give us the size of a Batcher sorting network of any size N:
M(N)=M(ceil(N2 ))+M(ƒ loor(N2 ))+ƒ loor(N−12 )
S(N)=S(ceil(N2 ))+S(ƒ loor(N2 ))+M(N)
So how good is this Batcher odd-even mergesort algorithm? The table below gives the number of com_exch blocks required for a merge respectively a mergesort for a given size N and compares the later with the best known optimal sorting networks, or for N>10 for which the optimal sorting networks are not even known the range the optimal sorting network is expected to be:
N | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
merge(N) | 0 | 1 | 2 | 3 | 5 | 6 | 8 | 9 | 12 | 14 | 16 | 17 | 20 | 22 | 24 | 25 | 29 |
mergesort(N) | 0 | 1 | 3 | 5 | 9 | 12 | 16 | 19 | 26 | 32 | 37 | 41 | 48 | 54 | 59 | 63 | 74 |
optimal sorting network | 0 | 1 | 3 | 5 | 9 | 12 | 16 | 19 | 25 | 29 | 33..35 | 37..39 | 41..45 | 45..51 | 49..56 | 53..60 | 58..71 |
Up to N=8 the Batcher algorithm produces results as good as the best know optimal sorting network and it can generate them programmatically. For larger sizes we start to see some differences but they are within 10% and this is a small price to pay to be able to generate recursively such sorting networks.
Now that we have selected the sorting algorithm, we are ready to implement it in VHDL, which will be the subject of next week's post.
Back to the top: The Art of FPGA Design