element14 Community
element14 Community
    Register Log In
  • Site
  • Search
  • Log In Register
  • Community Hub
    Community Hub
    • What's New on element14
    • Feedback and Support
    • Benefits of Membership
    • Personal Blogs
    • Members Area
    • Achievement Levels
  • Learn
    Learn
    • Ask an Expert
    • eBooks
    • element14 presents
    • Learning Center
    • Tech Spotlight
    • STEM Academy
    • Webinars, Training and Events
    • Learning Groups
  • Technologies
    Technologies
    • 3D Printing
    • FPGA
    • Industrial Automation
    • Internet of Things
    • Power & Energy
    • Sensors
    • Technology Groups
  • Challenges & Projects
    Challenges & Projects
    • Design Challenges
    • element14 presents Projects
    • Project14
    • Arduino Projects
    • Raspberry Pi Projects
    • Project Groups
  • Products
    Products
    • Arduino
    • Avnet Boards Community
    • Dev Tools
    • Manufacturers
    • Multicomp Pro
    • Product Groups
    • Raspberry Pi
    • RoadTests & Reviews
  • Store
    Store
    • Visit Your Store
    • Choose another store...
      • Europe
      •  Austria (German)
      •  Belgium (Dutch, French)
      •  Bulgaria (Bulgarian)
      •  Czech Republic (Czech)
      •  Denmark (Danish)
      •  Estonia (Estonian)
      •  Finland (Finnish)
      •  France (French)
      •  Germany (German)
      •  Hungary (Hungarian)
      •  Ireland
      •  Israel
      •  Italy (Italian)
      •  Latvia (Latvian)
      •  
      •  Lithuania (Lithuanian)
      •  Netherlands (Dutch)
      •  Norway (Norwegian)
      •  Poland (Polish)
      •  Portugal (Portuguese)
      •  Romania (Romanian)
      •  Russia (Russian)
      •  Slovakia (Slovak)
      •  Slovenia (Slovenian)
      •  Spain (Spanish)
      •  Sweden (Swedish)
      •  Switzerland(German, French)
      •  Turkey (Turkish)
      •  United Kingdom
      • Asia Pacific
      •  Australia
      •  China
      •  Hong Kong
      •  India
      •  Korea (Korean)
      •  Malaysia
      •  New Zealand
      •  Philippines
      •  Singapore
      •  Taiwan
      •  Thailand (Thai)
      • Americas
      •  Brazil (Portuguese)
      •  Canada
      •  Mexico (Spanish)
      •  United States
      Can't find the country/region you're looking for? Visit our export site or find a local distributor.
  • Translate
  • Profile
  • Settings
FPGA
  • Technologies
  • More
FPGA
Blog The Art of FPGA Design - Post 34
  • Blog
  • Forum
  • Documents
  • Quiz
  • Events
  • Polls
  • Files
  • Members
  • Mentions
  • Sub-Groups
  • Tags
  • More
  • Cancel
  • New
Join FPGA to participate - click to join for free!
  • Share
  • More
  • Cancel
Group Actions
  • Group RSS
  • More
  • Cancel
Engagement
  • Author Author: fpgaguru
  • Date Created: 7 Jul 2019 3:48 PM Date Created
  • Views 1432 views
  • Likes 2 likes
  • Comments 4 comments
  • fpga_featured
  • xilinx
  • vhdl
  • guest writer
Related
Recommended

The Art of FPGA Design - Post 34

fpgaguru
fpgaguru
7 Jul 2019

Sorting Networks - The VHDL implementation of Batcher's sorting algorithm

 

OK, enough with the preliminaries, it's time now for the real deal, how do we implement in an FPGA this parallel sorting network thing. The main design goals are creating a generic, reusable and efficient implementation. We want to be able to implement a sorting network of any size N with a single piece of code, we want to be able to sort all kinds of data, not just integers and we want something close in size to the best known optimal sorting networks. Of course, we also want to run as fast as possible, at least at 500MHz if not faster, sorting one set of N elements every clock. It looks like a pretty tall order so let's try to do to it.

 

We will start with a VHDL package, which defines the basic data type,as well as an unconstrained array of such abstract type and a function to compare two such elements and decide which one is "smaller". This is VHDL-2008 syntax, these definitions will not work in a VHDL-93 environment. We will use unconstrained UNSIGNED as the basic type and the "<" or "less than" comparison function as an example but in theory we are free to use a wide range of basic types and comparison functions here. To sort different kinds of items other than the UNSIGNED(7 downto 0) I am using as an example you would have to make changes here. This package also includes some helper functions to calculate the latencies of various blocks, which will come handy later on. They will be used to equalize the latencies of slightly different parallel paths of data when they will need to be merged together:

 

library IEEE;

 

use IEEE.STD_LOGIC_1164.all;

use IEEE.NUMERIC_STD.all;


package SORTER_PKG is

  subtype T is UNSIGNED(7 downto 0);                           -- define your own base type T here

  constant T_ZERO:T:=(others=>'0');                            -- depends on actual type T

  function "<"(L,R:T) return BOOLEAN;                          -- define compare function for type T

  type T_VECTOR is array(INTEGER range <>) of T;               -- unconstrained vector of abstract base type T

  function MERGE_LATENCY(N,CE_LATENCY:INTEGER) return INTEGER; -- the latency of a sorting network of size N, given the latency of the COMP_EXCH block

  function SORT_LATENCY(N,CE_LATENCY:INTEGER) return INTEGER;  -- the latency of a merge operation of size N, given the latency of the COMP_EXCH block

end SORTER_PKG;


package body SORTER_PKG is

  function "<"(L,R:T) return BOOLEAN is

  begin

    return IEEE.NUMERIC_STD."<"(L,R); -- to avoid infinite recursion loop when "<" is already defined for T

  end;


  function SORT_LATENCY(N,CE_LATENCY:INTEGER) return INTEGER is

  begin

    case N is

      when 1=>return 0;

      when 2=>return CE_LATENCY;

      when others=>return SORT_LATENCY((N+1)/2,CE_LATENCY)+MERGE_LATENCY(N,CE_LATENCY);

    end case;

  end;

 

  function MERGE_LATENCY(N,CE_LATENCY:INTEGER) return INTEGER is

  begin

    case N is

      when 1=>return 0;

      when 2=>return CE_LATENCY;

      when others=>return MERGE_LATENCY((N+1)/2-1,CE_LATENCY)+CE_LATENCY;

    end case;

  end;

end SORTER_PKG;

The next step up the bottom-up design ladder is a "compare and exchange" component. COMP_EXCH has two inputs and two outputs and either swaps or passes the two inputs to the outputs so that they are always in ascending order. Another way to look at COMP_EXCH is as a MIN(I0,I1) function driving the first output and a MAX(I0,I1) driving the second output but the best hardware implementation uses a single comparator and two 2:1 muxes. There is also a generic to control the latency of this building block, we can get a combinatorial implementation if we want or introduce one or more register levels for pipelining to increase throughput. Generic DELAY and BDELAY blocks are used for this purpose, of the kind we discussed in posts 8 and 9. You can think of this COMP_EXCH block as the solution for our sorting network problem for the very particular and easy case of N=2:

 

library IEEE;

 

use IEEE.STD_LOGIC_1164.ALL;


use work.SORTER_PKG.all;


entity COMP_EXCH is

  generic(LATENCY:INTEGER:=0);

  port(CLK:in STD_LOGIC:='0';

       I0,I1:in T;

       VI:in BOOLEAN:=TRUE;

       O:out T_VECTOR(0 to 1);

       VO:out BOOLEAN);

end COMP_EXCH;


architecture TEST of COMP_EXCH is

  signal I:T_VECTOR(O'range);

begin

  I<=(I1,I0) when I1<I0 else (I0,I1);

  dl:entity work.DELAY generic map(SIZE=>LATENCY)

                       port map(CLK=>CLK,

                                I=>I,

                                O=>O);

  bd:entity work.BDELAY generic map(SIZE=>LATENCY)

                        port map(CLK=>CLK,

   I=>VI,

                                 O=>VO);

end TEST;

 

The next step is a generic MERGE block. This takes an input of an arbitrary size N, with the lower and upper halves already sorted and merges them into a fully sorted output. This is a recursive component definition, defined in terms of smaller instances of itself, with the recursion stopping at N=1, when MERGE simply passes its one input to the output and N=2 when MERGE reduces to one COMP_EXCH block:

 

library IEEE;

 

use IEEE.STD_LOGIC_1164.ALL;

 

use work.SORTER_PKG.all;

 

-- MERGE takes two sorted arrays concatenated as an argument and returns a sorted array as a result

-- if N is even the two sorted arrays are of equal lengths

-- if N is odd the first sorted array is one element longer than the second one

entity MERGE is

  generic(CE_LATENCY:INTEGER:=0);

  port(CLK:in STD_LOGIC:='0';

       I:in T_VECTOR;

       VI:in BOOLEAN:=TRUE;

       O:out T_VECTOR;

       VO:out BOOLEAN);

end MERGE;

 

architecture TEST of MERGE is

  constant N:NATURAL:=I'length;

  constant M:INTEGER:=N/2-(N-1) mod 4/3; -- N/2-1 when N mod 4=0 else N/2

 

-- EVEN returns an array composed of the even elements of the argument (the first element of I is always considered even)

  function EVEN(I:T_VECTOR) return T_VECTOR is

    constant N:NATURAL:=I'length;

    variable Result:T_VECTOR(I'low to I'low+(N+1)/2-1);

  begin

    for K in Result'range loop

      Result(K):=I(I'low+2*(K-Result'low));

    end loop;

    return Result;

  end;

-- ODD returns an array composed of the odd elements of the argument (the first element of I is always considered even)

  function ODD(I:T_VECTOR) return T_VECTOR is

    constant N:NATURAL:=I'length;

    variable Result:T_VECTOR(I'low to I'low+N/2-1);

  begin

    for K in Result'range loop

      Result(K):=I(I'low+2*(K-Result'low)+1);

    end loop;

    return Result;

  end;

begin

  assert I'length=O'length report "Ports I and O must have the same lenght!" severity error;

  i1:if N=1 generate

       O<=I;

       VO<=VI;

     end generate; 

  i2:if N=2 generate

       cx:entity work.COMP_EXCH generic map(LATENCY=>CE_LATENCY)

                                port map(CLK=>CLK,

                                         I0=>I(I'low),

                                         I1=>I(I'high),

                                         VI=>VI,

                                         O=>O,

                                         VO=>VO);

     end generate; 

  i3:if N>2 generate

       signal IL,ML:T_VECTOR(I'low to I'low+(N+1)/2-1+(N+1) mod 4/3);

       signal IH,MH,MHD:T_VECTOR(I'low+(N+1)/2+(N+1) mod 4/3 to I'high);

       signal V:T_VECTOR(I'range);

       signal VM:BOOLEAN;

     begin

       IL<=EVEN(I(I'low to I'low+(N+1)/2-1))&EVEN(I(I'low+(N+1)/2 to I'high));

       lm:entity work.MERGE generic map(CE_LATENCY=>CE_LATENCY)

                            port map(CLK=>CLK,

                                     I=>IL,

                                     VI=>VI,

                                     O=>ML,

                                     VO=>VM);

       IH<=ODD(I(I'low to I'low+(N+1)/2-1))&ODD(I(I'low+(N+1)/2 to I'high));

       hm:entity work.MERGE generic map(CE_LATENCY=>CE_LATENCY)

                            port map(CLK=>CLK,

                                     I=>IH,

                                     VI=>VI,

                                     O=>MH,

                                     VO=>open);

       dl:entity work.DELAY generic map(SIZE=>MERGE_LATENCY(ML'length,CE_LATENCY)-MERGE_LATENCY(MH'length,CE_LATENCY))

                            port map(CLK=>CLK,

                                     I=>MH,

                                     O=>MHD);

       V<=ML&MHD;

       d0:entity work.DELAY generic map(SIZE=>CE_LATENCY)

                            port map(CLK=>CLK,

                                     I=>V(V'low to V'low),

                                     O=>O(O'low to O'low));

       lk:for K in 1 to (N+1)/2-1 generate

            cx:entity work.COMP_EXCH generic map(LATENCY=>CE_LATENCY)

                                     port map(CLK=>CLK,

                                              --CE=>CE,

                                              I0=>V(V'low+K),

                                              I1=>V(V'low+K+M),

                                              O=>O(O'low+2*K-1 to O'low+2*K));

          end generate;

       i0:if N mod 4=0 generate

            dl:entity work.DELAY generic map(SIZE=>CE_LATENCY)

                                 port map(CLK=>CLK,

                                          I=>V(V'high to V'high),

                                          O=>O(O'high to O'high));

          end generate;

       i2:if N mod 4=2 generate

            dl:entity work.DELAY generic map(SIZE=>CE_LATENCY)

                                 port map(CLK=>CLK,

                                          I=>V(V'low+M to V'low+M),

                                          O=>O(O'high to O'high));

          end generate;

       v0:entity work.BDELAY generic map(SIZE=>CE_LATENCY)

                             port map(CLK=>CLK,

                                      I=>VM,

                                      O=>VO);

     end generate; 

end TEST;

Now we can see some light at the end of the tunnel, we can introduce the the final top level MERGE_SORT block which turns out to be the simplest of all. It is also recursively defined in terms two smaller instances of itself plus the MERGE block we saw earlier. Here again the recursion stops at N=1 where MERGE_SORT just forwards the input to the output and N=2 where MERGE_SORT is one COMP_EXCH instance, as we know already that COMP_EXCH is a sorting network of size 2:

library IEEE;

 

use IEEE.STD_LOGIC_1164.ALL;

 

use work.SORTER_PKG.all;

 

-- MERGE_SORT takes an array as argument and returns the sorted array as a result

-- if N is even the array is divided into two equal parts

-- if N is odd the first part is one element longer than the second part

entity MERGE_SORT is

  generic(CE_LATENCY:INTEGER:=0);

  port(CLK:in STD_LOGIC:='0';

       I:in T_VECTOR;

       VI:in BOOLEAN:=TRUE;

       O:out T_VECTOR;

       VO:out BOOLEAN);

end MERGE_SORT;

 

architecture TEST of MERGE_SORT is

  constant N:NATURAL:=I'length;

begin

  assert I'length=O'length report "Ports I and O must have the same lenght!" severity error;

  i1:if N=1 generate

       O<=I;

       VO<=VI;

     end generate; 

  i2:if N=2 generate

       cx:entity work.COMP_EXCH generic map(LATENCY=>CE_LATENCY)

                                port map(CLK=>CLK,

                                         I0=>I(I'low),

                                         I1=>I(I'high),

                                         VI=>VI,

                                         O=>O,

                                         VO=>VO);

     end generate; 

  i3:if N>2 generate

       signal SL:T_VECTOR(I'low to I'low+(N+1)/2-1);

       signal SH,SHD:T_VECTOR(I'low+(N+1)/2 to I'high);

       signal S:T_VECTOR(I'range);

       signal VS:BOOLEAN;

     begin

       ml:entity work.MERGE_SORT generic map(CE_LATENCY=>CE_LATENCY,

                                             SIZE=>(N+1)/2)

                                 port map(CLK=>CLK,

                                          I=>I(I'low to I'low+(N+1)/2-1),

                                          VI=>VI,

                                          O=>SL,

                                          VO=>VS);

       mh:entity work.MERGE_SORT generic map(CE_LATENCY=>CE_LATENCY,

                                             SIZE=>I'length-(N+1)/2)

                                 port map(CLK=>CLK,

                                          I=>I(I'low+(N+1)/2 to I'high),

                                          VI=>VI,

                                          O=>SH,

                                          VO=>open);

       dl:entity work.DELAY generic map(SIZE=>SORT_LATENCY(SL'length,CE_LATENCY)-SORT_LATENCY(SH'length,CE_LATENCY))

                            port map(CLK=>CLK,

                                     I=>SH,

                                     O=>SHD);

       S<=SL&SHD;

       me:entity work.MERGE generic map(CE_LATENCY=>CE_LATENCY)

                            port map(CLK=>CLK,

                                     I=>S,

                                     VI=>VS,

                                     O=>O,

                                     VO=>VO);

     end generate; 

end TEST;

So to recapitulate, we have created a parallel sorting network based on the Batcher or odd-even mergesort sorting algorithm, which divides the input set into two halves, sorts them separately (and recursively) and then merges them into a single sorted output. While not optimal in the sense of the least number of COMP_EXCH steps, mergesort is pretty close and more than good enough for the practical sizes one will encounter in hardware implementations. For this small price we paid in inefficiency we get a parameterizable implementation that can implement a sorting network of arbitrary size N (remember that we do not even know what the optimal sorting networks for sizes greater than N=10 are and there are no known algorithms to generate them programmatically). We also kept the code as generic as possible, with an arbitrary base type and a user defined comparison function, as well as a generic to control the latency of the entire thing, from a purely combinatorial implementation to a fully pipelined, very high speed one.

 

You may think that we have finally put this matter of sorting networks to rest but there a few things left. Since this post got quite long, next week we will look at how to instantiate this generic sorting network module and what kind of performance in terms of size and speed we can expect. Maybe we can improve on the design a bit too. Then the week after that we will try to answer a very difficult question - how do we know that this design actually works, especially for any arbitrary size N? Maybe it works for some smaller sizes but fails at larger ones, or maybe it sorts almost all possible input combinations but fails at some particular corner case, like already sorted or reverse sorted inputs. This is not a trivial question and it is a problem as interesting as the sorting network itself. We cannot move on to other subjects before we look into this design verification problem for sorting networks in more detail.

 

Back to the top: The Art of FPGA Design

  • Sign in to reply

Top Comments

  • DAB
    DAB over 6 years ago +1
    Nice post. Are you going to work through a timing diagram for this algorithm? DAB
  • fpgaguru
    fpgaguru over 6 years ago in reply to DAB +1
    There is not much to show in a timing diagram. A new set of unsorted input data is applied at the input every clock and comes out sorted at the other end after some number of clocks of latency which depends…
  • DAB
    DAB over 6 years ago in reply to fpgaguru +1
    I am curious on how the data cycles through the FPGA components as you add a new object into the algorithm to be sorted. When I look at hardware implementations I find it useful to understand the data…
  • fpgaguru
    fpgaguru over 6 years ago in reply to DAB

    Data is not sorted one item at a time, but all N items at once in parallel. Every clock a new sorting operation starts and the throughput is one N-element list sorted every clock. If the latency for a given N is L clocks, then at any one time L completely independent sorting operations propagate through the sorting network in various stages of completion. Every clock a new set of N items is fed into the sorter and L clocks later it comes out sorted on the other end. This is a parallel implementation of a sorting algorithm and it would be impossible to describe it in a timing diagram form. It helps to think in terms of throughput and latency here - throughput is 1, meaning one new entire sorting operation every clock and latency is given by the SORT_LATENCY function in the SORTING_PKG package for any given N.

    • Cancel
    • Vote Up +1 Vote Down
    • Sign in to reply
    • More
    • Cancel
  • DAB
    DAB over 6 years ago in reply to fpgaguru

    I am curious on how the data cycles through the FPGA components as you add a new object into the algorithm to be sorted.

     

    When I look at hardware implementations I find it useful to understand the data propagation through the gates.

     

    I am assuming that the tools provide something similar to look at FPGA throughput effects.

     

    DAB

    • Cancel
    • Vote Up +1 Vote Down
    • Sign in to reply
    • More
    • Cancel
  • fpgaguru
    fpgaguru over 6 years ago in reply to DAB

    There is not much to show in a timing diagram. A new set of unsorted input data is applied at the input every clock and comes out sorted at the other end after some number of clocks of latency which depends on the sorting network size N. The SORT_LATENCY function in the SORTER_PKG package gives you this latency based on N. You can sort new sets of inputs back to back every clock. What exactly would you want to see in a timing diagram? 

    • Cancel
    • Vote Up +1 Vote Down
    • Sign in to reply
    • More
    • Cancel
  • DAB
    DAB over 6 years ago

    Nice post.

     

    Are you going to work through a timing diagram for this algorithm?

     

    DAB

    • Cancel
    • Vote Up +1 Vote Down
    • Sign in to reply
    • More
    • Cancel
element14 Community

element14 is the first online community specifically for engineers. Connect with your peers and get expert answers to your questions.

  • Members
  • Learn
  • Technologies
  • Challenges & Projects
  • Products
  • Store
  • About Us
  • Feedback & Support
  • FAQs
  • Terms of Use
  • Privacy Policy
  • Legal and Copyright Notices
  • Sitemap
  • Cookies

An Avnet Company © 2025 Premier Farnell Limited. All Rights Reserved.

Premier Farnell Ltd, registered in England and Wales (no 00876412), registered office: Farnell House, Forge Lane, Leeds LS12 2NE.

ICP 备案号 10220084.

Follow element14

  • X
  • Facebook
  • linkedin
  • YouTube