element14 Community
element14 Community
    Register Log In
  • Site
  • Search
  • Log In Register
  • About Us
  • Community Hub
    Community Hub
    • What's New on element14
    • Feedback and Support
    • Benefits of Membership
    • Personal Blogs
    • Members Area
    • Achievement Levels
  • Learn
    Learn
    • Ask an Expert
    • eBooks
    • element14 presents
    • Learning Center
    • Tech Spotlight
    • STEM Academy
    • Webinars, Training and Events
    • Learning Groups
  • Technologies
    Technologies
    • 3D Printing
    • FPGA
    • Industrial Automation
    • Internet of Things
    • Power & Energy
    • Sensors
    • Technology Groups
  • Challenges & Projects
    Challenges & Projects
    • Design Challenges
    • element14 presents Projects
    • Project14
    • Arduino Projects
    • Raspberry Pi Projects
    • Project Groups
  • Products
    Products
    • Arduino
    • Avnet Boards Community
    • Dev Tools
    • Manufacturers
    • Multicomp Pro
    • Product Groups
    • Raspberry Pi
    • RoadTests & Reviews
  • Store
    Store
    • Visit Your Store
    • Choose another store...
      • Europe
      •  Austria (German)
      •  Belgium (Dutch, French)
      •  Bulgaria (Bulgarian)
      •  Czech Republic (Czech)
      •  Denmark (Danish)
      •  Estonia (Estonian)
      •  Finland (Finnish)
      •  France (French)
      •  Germany (German)
      •  Hungary (Hungarian)
      •  Ireland
      •  Israel
      •  Italy (Italian)
      •  Latvia (Latvian)
      •  
      •  Lithuania (Lithuanian)
      •  Netherlands (Dutch)
      •  Norway (Norwegian)
      •  Poland (Polish)
      •  Portugal (Portuguese)
      •  Romania (Romanian)
      •  Russia (Russian)
      •  Slovakia (Slovak)
      •  Slovenia (Slovenian)
      •  Spain (Spanish)
      •  Sweden (Swedish)
      •  Switzerland(German, French)
      •  Turkey (Turkish)
      •  United Kingdom
      • Asia Pacific
      •  Australia
      •  China
      •  Hong Kong
      •  India
      •  Korea (Korean)
      •  Malaysia
      •  New Zealand
      •  Philippines
      •  Singapore
      •  Taiwan
      •  Thailand (Thai)
      • Americas
      •  Brazil (Portuguese)
      •  Canada
      •  Mexico (Spanish)
      •  United States
      Can't find the country/region you're looking for? Visit our export site or find a local distributor.
  • Translate
  • Profile
  • Settings
FPGA
  • Technologies
  • More
FPGA
Blog The Art of FPGA Design - Post 35
  • Blog
  • Forum
  • Documents
  • Quiz
  • Events
  • Polls
  • Files
  • Members
  • Mentions
  • Sub-Groups
  • Tags
  • More
  • Cancel
  • New
Join FPGA to participate - click to join for free!
  • Share
  • More
  • Cancel
Group Actions
  • Group RSS
  • More
  • Cancel
Engagement
  • Author Author: fpgaguru
  • Date Created: 16 Jul 2019 1:55 PM Date Created
  • Views 1280 views
  • Likes 3 likes
  • Comments 2 comments
  • fpga_featured
  • xilinx
  • vhdl
  • guest writer
Related
Recommended

The Art of FPGA Design - Post 35

fpgaguru
fpgaguru
16 Jul 2019

Sorting Networks - The results for the VHDL implementation of Batcher's sorting algorithm

 

So how good is this VHDL implementation of a parallel sorting network? Before we look at the results here are the two modules that were missing from the previous post, a generic DELAY module for elements to be sorted:

 

library ieee;

 

use ieee.std_logic_1164.ALL;

 

use work.SORTER_PKG.all;

 

entity DELAY is

  generic(SIZE:INTEGER:=1);

  port(CLK:in STD_LOGIC;

       I:in T_VECTOR;

       O:out T_VECTOR);

end DELAY;

 

architecture TEST of DELAY is

begin

  i0:if SIZE=0 generate

       O<=I;

     end generate;

 

  il:if SIZE>0 generate

       type T_MATRIX is array(INTEGER range <>) of T_VECTOR(O'range);

       signal D:T_MATRIX(1 to SIZE):=(others=>(others=>T_ZERO));

       attribute keep:STRING;

       attribute keep of D:signal is "TRUE";

     begin

       process(CLK)

       begin

         if rising_edge(CLK) then

           for K in D'range loop

             if K=D'low then

               D(K)<=I;

             else

               D(K)<=D(K-1);

             end if;

           end loop;

         end if;

       end process;

       O<=D(D'high);

     end generate;

end TEST;

 

and a generic BDELAY block for BOOLEAN signals:

 

library ieee;

 

use ieee.std_logic_1164.ALL;

 

entity BDELAY is

  generic(SIZE:INTEGER:=1);

  port(CLK:in STD_LOGIC:='0';

       I:in BOOLEAN;

       O:out BOOLEAN);

end BDELAY;

 

architecture TEST of BDELAY is

begin

  i0:if SIZE=0 generate

       O<=I;

     end generate;

 

  il:if SIZE>0 generate

       type BOOLEAN_VECTOR is array(INTEGER range <>) of BOOLEAN;

       signal D:BOOLEAN_VECTOR(1 to SIZE):=(others=>FALSE);

     begin

       process(CLK)

       begin

         if rising_edge(CLK) then

           for K in D'range loop

             if K=D'low then

               D(K)<=I;

             else

               D(K)<=D(K-1);

             end if;

           end loop;

         end if;

       end process;

       O<=D(D'high);

     end generate;

end TEST;

 

Now that we have a complete set of source files for the generic sorting network design we only need a top level test module, which defines the size N of the network (remember that the basic type is also user definable in the SORTER_PKG package). We will use for a test a sorting network of size N=16, sorting UNSIGNED(7 downto 0) elements:

 

library IEEE;

 

use IEEE.STD_LOGIC_1164.ALL;

 

use work.SORTER_PKG.all;

 

-- MERGE_SORT takes an array as argument and returns the sorted array as a result

-- if N is even the array is divided into two equal parts

-- if N is odd the first part is one element longer than the second part

entity TEST_MERGE_SORT is

  generic(CE_LATENCY:INTEGER:=1;

          SIZE:INTEGER:=16);

  port(CLK:in STD_LOGIC:='0';

       I:in T_VECTOR(1 to SIZE);

       VI:in BOOLEAN:=TRUE;

       O:out T_VECTOR(1 to SIZE);

       VO:out BOOLEAN);

end TEST_MERGE_SORT;

 

architecture TEST of TEST_MERGE_SORT is

  signal RI:T_VECTOR(I'range);

  signal RVI:BOOLEAN;

begin

  process(CLK)

  begin

    if rising_edge(CLK) then

      RI<=I;

      RVI<=VI;

    end if;

  end process;

 

  ms:entity work.MERGE_SORT generic map(CE_LATENCY=>CE_LATENCY)

                            port map(CLK=>CLK,

                                     I=>RI,

                                     VI=>RVI,

                                     O=>O,

                                     VO=>VO);

end TEST;

 

The registered inputs RI and RVI are not part of the design itself and are there just as placeholders for the logic driving the sorting network inputs so that we can get an accurate timing estimate, we will subtract these registers from the utilization numbers.

 

The implemented design for N=16 uses 757 LUTs and 1281 FFs, which is really insignificant, even for the smallest FPGA. More importantly, this is a very fast design, it will close timing in the fastest speed grade Ultrascale+ FPGA at 967 MHz, which is more then the maximum allowed clock speed of 891 MHz for this device family. What this means is that we can sort a new set of 16 8-bit unsigned numbers every 1.1 ns. We are achieving something that is not even remotely possible with a software based algorithm running on a CPU, while using well less than 1% of even a small FPGA.

 

Yes, N=16 is actually a very small test case, so what about a larger sorting network, say N=64? All we have to do is change the SIZE generic in TEST_MERGE_SORT and re-implement. The design is using now 6997 LUTs and 10753 FFs and closes timing at 934 MHz, still over the maximum possible frequency of 891MHz.

 

Let's raise the bar even higher now for a really big test case, for example N=1024? This is again very easy to test, just by changing the value of the SIZE generic to 1024. We can see here the power of recursive component instantiations, the N=1024 design has now 66052 sub-components, all generated from less than 200 lines of VHDL code! This is a big design, the size has increased significantly to 352366 LUTs and 450561 FFs which is 83% of a medium size FPGA like ZCU17EG and the clock speed also got lower at 503MHz but keep in mind that we are sorting a new set of 1024 8-bit unsigned values every 2.0 ns and we could instantiate multiple such sorting networks in a larger FPGA, all of them running in parallel if we need to increase throughput even further. So this is still a tremendous amount of computations done in parallel at a very high rate.

 

We are close to the end of this sorting network saga, I will conclude it next week with the answer to a really important question - how do we know this design actually does what it is supposed to do? How do you verify a sorting network of size 1024?

 

Back to the top: The Art of FPGA Design

  • Sign in to reply

Top Comments

  • michaelkellett
    michaelkellett over 6 years ago +2
    I like your concept of small FPGA, to me a small FPGA might be a Lattice ICE 40 with 2k LUTs. I've never even seen a 450k LUT FPGA, let alone soldered one in. The first FPGA I used was a Xilinx 2064, 8…
Parents
  • michaelkellett
    michaelkellett over 6 years ago

    I like your concept of small FPGA, to me a small FPGA might be a Lattice ICE 40 with 2k LUTs. I've never even seen a 450k LUT FPGA, let alone soldered one in.

    The first FPGA I used was a Xilinx 2064, 8 x 8 matrix of 64 LUts !

     

    It's interesting to see what other people do.

     

    I googled ZCU17EG but didn't get a useful hit.

     

    MK

    • Cancel
    • Vote Up +2 Vote Down
    • Sign in to reply
    • More
    • Cancel
Comment
  • michaelkellett
    michaelkellett over 6 years ago

    I like your concept of small FPGA, to me a small FPGA might be a Lattice ICE 40 with 2k LUTs. I've never even seen a 450k LUT FPGA, let alone soldered one in.

    The first FPGA I used was a Xilinx 2064, 8 x 8 matrix of 64 LUts !

     

    It's interesting to see what other people do.

     

    I googled ZCU17EG but didn't get a useful hit.

     

    MK

    • Cancel
    • Vote Up +2 Vote Down
    • Sign in to reply
    • More
    • Cancel
Children
  • fpgaguru
    fpgaguru over 6 years ago in reply to michaelkellett

    Hi Michael,

     

    Small is a relative thing. The smallest Zynq MPSoC FPGA, ZU2EG has 47K LUTs and 94K FFs. That's on top of 5Mb of BRAMs and 240 DSP48E2s, plus a 1.5GHz quad-core ARM Cortex-A53 and a 600MHz dual-core ARM Cortex-R5, plus many other things. The ZU17EG is at the higher end of the Zynq MPSoC range, with almost 1M FFs and 0.5M LUTs. You can find info about the entire range here https://www.xilinx.com/support/documentation/selection-guides/zynq-ultrascale-plus-product-selection-guide.pdf  - all these are "small" FPGAs, you have to take a look at Virtex UltraScale+ or the new 7nm Versal families if you want to see really large FPGAs.

     

    Catalin

    • Cancel
    • Vote Up 0 Vote Down
    • Sign in to reply
    • More
    • Cancel
element14 Community

element14 is the first online community specifically for engineers. Connect with your peers and get expert answers to your questions.

  • Members
  • Learn
  • Technologies
  • Challenges & Projects
  • Products
  • Store
  • About Us
  • Feedback & Support
  • FAQs
  • Terms of Use
  • Privacy Policy
  • Legal and Copyright Notices
  • Sitemap
  • Cookies

An Avnet Company © 2025 Premier Farnell Limited. All Rights Reserved.

Premier Farnell Ltd, registered in England and Wales (no 00876412), registered office: Farnell House, Forge Lane, Leeds LS12 2NE.

ICP 备案号 10220084.

Follow element14

  • X
  • Facebook
  • linkedin
  • YouTube