element14 Community
  • Author: fpgaguru
  • Date Created: 9 Mar 2021 2:40 PM
  • xilinx
  • fpgafeatured
  • fpga
  • dsp
  • guest writer

The Art of FPGA Design Season 2 - Post 16

fpgaguru
9 Mar 2021

Multichannel and Overclocking FIRs - The Single Rate non-Symmetric Case

 

We now look at the case of the single rate FIR filter where the sample rate is a sub-multiple of the FPGA clock rate. For example, let's say that the input and output sample rates of our single rate FIR of order N=8 are 200Msps, but we know we can run our FPGA DSP48s and fabric at 800MHz. We can take advantage of the extra FPGA speed in two ways. We can either implement four such filters for the price of one, that is, using the same 8 DSP48s, or we can implement a single filter with only two DSP48s, using M=4 times fewer resources, where M is the ratio between the fabric clock rate and the data sample rate. In both cases we start from the same mathematical model, a direct form, single rate non-symmetric FIR of order N=8:

image
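As a point of reference for what follows, here is a minimal behavioral sketch of the direct form model in Python. It is not the author's code: the function name `fir_direct` and the coefficient values are assumptions made for illustration, and the filter of order N=8 is taken to have 8 taps, matching the 8 DSP48s mentioned above.

```python
def fir_direct(h, x):
    """Direct form FIR: y[n] = sum over k of h[k] * x[n-k], with x[n] = 0 for n < 0."""
    delay = [0] * len(h)                 # delay[0] holds the newest sample
    y = []
    for sample in x:
        delay = [sample] + delay[:-1]    # shift the tap delay line by one position
        y.append(sum(c * d for c, d in zip(h, delay)))
    return y


h = [1, 2, 3, 4, 5, 6, 7, 8]             # hypothetical coefficients, 8 taps
impulse = [1] + [0] * 9
print(fir_direct(h, impulse))            # prints [1, 2, 3, 4, 5, 6, 7, 8, 0, 0]
```

Feeding an impulse returns the coefficients in order, which is a quick sanity check for any of the restructured versions discussed later.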

The first case in which we implement four filters, in fact a filter with four channels which are time division multiplexed, is the easier one. We simply replace each delay in the single channel design with M delays:

image
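The effect of replacing each delay with M delays can be checked with a small behavioral model. The sketch below is only an illustration under stated assumptions: the names `fir_multichannel` and `fir_single` are hypothetical, the channels are assumed to be interleaved sample by sample (channel 0, 1, ..., M-1, then channel 0 again), and no pipelining is modeled.

```python
def fir_single(h, x):
    """Reference single channel FIR, y[n] = sum over k of h[k] * x[n-k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_multichannel(h, x_tdm, m):
    """TDM FIR: every unit delay of the single channel filter becomes m delays."""
    line = [0] * (m * (len(h) - 1) + 1)  # line[0] is the newest interleaved sample
    y = []
    for sample in x_tdm:
        line = [sample] + line[:-1]
        # Tap k now sits k*m positions back, so it only sees the same channel.
        y.append(sum(c * line[k * m] for k, c in enumerate(h)))
    return y


m = 4
h = [1, 2, 3, 4, 5, 6, 7, 8]                          # hypothetical coefficients
channels = [[c + 1, 0, c, 2, 1, 0, 0, 3, 0, 1] for c in range(m)]
x_tdm = [channels[c][n] for n in range(10) for c in range(m)]  # interleave
y_tdm = fir_multichannel(h, x_tdm, m)
for c in range(m):
    # De-interleaving the output gives each channel filtered independently.
    assert y_tdm[c::m] == fir_single(h, channels[c])
```

Because tap k is taken k*M positions back in the interleaved stream, each output only ever combines samples that belong to the same channel.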

We then pipeline this design using longitudinal and transversal pipeline cuts and we arrive at an efficient, high-speed and scalable implementation:

image

As in previous posts, the resources that will fit inside each DSP48 are colored with the same background color and fabric resources have a white background. This structure is generic and scalable: a multichannel FIR of any order N and number of channels M can be built using this scheme and will still run at the maximum possible system clock frequency.

 

The second case, where we only need one channel but we want to reduce the resource utilization by a factor of M, is more complicated. The basic idea is that we will use one DSP48 to implement M filter taps in M consecutive clock cycles. The DSP48 multiplier will multiply M consecutive samples with different coefficient values, which will have to be stored in a ROM look-up table, while the DSP48 post adder will have to implement an accumulator which will add together the M partial products sequentially and then all these partial sums will have to be added together. Something like this:

image

The problem with this scheme is that there are M+1 adders in each sub-block, but we have only one adder and M clocks to do all the additions. The key observation is that the first adder in each sub-chain is not really needed because it adds the first partial product with a zero constant. So we could rearrange the adders like this to require only M additions every M clocks.

image
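The rearranged order of additions can be sketched functionally in Python. This is an arithmetic model only, not a clock-accurate one, and the name `fir_semi_parallel` is an assumption: each block of M taps performs exactly M additions, one that combines the upstream partial sum with the first product, followed by M-1 accumulations.

```python
def fir_semi_parallel(h, x, m):
    """Arithmetic model of the rearranged adder chain, m taps per block."""
    assert len(h) % m == 0, "number of taps must be a multiple of m"
    blocks = len(h) // m
    hist = [0] * len(h)                  # hist[k] holds x[n-k]
    y = []
    for sample in x:
        hist = [sample] + hist[:-1]
        partial = 0                      # the first block has no upstream sum
        for b in range(blocks):
            base = b * m
            # Addition 1 of m: upstream partial sum plus the first product.
            acc = partial + h[base] * hist[base]
            # Additions 2..m: the same adder used as an accumulator.
            for j in range(1, m):
                acc += h[base + j] * hist[base + j]
            partial = acc                # handed to the next block downstream
        y.append(partial)
    return y


h = [1, 2, 3, 4, 5, 6, 7, 8]             # hypothetical coefficients, two blocks of m=4
print(fir_semi_parallel(h, [1] + [0] * 9, 4))   # prints [1, 2, 3, 4, 5, 6, 7, 8, 0, 0]
```

The result matches the direct form filter sample for sample; only the grouping of the additions has changed.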

But this still leaves us with another issue. The partial sum coming from the upstream sub-block will be available in clock M-1. However, the next block downstream needs it in clock 0, to add it to the first partial product, because in the next M-1 clocks the adder must be used as an accumulator for each new partial product.

The input delay line will be broken into sections of M delays, and a new sample will be pushed into these registers only once every M clocks, as indicated by the valid data input port DVI. A multiplexer selects each sample in this delay line in turn over M consecutive clocks, and each selected sample is multiplied with a different coefficient read from a ROM table. The data multiplexer and the coefficient ROM are both addressed by the same counter, which counts continuously from 0 to M-1. To keep the counter in sync with the data being shifted in, the DVI signal is also used as a synchronous reset for the address counter. The partial products are accumulated with an accumulator built with an adder and a register. For M-1 clocks, the accumulator will add a new product to the partial sum. For one clock, however, the value coming from the upstream accumulator is added to the first product of a set of M, which is also controlled by the DVI signal.
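The control scheme described above (address counter, data multiplexer, coefficient ROM and accumulator) can be illustrated with a clock-by-clock Python model of a single block. This is an idealized sketch with several assumptions: the function name `dsp_block_cycles` is hypothetical, there are no pipeline registers, the upstream partial sum is taken to be zero (a single block, N equal to M), and DVI is asserted on the first of every M clocks.

```python
def dsp_block_cycles(coef_rom, samples, m):
    """Idealized clocked model of one block: m taps on one multiplier and adder."""
    regs = [0] * m                       # one section of the delay line, regs[0] newest
    addr = 0                             # shared address for data mux and ROM
    acc = 0                              # accumulator register
    outputs = []
    for sample in samples:
        for clk in range(m):
            dvi = (clk == 0)             # a new sample arrives once every m clocks
            if dvi:
                regs = [sample] + regs[:-1]
                addr = 0                 # DVI resets the address counter
            product = regs[addr] * coef_rom[addr]
            if dvi:
                acc = 0 + product        # upstream sum (zero here) plus first product
            else:
                acc = acc + product      # plain accumulation for the other m-1 clocks
            addr = (addr + 1) % m
        outputs.append(acc)              # complete after the last clock of the set
    return outputs


print(dsp_block_cycles([1, 2, 3, 4], [1, 0, 0, 0, 0], 4))   # prints [1, 2, 3, 4, 0]
```

In a chain of several blocks, the `0` in the DVI branch would be replaced by the partial sum from the upstream block, which is exactly where the timing issue raised at the start of this discussion shows up.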

 

The problem mentioned earlier is solved by delaying the shift register data and the data valid signal from one block to the next by M-1 clocks. So the DVI of the downstream block is active M-1 clocks after the DVI of the block upstream of it. To avoid using M-1 registers for the data, we use a clock-enabled register to capture the proper value to be sent from one block to the next. It is a single register, but that sample is in fact delayed by M-1 clocks:

image

This is a much more complicated design, but it still computes what the original filter did. It requires M clocks instead of one to compute an output sample, but it uses M times fewer multipliers and adders, so you might need to think about this for a bit. The next step is to pipeline this structure, which is called an overclocked or semi-parallel implementation. We do this using pipeline cuts, in this case three longitudinal ones and a transversal one for every DSP48. After inserting all these registers and doing some rearrangement using register pushing we arrive at the following implementation:

image

Once again, the blocks are colored based on the DSP48 they belong to. To achieve the highest possible clock rate, the addressable shift register mux and the coefficient ROM both have their own pipeline registers. Every fabric LUT6 has two flip-flops associated with it and can hold either two coefficient bits or two data bits. For this reason, the resources going into the same fabric slices are also colored with the same background color.

 

This structure is also very efficient, generic and scalable to any N and M combination. For the case where N is not divisible by M, the filter coefficients must be zero padded until the new filter order becomes a multiple of M.
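The zero padding step is simple enough to show as a short Python helper. The name `pad_coefficients` is an assumption made for this sketch; the padding is appended at the end of the coefficient list, so the extra taps contribute nothing to the output.

```python
def pad_coefficients(h, m):
    """Append zeros until the number of coefficients is a multiple of m."""
    pad = (-len(h)) % m                  # 0 when len(h) is already a multiple of m
    return list(h) + [0] * pad


print(pad_coefficients([1, 2, 3, 4, 5, 6, 7], 4))   # prints [1, 2, 3, 4, 5, 6, 7, 0]
```

A 7-tap filter with M=4 therefore becomes an 8-tap filter that fits in two DSP48s, with the last tap multiplying by zero.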

 

Back to the top: The Art of FPGA Design Season 2
