**The basic building blocks on the DSP algorithm side**

The DSP algorithms that need to be implemented are relatively simple, mathematically speaking. They consist of three basic operations: additions, multiplications and vector or matrix indexing. We will use the classic FIR formula as an example to illustrate this:
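The formula itself was shown as an image in the original post; reconstructed from the description that follows (an order-N filter with coefficients h[k], k=0..N-1), it reads:

```latex
y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k]
```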

Here we have two discrete-time signals, x[n] and y[n], the input and the output of our filter, where the index n represents time. The FIR filter itself is defined by its order N and the N constant coefficients h[k] with k=0..N-1. The formula gives us the math algorithm we need to compute y, given x and h. To do that we need N multiplications and N-1 additions for each output sample y[n]. We basically need to compute the convolution of the input data signal x[n] with the filter impulse response h[k].
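As a behavioural sketch of the arithmetic involved (plain Python, not hardware; the function name is just illustrative), computing one output sample takes N multiplications and N-1 additions:

```python
def fir_sample(h, x, n):
    """Compute one FIR output sample y[n] = sum over k of h[k] * x[n-k].

    h: list of N filter coefficients
    x: list of input samples (samples before time 0 are treated as zero)
    """
    acc = 0.0
    for k, hk in enumerate(h):      # N multiplications, N-1 useful additions
        if 0 <= n - k < len(x):
            acc += hk * x[n - k]
    return acc

# Example: a 3-tap moving-average filter
h = [1 / 3, 1 / 3, 1 / 3]
x = [3.0, 6.0, 9.0, 9.0]
y = [fir_sample(h, x, n) for n in range(len(x))]
```

Note that for the first N-1 samples the delay line is still filling up with zeros, so the early outputs are partial sums.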

The filter is linear, a fact described mathematically by the following formulas where H is the FIR filter’s transfer function and X is the Z transform of x:

*y = h∗x = h∗(a₁x₁ + a₂x₂) = a₁(h∗x₁) + a₂(h∗x₂) = a₁y₁ + a₂y₂*

in the time domain, where ∗ denotes convolution, and

*Y = HX = H(a₁X₁ + a₂X₂) = a₁HX₁ + a₂HX₂ = a₁Y₁ + a₂Y₂*

in the Z frequency domain, where *Y₁ = HX₁* and *Y₂ = HX₂*.

What these formulas say is that if we apply a linear combination of two signals at the FIR input, the output is the same linear combination of the two individual responses. We have used here the convolution theorem, which transforms convolution in the time domain into multiplication in the frequency domain and the fact that multiplication and addition are commutative and associative over the field of complex numbers.
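The linearity property above can be checked numerically. The sketch below (plain Python, with made-up coefficients and signals) filters a linear combination of two inputs and compares the result against the same linear combination of the two individual responses:

```python
def convolve(h, x):
    """Full convolution y = h * x, of length len(h) + len(x) - 1."""
    y = [0.0] * (len(h) + len(x) - 1)
    for k, hk in enumerate(h):
        for i, xi in enumerate(x):
            y[k + i] += hk * xi
    return y

h = [0.5, 0.25, 0.25]           # arbitrary example coefficients
x1 = [1.0, 2.0, 3.0]
x2 = [4.0, -1.0, 0.5]
a1, a2 = 2.0, -3.0

# Left side: filter the linear combination a1*x1 + a2*x2
mixed = [a1 * u + a2 * v for u, v in zip(x1, x2)]
lhs = convolve(h, mixed)

# Right side: the same linear combination of the individual responses
y1, y2 = convolve(h, x1), convolve(h, x2)
rhs = [a1 * u + a2 * v for u, v in zip(y1, y2)]

assert all(abs(u - v) < 1e-9 for u, v in zip(lhs, rhs))
```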

Another important property of the filter is that it is time invariant, which simply means that the coefficients h[k] are constant and do not change in time. They do not depend on the time index n, nor the input and output data. Finally, the filter is causal, y[n] is a function of the current input sample x[n] and its past values like x[n-1] but not of future values like x[n+1]. This would be perfectly OK from a mathematical point of view, since the formula exists outside time, but it would be physically impossible to implement without being able to time travel into the future.

The three fundamental building blocks of the vast majority of DSP algorithms are then adders, multipliers and delays. I hope their behaviour is self-explanatory and I will use the following graphical representation for them:

If we need to add more than two terms we can simply use an adder with more than two inputs. In mathematical terms we say that addition is associative. In the algorithm world it is traditional to name the filter coefficients h[k], but in the hardware world we will call them c[k]. Think of h[k] as the infinite precision version and c[k] as the quantized version of our filter coefficients. They both represent the same thing, the filter coefficients, but they are not necessarily identical in a mathematical sense, due to quantization effects. The delays are represented by z⁻¹ because that is the Z transform of the transfer function of a one clock delay register. We might also need to cascade multiple such delays in a delay line. If the intermediate values are not needed, we can represent such a delay chain by a single multi-clock delay. Here is an example of three cascaded delays:
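A chain of z⁻¹ registers behaves like a shift register: on every clock a new sample goes in and the oldest one comes out. A minimal behavioural model (plain Python, class name purely illustrative) of three cascaded delays:

```python
class DelayLine:
    """Behavioural model of a chain of z^-1 registers, one shift per clock."""

    def __init__(self, depth):
        self.regs = [0.0] * depth       # registers reset to zero

    def clock(self, x_in):
        """Advance one clock: shift x_in in, return the oldest value out."""
        x_out = self.regs[-1]
        self.regs = [x_in] + self.regs[:-1]
        return x_out

# Three cascaded delays: the input reappears at the output 3 clocks later
d3 = DelayLine(3)
outputs = [d3.clock(x) for x in [1.0, 2.0, 3.0, 4.0, 5.0]]
# outputs == [0.0, 0.0, 0.0, 1.0, 2.0]
```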

The first step is to move from the domain of math formulas into the real world. We will build a graphical representation of the mathematical algorithm, using the building blocks introduced above. If we take a look at the following schematic diagram of an FIR filter for the particular case N=5, it should be obvious that it implements the FIR mathematical algorithm:

But there is a fundamental difference in the way time is handled in the two cases. The mathematical formula exists outside time and n is just one variable in the formula like any other. There, x[n] and y[n] represent *all* x and y samples, from infinite past to infinite future. In the diagram, on the other hand, x[n] and y[n] are the *current* values of those signals, just one snapshot taken at time n. The delay blocks give us access to past sample values: the older the samples you need, the more delay blocks you have to use. You need k delays to make input sample x[n-k] available at time n. Future samples are of course not available at all, there is no inverse delay register that could predict future values of x[n] or y[n]. A hardware implementation is possible only when the formula describes a causal filter transfer function.
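This snapshot-in-time view can be sketched in behavioural Python (class and method names are illustrative, not HDL): on every clock the filter sees only the current input sample plus the N-1 delayed copies held in its registers.

```python
class DirectFormFIR:
    """Sample-by-sample FIR model: current input plus N-1 delayed copies."""

    def __init__(self, coeffs):
        self.c = list(coeffs)
        self.taps = [0.0] * len(coeffs)   # taps[k] holds x[n-k]

    def clock(self, x_n):
        """One clock tick: shift the delay line, then compute y[n]."""
        self.taps = [x_n] + self.taps[:-1]
        return sum(ck * xk for ck, xk in zip(self.c, self.taps))

# Feeding in an impulse reproduces the coefficients one per clock
fir = DirectFormFIR([1.0, 2.0, 1.0])
y = [fir.clock(x) for x in [1.0, 0.0, 0.0, 0.0]]
# y == [1.0, 2.0, 1.0, 0.0], the impulse response of the filter
```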

This is just a first step on the way to a hardware implementation of the FIR DSP algorithm. As mentioned in an earlier post, the math formula operates with infinite precision real or complex numbers. Both data samples and filter coefficients will have to be quantized for a hardware implementation to become possible. We will ignore that for now and accept that in hardware we will have to work with limited precision numbers instead, integer, fixed point or floating point. Our hardware implementation can never match the mathematical algorithm exactly.
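As a quick illustration of the quantization effect (a sketch, assuming a simple fixed-point format with a made-up number of fractional bits), rounding the ideal h[k] to the nearest representable c[k] introduces a small but non-zero error:

```python
def quantize(h, frac_bits):
    """Round infinite-precision coefficients h[k] to fixed-point c[k]
    with the given number of fractional bits."""
    scale = 1 << frac_bits
    return [round(hk * scale) / scale for hk in h]

h = [0.1, 0.2, 0.4, 0.2, 0.1]        # "ideal" coefficients
c = quantize(h, 8)                    # quantized to 8 fractional bits
errors = [abs(hk - ck) for hk, ck in zip(h, c)]
# each coefficient is off by at most half an LSB, i.e. 2**-9
assert max(errors) <= 2 ** -9
```

The quantized c[k] are what the hardware actually multiplies by, so the implemented filter response differs slightly from the mathematical one.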

Another important aspect is the multiple input adder and the pipelining problem. The math formula is generic, in the sense that it works for any particular value of N. We definitely want to achieve the same in a hardware implementation. We do not want separate designs for every possible value of N, we will want a single hardware design with N as a parameter. We call such a design generic or parameterizable.

Another very desirable feature of a hardware design is scalability. We want the design's performance to be independent of N and its size to scale linearly with N. The throughput, the ability of the design to process data at a maximum rate, should not drop when N increases. Similarly, the device utilization, that is the hardware resources being used (the number of adders, multipliers and delays), should grow linearly with N, not faster. In the mathematical realm computations are instantaneous because they happen outside time. Signal time and computation time are two different things; in the real world they are either identical or related by a finite integer ratio. In any case, processing information and even simply moving data from one point to another takes time, the speed of light being what it is. In hardware everything takes a non-zero amount of time; there is no instantaneous computation like in the mathematical realm.

While it is possible to design an N-input adder in hardware, it will not be a scalable implementation: a larger adder will run slower than a smaller one. Fortunately, addition is associative, so an N-input adder can be decomposed into a chain of N-1 2-input adders like this:
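The decomposition itself is trivial to express (a sketch in plain Python, function name illustrative): each stage of the chain is a single 2-input addition, and associativity guarantees the final sum is the same.

```python
def chain_sum(terms):
    """Add N terms with a chain of 2-input adders.

    Starting the accumulator from zero models a cascade input into the
    first adder of the chain.
    """
    acc = 0.0
    for t in terms:
        acc = acc + t       # one 2-input adder per stage
    return acc

assert chain_sum([1.0, 2.0, 3.0, 4.0]) == 10.0
```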

You may argue that the first adder in the chain is redundant and we do not need it, and you would be correct from a mathematical point of view: you need N-1 adders, not N, to add N terms. But we can already see a repeating pattern emerging; we can build a filter of any size N by simply cascading N basic modules of this type, one after another, like Lego pieces. It would make sense to have all the modules identical, even though the first adder and the last delay in the chain are redundant. We will see later that, at least in the Xilinx FPGA architecture, once you use a multiplier you get the adder and the delay for free, so we are not really wasting anything. There are also deeper reasons why we would want to have that first input into the adder cascade: at some point we might need to replace the zero constant with a non-zero value for rounding purposes, or even with another signal.
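The repeating Lego piece, one multiplier, one 2-input adder and one delay per coefficient, can be sketched behaviourally like this (plain Python, names illustrative; the first cascade input is tied to zero, exactly the "redundant" adder discussed above):

```python
class MacStage:
    """One repeating FIR module: multiply, add the cascade input, delay the data."""

    def __init__(self, coeff):
        self.c = coeff
        self.x_reg = 0.0        # z^-1 register on the data path

    def clock(self, x_in, sum_in):
        """Return (delayed data, sum_in + c * x_in) for the next stage."""
        sum_out = sum_in + self.c * x_in
        x_out, self.x_reg = self.x_reg, x_in
        return x_out, sum_out

def fir_cascade(stages, x_n):
    """Feed one sample through the chain; the first cascade input is zero."""
    data, acc = x_n, 0.0        # "redundant" first adder computes 0 + c[0]*x[n]
    for stage in stages:
        data, acc = stage.clock(data, acc)
    return acc                  # the last stage's data output goes unused

# Any filter size N is just N identical modules cascaded together
stages = [MacStage(c) for c in [1.0, 2.0, 1.0]]
y = [fir_cascade(stages, x) for x in [1.0, 0.0, 0.0, 0.0]]
# y == [1.0, 2.0, 1.0, 0.0], the impulse response again
```

Note that the sum path in this model is purely combinational, which is exactly the throughput problem discussed next.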

This structure is called the direct form implementation of the FIR equation. In a later post we will encounter a second, equivalent implementation called the transpose form, but we will focus on the direct form for now. It is scalable in terms of device utilization, but not in terms of speed. As N increases and we cascade more and more 2-input adders, it takes longer and longer to compute the final sum, so the throughput will be inversely proportional to N rather than constant. In future posts we will see that this problem too can be addressed, we can achieve scalable throughput at the cost of increased latency through the use of pipelining.

The next post will look at the same basic building blocks, but from the hardware implementation point of view, with the accent on what needs to be done to arrive at a generic and scalable FIR implementation.

Back to the top: The Art of FPGA Design Season 2