SystemVerilog Study Notes. Simplified Floating Point Arithmetic. RTL Combinational Circuit

1 Sep 2022

We continue with combinational circuit design exercises in SystemVerilog. This time we are going to do exercises on number representation formats using a simplified floating point format.

Floating point arithmetic

Floating-point arithmetic (FP) is arithmetic that uses a formulaic representation of real numbers as an approximation, trading off range against precision. Floating point is another format for representing a number: with the same number of bits, the range of a floating-point format is much larger than that of a signed integer format. In general, a floating-point number is represented approximately with a fixed number of significant digits (the significand) and scaled using an exponent in some fixed base; the base for the scaling is normally two, ten, or sixteen. A number that can be represented exactly is of the following form:

significand * base^exponent

where significand is an integer, base is an integer greater than or equal to two, and exponent is also an integer.

For example:

1.2345 = 12345 * 10^-4

SystemVerilog has built-in floating point data types such as real, but they are too complex to be synthesized automatically.

Simplified 13-bit format

For these exercises we will use a simplified 13-bit format, ignoring the round-off error.

The representation consists of:

1-bit sign, s: indicates the sign of the number (1'b1 for negative)
4-bit exponent field, e: represents the exponent
8-bit significand field, f: represents the significand, or fraction

In this format the value of a floating point number is

(-1)^s * .f * 2^e

The .f*2^e is the magnitude of the number.

(-1)^s is a formal way to state that s equal to 1 implies a negative number. The sign bit is kept separate from the rest of the number.

When the MSB of the significand field is 1, the number is in normalized representation.

The smallest normalized nonzero magnitude in this number format representation is

0.1000_0000 * 2^0000

We also make the following assumptions:

Both exponent and significand fields are in unsigned format
The representation has to be either normalized or zero. If the magnitude of a computation result is smaller than the smallest normalized nonzero magnitude, it must be converted to zero.

A floating-point number consists of two fixed-point components, whose ranges depend exclusively on the number of bits or digits in their representation. The floating-point range depends linearly on the significand range and exponentially on the exponent range, which gives the format a much wider range.

Under the above assumptions, the largest and smallest nonzero magnitudes for our simplified 13-bit format are 0.1111_1111 * 2^1111 and 0.1000_0000 * 2^0000. Reading .f as a binary fraction, that is a range from 0.5 (0.5 * 2^0) up to 32,640 (255/256 * 2^15).
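To double-check these limits, here is a minimal Python sketch. Python is used only as a calculator here; it is not part of the design. The helper name fp_value is our own, and it assumes that .f is read as the binary fraction frac / 2^8.

```python
from fractions import Fraction

def fp_value(sign: int, exp: int, frac: int) -> Fraction:
    """Value of a 13-bit word: (-1)^sign * 0.frac * 2^exp (frac is a binary fraction)."""
    magnitude = Fraction(frac, 256) * 2 ** exp  # 0.f means frac / 2^8
    return -magnitude if sign else magnitude

smallest = fp_value(0, 0b0000, 0b1000_0000)  # 0.1000_0000 * 2^0000
largest = fp_value(0, 0b1111, 0b1111_1111)   # 0.1111_1111 * 2^1111
print(float(smallest), float(largest))       # prints: 0.5 32640.0
```

Using Fraction keeps the arithmetic exact, so the printed limits are not affected by rounding in the calculator itself.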

Simplified floating point adder

We are going to design a floating point adder that follows the same steps as when we do the addition manually when working with scientific notation.
The computation is done in several steps as indicated in the diagram:

      | operands         | sort             | align            | add/sub | normalize
eg. 1 | +0.54e3, -0.87e4 | -0.87e4, +0.54e3 | -0.87e4, +0.05e4 | -0.82e4 | -0.82e4
eg. 2 | +0.54e3, -0.55e3 | -0.55e3, +0.54e3 | -0.55e3, +0.54e3 | -0.01e3 | -0.10e2
eg. 3 | +0.54e0, -0.55e0 | -0.55e0, +0.54e0 | -0.55e0, +0.54e0 | -0.01e0 | -0.00e0
eg. 4 | +0.56e3, +0.52e3 | +0.56e3, +0.52e3 | +0.56e3, +0.52e3 | +1.08e3 | +0.10e4

Sorting: puts the number with the larger magnitude on top and the number with the smaller magnitude on the bottom. The results are big_number and small_number.
Alignment: aligns the two numbers so that they have the same exponent. The exponent of small_number is adjusted to match the exponent of big_number, and the significand of small_number is shifted to the right by the difference in exponents.
Addition/subtraction: adds or subtracts the significands of the two aligned numbers.
Normalization: adjusts the result to normalized format if
1. after subtraction the result contains leading zeros
2. or after subtraction the result is too small to be normalized, so it needs to be converted to zero
3. or after addition the result generates a carry-out bit

We will ignore rounding: during alignment and normalization, the lower bits of the significand are discarded when they are shifted out.
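The four steps can be tried out on the decimal examples before moving to hardware. The following Python sketch is our own illustration, not part of the RTL: the function name add_decimal and the tuple layout (sign, digits, exp) are assumptions, with sign being +1 or -1, digits a two-digit significand (0 to 99) and exp a decimal exponent. Exponent overflow is not handled.

```python
def add_decimal(a, b):
    """Add two numbers given as (sign, digits, exp) = sign * 0.digits * 10**exp."""
    # Sort: larger magnitude first (compare exponent, then digits).
    big, small = (a, b) if (a[2], a[1]) >= (b[2], b[1]) else (b, a)
    sign, digits, exp = big
    # Align: shift the smaller significand right; shifted-out digits are lost.
    small_digits = small[1] // 10 ** (exp - small[2])
    # Add when the signs match, subtract otherwise.
    total = digits + small_digits if small[0] == sign else digits - small_digits
    # Normalize.
    if total >= 100:                       # carry out: shift right, bump exponent
        total, exp = total // 10, exp + 1
    else:
        while 0 < total < 10 and exp > 0:  # leading zero: shift left
            total, exp = total * 10, exp - 1
        if total < 10:                     # too small to normalize: convert to zero
            total, exp = 0, 0
    return sign, total, exp

print(add_decimal((+1, 54, 3), (-1, 87, 4)))  # eg. 1 -> (-1, 82, 4), i.e. -0.82e4
print(add_decimal((+1, 56, 3), (+1, 52, 3)))  # eg. 4 -> (1, 10, 4), i.e. +0.10e4
```

The same structure, with base 2 and fixed bit widths, is what the modules in the rest of these notes implement.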

Floating point packed struct

In SystemVerilog we can create structured data types which we use to group a number of related variables together. We will create a packed structure to group the data that represents a 13-bit floating point number of the format that we have previously defined.

package FloatingPointPkg;

// 13-bit floating point
// 1-bit sign, s: which indicates the sign of the number (1'b1 for negative)
// 4-bit exponent field, e: which represents the exponent
// 8-bit significand field, f: which represents the significand or the fraction
typedef struct packed {
    logic sign;
    logic [3:0] exp;
    logic [7:0] frac;
} fp_t;

endpackage: FloatingPointPkg

We can define all our types inside a package and simply import them wherever we need them in our code. We will save the package with the new data type in a new file, "fp_types.sv", so that all modules that use it can import it.

To import:

import FloatingPointPkg::fp_t;
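Because fp_t is a packed struct, its members are stored contiguously with the first member in the most significant bits, so a value of this type is also a single 13-bit vector {sign, exp, frac}. The Python sketch below mirrors that layout; the helper names pack and unpack are our own and exist only for illustration.

```python
def pack(sign: int, exp: int, frac: int) -> int:
    """Build the 13-bit word {sign, exp, frac}: sign in bit 12, exp in 11:8, frac in 7:0."""
    return (sign & 1) << 12 | (exp & 0xF) << 8 | (frac & 0xFF)

def unpack(word: int):
    """Split a 13-bit word back into (sign, exp, frac)."""
    return (word >> 12) & 1, (word >> 8) & 0xF, word & 0xFF

word = pack(1, 0b0011, 0b1110_1000)
print(f"{word:013b}")  # prints: 1001111101000
print(unpack(word))    # prints: (1, 3, 232)
```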

Floating point sorter module

We design the adder in stages. The first stage orders the two numbers from larger to smaller magnitude, ignoring the sign, just as we would write the larger number above the smaller one before subtracting by hand.

The sorter module assigns the number with the larger magnitude to the big_number output and the number with the smaller magnitude to the small_number output.

One possible implementation in SystemVerilog follows. Note that we use the structure we created to represent floating point numbers, for code clarity.

// Assigns the number with the larger magnitude to the big_number output
// and the number with the smaller magnitude to the small_number output
module fp_sorter(
    input fp_t a,
    input fp_t b,
    output fp_t big_number,
    output fp_t small_number);

    assign big_number = ({a.exp, a.frac} >= {b.exp, b.frac})? a: b;
    assign small_number = ({a.exp, a.frac} < {b.exp, b.frac})? a: b;
endmodule

We need to import our floating point data type:

import FloatingPointPkg::fp_t;
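The sorter compares the concatenation {exp, frac} as one unsigned 12-bit number. Since exp sits in the more significant bits, any difference in exponent dominates, and for inputs that are normalized or zero this order matches the order of the magnitudes. The following Python sketch models that comparison; fp_sort and the (sign, exp, frac) tuples are our own naming, not part of the design.

```python
def fp_sort(a, b):
    """Return (big, small) by magnitude; a and b are (sign, exp, frac) tuples."""
    key_a = a[1] << 8 | a[2]  # models {a.exp, a.frac}
    key_b = b[1] << 8 | b[2]  # models {b.exp, b.frac}
    # Same choice as the two assign statements: on a tie, a is "big".
    return (a, b) if key_a >= key_b else (b, a)

# Third vector of the sorter testbench: zero against a negative number.
big, small = fp_sort((0, 0b0000, 0b0000_0000), (1, 0b0001, 0b1111_0000))
print(big, small)  # prints: (1, 1, 240) (0, 0, 0)
```

The sign is carried along but never compared, which is why the negative operand ends up as the "big" one here.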

A possible testbench

module fp_sorter_testbench;
    fp_t a;
    fp_t b;
    fp_t bign;
    fp_t smalln;

    fp_sorter uut(.a(a), .b(b), .big_number(bign), .small_number(smalln));

    initial
    begin
        a ='{1'b0, 4'b1111, 8'b1111_1111};   b ='{1'b0, 4'b0001, 8'b1111_0000}; #10;
        a ='{1'b0, 4'b0000, 8'b0000_0000};   b ='{1'b0, 4'b0001, 8'b1111_0000}; #10;
        a ='{1'b0, 4'b0000, 8'b0000_0000};   b ='{1'b1, 4'b0001, 8'b1111_0000}; #10;
        a ='{1'b0, 4'b0001, 8'b1111_0000};   b ='{1'b0, 4'b1111, 8'b1111_1111}; #10;
        $stop;
    end

endmodule

Simulation

The new sorter module returns the larger and the smaller number in magnitude, regardless of sign.

Schematic

Two comparators compare the fractional parts and exponents of both numbers. Based on the output signals of the two comparators, four 2-to-1 multiplexers route the fractional and exponent signals of the two numbers to the outputs that represent the larger and the smaller number in our sorter.

Alignment module

The alignment module aligns the two numbers so that they have the same exponent. It adjusts the exponent of small_number to match the exponent of big_number, and shifts the significand of small_number to the right by the difference in exponents.

`timescale 1ns / 1ps

import FloatingPointPkg::fp_t;

module fp_aligment(
    input fp_t bign,
    input fp_t smalln,
    output fp_t aligned );
    
    logic [3:0] exp_diff;
    always_comb
    begin
        exp_diff = bign.exp - smalln.exp;
        aligned.frac = smalln.frac >> exp_diff;
        aligned.exp = bign.exp;
        aligned.sign = smalln.sign;        
    end
endmodule

Simulation

Schematic

The difference in exponents is passed to a right shifter that shifts the significand of the small_number. The exponent of the aligned result is set to the value of the exponent of the big number.
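A small Python model makes the loss of the shifted-out bits visible. This is a sketch under our own naming (fp_align, tuples of (sign, exp, frac)); it assumes, as the sorter guarantees, that the first argument has the larger or equal exponent.

```python
def fp_align(big, small):
    """Return small re-expressed with big's exponent; inputs are (sign, exp, frac)."""
    exp_diff = (big[1] - small[1]) & 0xF         # 4-bit difference, as in the RTL
    aligned_frac = (small[2] >> exp_diff) & 0xFF  # low bits fall off the end
    return small[0], big[1], aligned_frac

print(fp_align((0, 3, 0b1010_0000), (0, 2, 0b1001_0000)))  # prints: (0, 3, 72)
print(fp_align((0, 15, 0b1111_1111), (0, 1, 0b1111_0000)))  # prints: (0, 15, 0)
```

In the second call the exponents differ by 14, so every bit of the smaller significand is shifted out and the aligned operand becomes zero.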

Add/subtract module

This module adds or subtracts the significands of two aligned numbers.

`timescale 1ns / 10ps

import FloatingPointPkg::fp_t;

// This module adds or subtracts the significands of two aligned numbers (same exponent).
// It assumes the numbers are ordered: big first, then small.
module fp_sum_significands (
    input fp_t bign,
    input fp_t smalln,
    output logic [8:0] sum);
    
    assign sum = (bign.sign == smalln.sign) ?
     {1'b0, bign.frac} + {1'b0, smalln.frac}
     : {1'b0, bign.frac} - {1'b0, smalln.frac};
    
endmodule

Testbench

module fp_sum_significands_testbench;

    fp_t bign;
    fp_t smalln;
    logic [8:0] sum;

    fp_sum_significands uut(.sum(sum), .bign(bign), .smalln(smalln));

    initial
    begin
        bign ='{1'b0, 4'b0011, 8'b1111_1111};   smalln ='{1'b0, 4'b0011, 8'b1111_0000}; #10;
        bign ='{1'b1, 4'b0011, 8'b1111_1111};   smalln ='{1'b1, 4'b0011, 8'b0011_0000}; #10;
        bign ='{1'b1, 4'b0011, 8'b1111_1111};   smalln ='{1'b0, 4'b0011, 8'b0011_0000}; #10;
        bign ='{1'b0, 4'b0011, 8'b1111_1111};   smalln ='{1'b1, 4'b0011, 8'b0011_0000}; #10;
        $stop;
    end

endmodule

Simulation

The 2-to-1 multiplexer selects the output based on the sign bits of both numbers: if the signs are equal it routes the addition result; if they are different it routes the subtraction result.
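The expected values for the testbench vectors can be worked out with a short Python model. This is only a sketch: the function reuses the module name for readability, takes (sign, exp, frac) tuples of our own choosing, and returns the 9-bit sum whose bit 8 is the carry out.

```python
def fp_sum_significands(big, small):
    """9-bit sum or difference of two aligned significands; inputs are (sign, exp, frac)."""
    if big[0] == small[0]:
        total = big[2] + small[2]  # same sign: magnitudes add
    else:
        total = big[2] - small[2]  # different sign: magnitudes subtract
    return total & 0x1FF           # keep 9 bits, like logic [8:0] sum

s = fp_sum_significands((0, 3, 0b1111_1111), (0, 3, 0b1111_0000))
print(f"{s:09b}", s >> 8)  # prints: 111101111 1
s = fp_sum_significands((1, 3, 0b1111_1111), (0, 3, 0b0011_0000))
print(f"{s:09b}", s >> 8)  # prints: 011001111 0
```

The ninth bit is only set by the addition when the operands are sorted and normalized, because the big significand is then never smaller than the aligned small one.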

Leading 0s counter module

This module counts the number of leading zeros in an 8-bit number. It works like a priority encoder. It assumes that there is at least one high bit (value 1'b1); if no bit is high, it returns the highest count, 7.

This won't affect the next stage, because the result is used to shift the number to the left by the number of leading zeros. If all bits are zero, the shifted significand is zero whatever count is returned, so the value is irrelevant.

`timescale 1ns / 1ps

// outputs the number of leading zeros in an 8-bit number
// assumes that there is at least one high bit (value 1'b1)
// if no bit is high it returns the highest count, 7
module fp_leading_zeros(
    input logic [7:0] number,
    output logic [2:0] lead0s
);

    always_comb
    begin
        if(number[7])
            begin
                lead0s = 3'o0;
            end
        else if (number[6])
            begin
                lead0s = 3'o1;
            end
        else if (number[5])
            begin
                lead0s = 3'o2;
            end
        else if (number[4])
            begin
                lead0s = 3'o3;
            end
        else if (number[3])
            begin
                lead0s = 3'o4;
            end
        else if (number[2])
            begin
                lead0s = 3'o5;
            end
        else if (number[1])
            begin
                lead0s = 3'o6;
            end
        else
            begin
                lead0s = 3'o7;
            end
    end
endmodule

Testbench

module fp_leading_zeros_testbench;

    logic [7:0] number;
    logic [2:0] lead0s;

    fp_leading_zeros uut(.*);

    initial
    begin
        number = 8'b1111_1111; #10;
        number = 8'b0111_1111; #10;
        number = 8'b0011_1111; #10;
        number = 8'b0001_1111; #10;
        number = 8'b0000_1111; #10;
        number = 8'b0000_0111; #10;
        number = 8'b0000_0011; #10;
        number = 8'b0000_0001; #10;
        number = 8'b0000_0000; #10;
        $stop;
    end

endmodule

Simulation

Schematics

Like a priority encoder the priority network is implemented by a sequence of 2-to-1 multiplexers.
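For comparison with the simulation, the same count can be computed in Python from the position of the highest set bit. This sketch is our own cross-check, not part of the design; it uses int.bit_length and clamps the all-zero case to 7 to match the module.

```python
def leading_zeros(number: int) -> int:
    """Leading zeros of an 8-bit value, saturating at 7 like fp_leading_zeros."""
    number &= 0xFF
    # bit_length() is the index of the highest set bit plus one (0 for zero).
    return min(8 - number.bit_length(), 7)

# Same stimulus as the testbench above.
vectors = [0b1111_1111, 0b0111_1111, 0b0011_1111, 0b0001_1111,
           0b0000_1111, 0b0000_0111, 0b0000_0011, 0b0000_0001, 0b0000_0000]
print([leading_zeros(v) for v in vectors])  # prints: [0, 1, 2, 3, 4, 5, 6, 7, 7]
```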

Normalization module

The normalization module adjusts the result to normalized format in three cases: after subtraction, when the result contains leading zeros; after subtraction, when the result is too small to be normalized and must be converted to zero; and after addition, when the result generates a carry-out bit.

A possible SystemVerilog implementation follows. It first counts the leading zeros of the significand and then shifts the significand accordingly.

// normalizes an unnormalized floating point number, given the carry-out signal
module fp_normalize(
    input logic carry_out,
    input fp_t unnormalized,
    output fp_t normalized  );

    logic [2:0] lead_zeros;
    // leading zeros, not including the carry out
    fp_leading_zeros lead_zeros_unit(.number(unnormalized.frac),.lead0s(lead_zeros));

    always_comb
    begin
        if(carry_out) // with carry out, shift frac to the right
            begin
                normalized.exp = unnormalized.exp + 1;
                normalized.frac = {1'b1, unnormalized.frac[7:1]};
            end else if(lead_zeros > unnormalized.exp)
            begin
                normalized.exp = 0; // set to zero
                normalized.frac = 0;
            end else
            begin
                normalized.exp = unnormalized.exp - lead_zeros;
                normalized.frac = unnormalized.frac << lead_zeros; // shift significand according to leading 0s
            end
        normalized.sign = unnormalized.sign;
    end
endmodule

Testbench

module fp_normalize_testbench;

    logic carry_out;
    fp_t unnormalized;
    fp_t normalized;


    fp_normalize uut(.*);

    initial
    begin
           carry_out = 1; unnormalized='{1'b1, 4'b0011, 8'b0000_1000}; #10;
           carry_out = 1; unnormalized='{1'b1, 4'b0011, 8'b1000_1000}; #10;
           carry_out = 0; unnormalized='{1'b1, 4'b0111, 8'b0000_1000}; #10;
           carry_out = 0; unnormalized='{1'b0, 4'b1011, 8'b1000_1000}; #10;
        $stop;
    end

endmodule
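The three branches of the normalizer can be checked against a short Python model. This is a sketch with our own argument layout; it mirrors the RTL as written, including the 4-bit wrap of the exponent when a carry out arrives with the exponent already at 15, a case the module does not treat specially.

```python
def fp_normalize(carry_out, sign, exp, frac):
    """Model of fp_normalize; returns (sign, exp, frac) of the normalized number."""
    if carry_out:
        # Carry out: shift right one place, the carry becomes the new MSB.
        return sign, (exp + 1) & 0xF, 0x80 | (frac >> 1)
    lead = min(8 - frac.bit_length(), 7)  # leading zeros, saturating at 7
    if lead > exp:
        return sign, 0, 0                  # too small to normalize: zero
    return sign, exp - lead, (frac << lead) & 0xFF

# The four vectors of the testbench above.
print(fp_normalize(1, 1, 0b0011, 0b0000_1000))  # prints: (1, 4, 132)
print(fp_normalize(1, 1, 0b0011, 0b1000_1000))  # prints: (1, 4, 196)
print(fp_normalize(0, 1, 0b0111, 0b0000_1000))  # prints: (1, 3, 128)
print(fp_normalize(0, 0, 0b1011, 0b1000_1000))  # prints: (0, 11, 136)
```

None of the testbench vectors reaches the convert-to-zero branch; an input such as exponent 2 with significand 0000_1000 (four leading zeros) would.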

Simulation

Putting it all together: Top Floating point Adder module

Finally we instantiate and connect the modules that we have designed previously:

 // circuit for reordering the inputs
 fp_sorter sort(.a(a), .b(b), .big_number(bign), .small_number(smalln));

 // circuit for aligning the smallest number
 fp_aligment align(.aligned(small_aligned), .bign(bign), .smalln(smalln));

 // circuit for adding/subtracting the significands; the 9th bit (MSB) of sum is the carry out
 fp_sum_significands sum_significands(.sum(sum), .bign(bign), .smalln(small_aligned));

 // circuit for normalizing the output
 fp_normalize normalize(.carry_out(sum[8]), .unnormalized(unnormalized), .normalized(result));
 
 // connect the addition/subtraction result to the normalizer
 assign unnormalized = '{bign.sign, bign.exp, sum[7:0]};

SystemVerilog Code

`timescale 1ns / 10ps

import FloatingPointPkg::fp_t;

// binary floating point adder
module fp_adder (
    input fp_t a,
    input fp_t b,
    output fp_t result
);
    fp_t bign;  // big operand in absolute magnitude after sorting
    fp_t smalln; // small operand in absolute magnitude after sorting
    fp_t small_aligned; // small operand aligned with the big one, same exponents
    logic [8:0] sum;  // sum of the two aligned significands with carry out
    fp_t unnormalized; // result before normalization

    // circuit for reordering the inputs
    fp_sorter sort(.a(a), .b(b), .big_number(bign), .small_number(smalln));
    // circuit for aligning the smallest number
    fp_aligment align(.aligned(small_aligned), .bign(bign), .smalln(smalln));
    // circuit for adding/subtracting the significands
    fp_sum_significands sum_significands(.sum(sum), .bign(bign), .smalln(small_aligned));
    // circuit for normalizing the output
    fp_normalize normalize(.carry_out(sum[8]), .unnormalized(unnormalized), .normalized(result));
    
    // connect the addition/subtraction result to the normalizer
    assign unnormalized = '{bign.sign, bign.exp, sum[7:0]};

endmodule

Testbench

module fp_adder_testbench;
    fp_t a;
    fp_t b;
    fp_t c;

     fp_adder uut(.a(a), .b(b), .result(c));
    

    initial
    begin  
        a ='{1'b0, 4'b0001, 8'b1000_0000};   b ='{1'b0, 4'b0001, 8'b1000_0000}; #10; 
        a ='{1'b0, 4'b0111, 8'b1000_0000};   b ='{1'b0, 4'b0001, 8'b1000_0000}; #10; 
        a ='{1'b0, 4'b0011, 8'b1010_0000};   b ='{1'b0, 4'b0010, 8'b1001_0000}; #10; // 0.625 * 2^3 + 0.5625 * 2^2 = 5 + 2.25 = 7.25
                                                                                     // 0_0011_1110_1000 = 0.90625 * 2^3 = 7.25
        a ='{1'b0, 4'b0000, 8'b1000_0000};   b ='{1'b0, 4'b0001, 8'b1000_0000}; #10;
        a ='{1'b0, 4'b0000, 8'b1000_0000};   b ='{1'b1, 4'b0001, 8'b1111_0000}; #10;
        a ='{1'b0, 4'b0001, 8'b1111_0000};   b ='{1'b0, 4'b0011, 8'b1111_1111}; #10;
        $stop;
    end

endmodule
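To know what to look for in the waveform, the expected result of each testbench vector can be computed with a compact Python model that chains the four stages. This is a sketch and not a verified golden model: fp_add and the (sign, exp, frac) tuples are our own naming, and the code simply mirrors the RTL above, truncation included.

```python
def fp_add(a, b):
    """Model of fp_adder; a, b and the result are (sign, exp, frac) tuples."""
    # Sort by magnitude using the {exp, frac} key; on a tie, a is the big one.
    big, small = (a, b) if (a[1], a[2]) >= (b[1], b[2]) else (b, a)
    sign, exp, frac = big
    # Align the small significand to the big exponent.
    small_frac = small[2] >> (exp - small[1])
    # Add or subtract, keeping 9 bits; bit 8 is the carry out.
    total = (frac + small_frac if small[0] == sign else frac - small_frac) & 0x1FF
    carry, frac = total >> 8, total & 0xFF
    # Normalize.
    if carry:
        return sign, (exp + 1) & 0xF, 0x80 | (frac >> 1)
    lead = min(8 - frac.bit_length(), 7)
    if lead > exp:
        return sign, 0, 0
    return sign, exp - lead, (frac << lead) & 0xFF

# Third testbench vector: 5 + 2.25 = 7.25 = 0.1110_1000 * 2^3
print(fp_add((0, 0b0011, 0b1010_0000), (0, 0b0010, 0b1001_0000)))  # prints: (0, 3, 232)
# Last testbench vector: carry out, and the low bits are truncated
print(fp_add((0, 0b0001, 0b1111_0000), (0, 0b0011, 0b1111_1111)))  # prints: (0, 4, 157)
```

In the last vector the exact sum is 9.84375, while the result decodes to 9.8125; the difference is the round-off error that this simplified format ignores.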

Simulation

Schematics

In the schematic we can see the four main blocks of our adder: the sorting circuit, the alignment circuit, the addition/subtraction circuit and the normalization circuit, all of them interconnected.
