Guide: Kuruvilla Varghese, Principal Research Scientist, DESE, IISc

**INTRODUCTION**

Floating Point arithmetic is by far the most used way of approximating real number arithmetic for performing numerical calculations on modern computers.

For a long time each computer had its own arithmetic: different bases, significand and exponent sizes, formats, etc. Each company implemented its own model, which hindered portability between different equipment until the IEEE 754 standard appeared and defined a single, universal standard.

The aim of this project is to implement a 32-bit binary floating point adder according to the IEEE 754 standard using the hardware description language VHDL.

The floating point number representation is based on scientific notation: the radix point is not set in a fixed position in the bit sequence; instead, its position is indicated by a power of the base.

All floating point numbers are composed of three stored components, plus an implied base:

• Sign: it indicates the sign of the number (0 for positive and 1 for negative)

• Mantissa: it sets the value of the number

• Exponent: it contains the value of the base power (biased)

• Base: the base (or radix) is implied and is common to all the numbers (2 for binary numbers)

The IEEE 754 standard specifies formats and methods for operating with floating point arithmetic and has been widely adopted by developers.

These methods for computation with floating point numbers yield the same result regardless of whether the processing is done in hardware, software or a combination of the two, and regardless of the implementation.

If the single precision format is used, the 32 bits are divided as follows:

The first bit (bit 31) holds the sign (S) of the number (0 for positive and 1 for negative).

The next w = 8 bits (bits 30 to 23) hold the biased exponent (E).

The rest of the string, t = 23 bits (bits 22 to 0), is reserved for the mantissa. The stored mantissa field is 23 bits long, but the significand has an additional implicit leading bit that depends on the type of number (1 for normal numbers and 0 for subnormal numbers).
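The field layout above can be illustrated with a short software model. This is a minimal Python sketch, not part of the VHDL design; the function names `unpack_fields` and `field_value` are chosen here for illustration, and the value formula assumes the single precision bias of 127 and finite inputs only.

```python
import struct

def unpack_fields(x: float):
    """Round x to single precision and split it into (sign, exponent, fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # bit 31
    exponent = (bits >> 23) & 0xFF   # bits 30..23, biased by 127
    fraction = bits & 0x7FFFFF       # bits 22..0
    return sign, exponent, fraction

def field_value(sign: int, exponent: int, fraction: int) -> float:
    """Value of a finite number from its fields (exponent 255 is not handled)."""
    if exponent == 0:
        # Subnormal (or zero): implicit bit is 0, exponent is fixed at 1 - 127
        significand = fraction / 2**23
        scale = 2.0 ** (1 - 127)
    else:
        # Normal: implicit bit is 1
        significand = 1 + fraction / 2**23
        scale = 2.0 ** (exponent - 127)
    return (-1) ** sign * significand * scale
```

For example, -6.5 is -1.625 × 2², so its fields are sign 1, exponent 129 and fraction 0x500000.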

**CODE DEVELOPMENT**

Now that the IEEE 754 standard has been explained, the implementation of the code can begin. First, it is necessary to think through the steps required to perform the operation. For this reason, this section describes the procedure for addition/subtraction operations and gives a first look at the code design in block diagram form.

Addition/Subtraction Steps

1. Extracting the signs, exponents, and mantissas of both numbers A and B.

2. Treating the special cases: operations with A or B equal to zero, operations with ±∞, operations with NaN.

3. Finding out what type of numbers are given: normal, subnormal or mixed.

4. Shifting the mantissa of the number with the lower exponent to the right by |Exp 1 − Exp 2| bits and setting the output exponent to the higher exponent.

5. Working with the operation symbol and both signs to calculate the output sign and determine the operation to perform.

6. Addition/subtraction of the numbers and detection of mantissa overflow (carry bit).

7. Standardizing the mantissa by shifting it to the left until the leading one is in the first position, and updating the value of the exponent according to the carry bit and the amount of shifting applied to the mantissa.

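8. Detecting exponent overflow or underflow. Under IEEE 754, overflow gives ±∞ and underflow gives a subnormal number or zero; NaN arises only from the special cases of step 2.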

Following these steps in order leads to a correct operation.
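Steps 4 and 6 above can be sketched in software as follows. This is a simplified Python model under stated assumptions, not the VHDL implementation: significands are 24-bit integers with the implicit bit already included, bits shifted out during alignment are simply dropped (no guard or sticky bits), and only an effective addition is shown. The names `align` and `add_magnitudes` are illustrative.

```python
def align(exp_a: int, sig_a: int, exp_b: int, sig_b: int):
    """Step 4: shift the significand with the lower exponent to the right.

    Returns (common_exponent, aligned_sig_a, aligned_sig_b).
    Bits shifted out on the right are discarded in this sketch.
    """
    shift = abs(exp_a - exp_b)
    if exp_a >= exp_b:
        return exp_a, sig_a, sig_b >> shift
    return exp_b, sig_a >> shift, sig_b

def add_magnitudes(sig_a: int, sig_b: int):
    """Step 6 (effective addition): return (24-bit sum, carry bit)."""
    total = sig_a + sig_b
    carry = total >> 24          # 1 if the sum overflowed 24 bits
    return total & 0xFFFFFF, carry
```

For 1.5 + 0.5, the operands are (exponent 127, significand 0xC00000) and (exponent 126, significand 0x800000). Alignment gives 0xC00000 and 0x400000 at exponent 127, and the sum overflows into the carry bit, which step 7 then absorbs into the exponent to give 2.0.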

A block diagram will be designed to support the explanation and facilitate the comprehension. Moreover, it will be used to design the different blocks in VHDL which form the 32-bit Floating Point Adder.

**PRE-ADDER**

The first block is the Pre-adder. It is in charge of identifying the type of numbers presented at the inputs.

Four different cases are possible:

- One of the special cases: at least one input is NaN, infinity or zero, in any combination with the other input types.
- Two subnormal numbers.
- A mix of one normal and one subnormal number.
- Two normal numbers.

All these cases must be treated separately because the processing needed to achieve a correct operation is different for each of them.
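The classification performed by the Pre-adder can be modelled as below. This is a hedged Python sketch of the decision logic only, operating on 32-bit patterns; the function names and the returned labels are assumptions made for illustration and do not come from the VHDL entities.

```python
def classify(bits: int) -> str:
    """Classify one single precision bit pattern by its exponent and fraction."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:
        return "nan" if fraction else "infinity"
    if exponent == 0:
        return "subnormal" if fraction else "zero"
    return "normal"

def pre_adder_case(bits_a: int, bits_b: int) -> str:
    """Select one of the four cases: special, subnormal, mixed or normal."""
    kinds = {classify(bits_a), classify(bits_b)}
    if kinds & {"nan", "infinity", "zero"}:
        return "special"
    if kinds == {"normal"}:
        return "normal"
    if kinds == {"subnormal"}:
        return "subnormal"
    return "mixed"
```

For instance, `pre_adder_case(0x3F800000, 0x00000001)` combines 1.0 with the smallest subnormal and falls in the mixed case.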

**ADDER**

This block is in charge of operating on the numbers which have been prepared in the Pre-adder block.

The adder is a fundamental piece of the design because it implements the addition/subtraction operation, the main purpose of the 32-bit Floating Point Adder. The Adder block is composed of two entities: signout and adder. Signout is responsible for the sign operation, and adder performs the addition itself.
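One possible way to express the sign logic is sketched below in Python. It assumes an interface that the source does not specify: the two input signs, a `subtract` flag (1 for A − B), and a flag telling whether |A| ≥ |B|. The sign of an exactly zero result is not treated here.

```python
def signout(sign_a: int, sign_b: int, subtract: int, a_is_larger: bool):
    """Return (result_sign, effective_subtract).

    effective_subtract is 1 when the magnitudes must be subtracted,
    0 when they must be added.
    """
    # A - B is the same as A + (-B): flip B's sign for a subtraction
    effective_sign_b = sign_b ^ subtract
    # Magnitudes are subtracted when the effective signs differ
    effective_subtract = sign_a ^ effective_sign_b
    # The result takes the sign of the operand with the larger magnitude
    result_sign = sign_a if a_is_larger else effective_sign_b
    return result_sign, effective_subtract
```

As a check, (+3) − (+5) gives `signout(0, 0, 1, False) == (1, 1)`: the magnitudes are subtracted and the result is negative.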

**STANDARDISER**

The Standardiser is responsible for presenting the result of the addition/subtraction operation according to the IEEE 754 standard. This block is composed of four entities: Shift_left, zero_counter, Round and normalize. Basically, they are in charge of taking the result obtained from the adder and expressing it in the same format in which the input numbers were given.
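The leading-zero count and left shift can be modelled as follows. This Python sketch is an assumption about how such a block may behave, not a transcription of the VHDL entities: it takes a 24-bit significand plus the carry bit from the adder, assumes a biased exponent of at least 1, truncates the bit lost on a carry, and omits the rounding stage entirely.

```python
def zero_counter(sig: int, width: int = 24) -> int:
    """Number of leading zeros of sig viewed as a width-bit word."""
    return width - sig.bit_length()

def normalise(exponent: int, sig: int, carry: int):
    """Return (exponent_field, 24-bit significand) after normalisation."""
    if carry:
        # Mantissa overflow: shift right once and increment the exponent
        return exponent + 1, ((carry << 24) | sig) >> 1
    if sig == 0:
        return 0, 0                       # exact zero
    # Shift left until the leading one is in bit 23,
    # but never below the minimum normal exponent (1)
    shift = min(zero_counter(sig), exponent - 1)
    sig <<= shift
    exponent -= shift
    if sig < (1 << 23):
        exponent = 0                      # leading one not reached: subnormal
    return exponent, sig
```

With these definitions, `normalise(127, 0x200000, 0)` shifts left by two positions and returns `(125, 0x800000)`.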

**RESULTS**

At this point, the simulations used to test the operation are discussed. As explained before, four different cases can occur: special cases, normal, subnormal or mixed numbers.

All the different possibilities must be tested, which is why the different data types are treated separately.

The procedure will be as follows:

- Enough different cases for each data type are considered to demonstrate correct operation. The binary values of the inputs and the output are grouped in a table.
- The result is obtained from the simulation and added to the table.
- The decimal value of the numbers and of the result is calculated with the formula explained in the IEEE 754 standard chapter.
- The simulation value is compared with the arithmetic value to see how similar or different the numbers are.
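A software reference can help with the comparison step. The Python sketch below is an assumed helper, not part of the original test procedure: it computes the expected single precision sum of two bit patterns under round to nearest, so that it can be compared with the simulated output. NaN results should be compared by class rather than bit for bit, since NaN payloads may differ.

```python
import struct

def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit pattern as a single precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def float_to_bits(x: float) -> int:
    """Round x to single precision and return its 32-bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def reference_sum_bits(bits_a: int, bits_b: int) -> int:
    """Expected bit pattern of A + B in single precision."""
    total = bits_to_float(bits_a) + bits_to_float(bits_b)
    try:
        return float_to_bits(total)
    except OverflowError:
        # struct refuses finite values beyond the single precision range;
        # IEEE 754 round to nearest gives infinity of the same sign here
        return 0x7F800000 if total > 0 else 0xFF800000
```

For example, `reference_sum_bits(0x3FC00000, 0x3F000000)` (1.5 + 0.5) returns `0x40000000`, the pattern of 2.0.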

**CONCLUSION**

The goal reached is the implementation of a 32-bit adder/subtractor based on floating point arithmetic according to the IEEE 754 standard. This design works with all the numbers defined by the standard: normal and subnormal. Furthermore, all the special cases, such as NaN, zero and infinity, are taken into account.

The adder and the shifter are implemented with a known structure. Predefined operators such as addition (+) or shifting (SLL or SRL) are available, but a generate statement is used instead to build these blocks, and the device has an improved time response.

Finally, the code was implemented on an FPGA and tested physically on a board, which leaves the design completely finished.