What is floating-point and how are floating-point numbers stored on a computer?

Computer representations of floating-point numbers typically use a form of rounding to significant figures, but with binary numbers. The number of correct significant figures is closely related to the notion of relative error (which has the advantage of being a more accurate measure of precision, and is independent of the radix of the number system used).

Floating-point is ubiquitous (everywhere) in computer systems:

  • Almost every language has a floating-point datatype (JavaScript, Python, Java, Oracle (SQL), …).
  • Computers from PCs to supercomputers have floating-point accelerators.
  • Most compilers will be called upon to compile floating-point algorithms from time to time.
  • Every operating system must respond to floating-point exceptions such as overflow.

In general, the exact result of a floating-point calculation has more digits than its physical representation can hold (typically 32 or 64 bits). The result must therefore often be rounded in order to fit back into its finite representation.

This rounding error is the characteristic feature of floating-point computation.
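As a small illustration of a value being rounded to fit a 32-bit representation, the sketch below uses the standard `Math.fround` function, which rounds a JavaScript number (a 64-bit float) to the nearest 32-bit float. The variable names are chosen for this example only.

```javascript
// 0.1 stored as a 64-bit float (the default JavaScript number).
const asDouble = 0.1;

// The same literal rounded to the nearest 32-bit float.
const asFloat32 = Math.fround(0.1);

// Both are approximations of 0.1, but the 32-bit one keeps fewer bits,
// so the two stored values differ.
console.log(asDouble === asFloat32); // false

// Size of the extra rounding error introduced by the narrower format.
console.log(Math.abs(asDouble - asFloat32));
```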


Approximate numeric

Avoid float and double if exact answers are required

If you need precise numbers (e.g. money), see fixed-point number (exact numeric).
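One common way to get exact money arithmetic, sketched here under the assumption that amounts have two decimal places, is to store them as an integer number of cents. The `formatCents` helper is hypothetical and only serves this example.

```javascript
// With floats, the sum of 0.10 and 0.20 is not exactly 0.30.
const floatSum = 0.1 + 0.2;
console.log(floatSum === 0.3); // false

// With integer cents, the same sum is exact.
const centsSum = 10 + 20;
console.log(centsSum === 30); // true

// Hypothetical helper: formats an integer amount of cents as a decimal string.
function formatCents(cents) {
  const sign = cents < 0 ? "-" : "";
  const abs = Math.abs(cents);
  const units = Math.floor(abs / 100);
  const fraction = String(abs % 100).padStart(2, "0");
  return sign + units + "." + fraction;
}

console.log(formatCents(centsSum)); // 0.30
```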

Floats are great for geometry (2D, 3D, …).

Rounding Error

Floating-point arithmetic can only produce approximate results, rounding to the nearest representable real number.

Floating-point numbers offer a trade-off between accuracy and performance.

With a 52-bit fraction (as in a 64-bit double), if you're trying to represent a number whose binary expansion repeats endlessly, the expansion is cut off after 52 bits.

Unfortunately, most software needs to produce output in base 10, and common fractions in base 10 are often repeating fractions in binary.

For example:

  • 1.1 decimal is binary 1.0001100110011…;
  • 0.1 = 1/16 + 1/32 + 1/256 plus an infinite number of additional terms.
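The binary digits in the list above can be reproduced with the classic "multiply by two and take the integer part" method. The function below is a minimal sketch for fractions between 0 and 1; `fractionBits` is a name made up for this example.

```javascript
// Returns the first `count` binary digits after the radix point of x (0 <= x < 1).
function fractionBits(x, count) {
  let bits = "";
  let rest = x;
  for (let i = 0; i < count; i++) {
    rest *= 2;           // shift the next binary digit in front of the radix point
    if (rest >= 1) {
      bits += "1";
      rest -= 1;         // drop the digit that was just emitted
    } else {
      bits += "0";
    }
  }
  return bits;
}

console.log(fractionBits(0.1, 13)); // 0001100110011
```

The pattern 0011 keeps repeating, which is why 0.1 has no finite binary representation.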

An IEEE 754 double has to cut that infinitely repeating binary expansion off after 52 fraction bits, so the representation is slightly inaccurate.

Sometimes you can see this inaccuracy when the number is printed:

>>> 1.1
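In JavaScript, the default output hides the inaccuracy because it prints the shortest decimal string that maps back to the same double. Asking for more significant digits with the standard `toPrecision` method makes it visible, as this short sketch shows:

```javascript
const x = 1.1;

// Default conversion: shortest string that round-trips to the same double.
console.log(x.toString());      // 1.1

// 17 significant digits expose the value that is actually stored.
console.log(x.toPrecision(17)); // 1.1000000000000001
```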

Guard Digits

Guard Digits are a means of reducing the error when subtracting two nearby numbers.


Floats (doubles) are fast because they are a native type. Floats are usable with vector registers (xmm etc.) whereas decimals aren't.

In general, processors execute integer operations much faster than floating-point operations.


  • The first loop is easily twice as fast as the second loop.
// Integer
for (let i = 0; i < 1000; ++i) {
  // fast 🚀
}

// Float
for (let i = 0.1; i < 1000.1; ++i) {
  // slow 🐌
}
  • The performance of the modulo operator depends on whether you're dealing with integers or not.
const remainder = value % divisor;
// Fast 🚀 if `value` and `divisor` are represented as integers,
// slow 🐌 otherwise.


The IEEE standard gives an algorithm for addition, subtraction, multiplication, division and square root, and requires that implementations produce the same result as that algorithm.




Because of rounding error, an equality function for floating-point values usually takes a delta parameter that defines the permissible rounding error.

Delta (or epsilon) is defined such that: <MATH> | expected - actual | < epsilon </MATH>

Example: AssertEquals of double
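A minimal sketch of such a comparison in JavaScript follows. The function name `approxEqual` and the default epsilon of 1e-9 are assumptions made for this example, not part of any standard API.

```javascript
// Returns true when |expected - actual| < epsilon.
// The default epsilon is an arbitrary choice for this sketch.
function approxEqual(expected, actual, epsilon = 1e-9) {
  return Math.abs(expected - actual) < epsilon;
}

console.log(0.1 + 0.2 === 0.3);           // false: exact comparison fails
console.log(approxEqual(0.3, 0.1 + 0.2)); // true: within the permitted error
console.log(approxEqual(0.3, 0.31));      // false: difference exceeds epsilon
```

An absolute epsilon like this one suits values of a known, moderate magnitude. For very large or very small values, a tolerance relative to the operands is usually a better fit.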

Associativity Error

Addition of real numbers is associative, but this is not always true of floating-point numbers:

console.log(   (0.1 + 0.2) + 0.3   ); // 0.6000000000000001
console.log(    0.1 + (0.2 + 0.3)  ); // 0.6

console.log(   ( (0.1 + 0.2) + 0.3 ) == ( 0.1 + (0.2 + 0.3) )  ); // false

Inexact representations

Always remember that floating-point representations using float and double are inexact.

For example, consider these JavaScript number expressions (the JavaScript number type is always a 64-bit float):

console.log(999199.1231231235 == 999199.1231231236) // true
console.log(1.03 - 0.41) // 0.6200000000000001

In Java, for exactness, you want to use BigDecimal.

to Integer

Doubles can represent integers exactly with up to 53 bits of precision.

All of the integers from -9,007,199,254,740,992 (-2^53) to 9,007,199,254,740,992 (2^53) are therefore exactly representable as doubles.
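The 2^53 boundary can be observed directly in JavaScript, whose numbers are doubles. The sketch below relies only on the standard `Number.MAX_SAFE_INTEGER` constant and `Number.isSafeInteger` function; the variable name `limit` is chosen for this example.

```javascript
const limit = 2 ** 53;
console.log(limit); // 9007199254740992

// JavaScript defines its "safe" integer range as ending at 2^53 - 1.
console.log(limit - 1 === Number.MAX_SAFE_INTEGER); // true

// Beyond 2^53, not every integer is representable: 2^53 + 1 rounds back to 2^53.
console.log(limit + 1 === limit); // true

console.log(Number.isSafeInteger(limit - 1)); // true
console.log(Number.isSafeInteger(limit));     // false
```

JavaScript stops the safe range at 2^53 - 1 because 2^53 itself is also the result of rounding 2^53 + 1, so it can no longer be told apart from its neighbour.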
