Behind the Bits: How Floating-Point Numbers Represent Reality

As a full-stack developer, I work with floating-point numbers every day, in everything from game physics engines to data analysis scripts to financial calculation systems. Floating-point is the default way most programming languages represent non-integer values, and the IEEE 754 standard for floating-point arithmetic is supported by virtually every computer made in the last 30 years. But what's really going on under the hood when we use a float or double in our code? Let's take a deep dive into the bits and bytes of how our computers encode reality into floating-point numbers.

A Brief History of Floating-Point

The idea of using a form of scientific notation for computer number representations dates back to the early days of computing. In 1914, Leonardo Torres y Quevedo proposed using a pair of numbers, a mantissa and an exponent, to represent a wide range of values in electromechanical calculators.[^1] Various floating-point formats were used in early electronic computers in the 1940s-70s, but there was no standard between different computer architectures.

In 1985, the Institute of Electrical and Electronics Engineers (IEEE) published the IEEE 754-1985 standard for floating-point arithmetic.[^2] This standard defined binary32 (single precision) and binary64 (double precision) formats that could be used across different processor architectures, greatly improving the portability of floating-point code. A revision, IEEE 754-2008, added new formats (including half precision and decimal formats) and clarified many details; the standard was most recently revised as IEEE 754-2019.[^3] Today, nearly all processors and programming languages have adopted IEEE 754 as the default floating-point implementation.

Floating-Point Formats

IEEE 754 defines several different floating-point formats, but the two most commonly used are:

  • binary32 (single precision): 1 sign bit, 8 exponent bits, 23 mantissa bits
  • binary64 (double precision): 1 sign bit, 11 exponent bits, 52 mantissa bits

Here's a visual breakdown of how the bits are allocated in each format:

binary32 (float):
   3                   2                   1                   0
 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|S| Exp (8 bits)  | Mantissa/Significand (23 bits)              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

binary64 (double):
   6                   5                   4                   3                   2                   1                   0
 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|S| Exponent (11 bits)  | Mantissa/Significand (52 bits)                                                                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

In both formats, the sign bit S determines if the number is positive (0) or negative (1). The exponent field is an unsigned integer in biased notation, with a bias of 127 for binary32 and 1023 for binary64. The mantissa bits represent the fractional part of the normalized significand, with an implicit leading 1 bit.

For normal numbers (exponent field neither all 0s nor all 1s), the value of a floating-point number v is calculated as:

v = (-1)^sign * (1 + mantissa/2^m) * 2^(exponent - bias)

where m is the number of mantissa bits (23 for binary32, 52 for binary64).
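
To make the formula concrete, here's a small Python sketch that pulls the three binary32 fields out of a float's bit pattern with the standard struct module and reassembles the value. (The function name decode_binary32 is just for illustration, and this sketch only handles normal numbers, not zeros, subnormals, infinities, or NaNs.)

import struct

def decode_binary32(x):
    """Split a float into binary32 fields and reapply the value formula."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # raw 32-bit pattern
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF       # 8-bit biased exponent
    mantissa = bits & 0x7FFFFF           # 23-bit fraction field
    value = (-1) ** sign * (1 + mantissa / 2 ** 23) * 2 ** (exponent - 127)
    print(f"sign={sign} exponent={exponent} mantissa={mantissa:#08x} value={value}")

decode_binary32(6.5)  # sign=0 exponent=129 mantissa=0x500000 value=6.5

For 6.5 = 1.625 x 2^2, the biased exponent is 127 + 2 = 129 and the fraction 0.625 fills the top mantissa bits, exactly as the formula predicts.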

Language Support

Most programming languages have built-in support for IEEE 754 floating-point types, usually spelled float for binary32 and double for binary64:

  • C/C++: float, double (and long double for extended precision)
  • Java: float, double
  • Python: float (always double precision)
  • JavaScript: Number (always double precision)
  • C#: float, double
  • Go: float32, float64
  • Swift: Float, Double
  • Rust: f32, f64

Here's an example in Python showing some floating-point values:

>>> 1.0
1.0
>>> 1e100
1e+100
>>> 0.1 + 0.2
0.30000000000000004
>>> import sys
>>> sys.float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

Note that Python's float is always an IEEE 754 double; the language has no built-in single-precision type. We can see some of the limits of double precision in sys.float_info, like the max value of ~1.8 x 10^308. We also see the classic floating-point quirk: 0.1 + 0.2 is not exactly 0.3, because neither 0.1 nor 0.2 has an exact binary representation.

Special Values

IEEE 754 reserves some bit patterns for special values:

  • Positive infinity (+∞): Exponent all 1s, mantissa all 0s, sign 0
  • Negative infinity (-∞): Exponent all 1s, mantissa all 0s, sign 1
  • Not-a-Number (NaN): Exponent all 1s, mantissa not all 0s

Infinities arise when a result overflows the largest representable finite value, or when a nonzero number is divided by zero. NaNs represent the result of invalid operations like 0/0, ∞ - ∞, or √-1.

Most languages provide constants and functions for working with these special values:

>>> float('inf')
inf
>>> float('-inf')
-inf
>>> 1e1000
inf
>>> -1e1000
-inf
>>> float('nan')
nan
>>> float('inf') - float('inf')
nan
>>> import math
>>> math.isinf(1e1000)
True
>>> math.isnan(float('nan'))
True

(Note that Python raises ZeroDivisionError for 0 / 0 and 1 / 0 rather than returning NaN or infinity the way C's floating-point division does, so we construct the special values from literals instead.)
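
One more NaN quirk worth remembering: IEEE 754 defines NaN as unordered, so it compares unequal to every value, including itself. That self-inequality is the classic portable NaN test when no isnan function is handy:

>>> nan = float('nan')
>>> nan == nan
False
>>> nan != nan
True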

Subnormal Numbers

For very small numbers close to zero, IEEE 754 uses subnormal (or denormal) encodings to provide graceful underflow. In subnormal numbers, the implicit leading 1 bit of the mantissa is instead 0, and the exponent is fixed at its minimum value.

The smallest positive normal number in binary32 is 2^-126 ≈ 1.18 x 10^-38, but subnormals allow representing values as small as 2^-149 ≈ 1.4 x 10^-45 at reduced precision.[^4]
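
The same scheme applies to binary64 doubles, whose smallest normal value is 2^-1022 ≈ 2.2 x 10^-308 and whose subnormals extend down to 2^-1074 ≈ 4.9 x 10^-324, as a quick Python session shows:

>>> import sys
>>> sys.float_info.min
2.2250738585072014e-308
>>> 5e-324
5e-324
>>> 5e-324 / 2
0.0

Halving the smallest subnormal rounds to zero, the hard underflow that subnormals postpone but cannot eliminate.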

Here's an example in C showing the same graceful underflow for single-precision floats (output assumes the default IEEE rounding mode, with no flush-to-zero):

#include <stdio.h>
#include <float.h>

int main(void) {
    float f = FLT_MIN;  /* smallest positive normal float, 2^-126 */

    printf("Smallest normal float: %g\n", FLT_MIN);

    /* Repeated halving carries f down through the subnormal range. */
    for (int i = 0; i < 24; i++) {
        f /= 2;
        printf("%g\n", f);
    }
}

Smallest normal float: 1.17549e-38
5.87747e-39
2.93874e-39
1.46937e-39
7.34684e-40
3.67342e-40
1.83671e-40
9.18355e-41
4.59177e-41
2.29589e-41
1.14794e-41
5.73972e-42
2.86986e-42
1.43493e-42
7.17465e-43
3.58732e-43
1.79366e-43
8.96831e-44
4.48416e-44
2.24208e-44
1.12104e-44
5.60519e-45
2.8026e-45
1.4013e-45
0

Once f underflows past FLT_MIN, it enters the subnormal range and continues gracefully toward zero instead of underflowing abruptly. Each halving here happens to stay exact, since a power of two needs only a single significand bit, but subnormals in general carry fewer significant bits than normal numbers; precision degrades until the value finally rounds to zero below the smallest subnormal, 2^-149.

Performance Considerations

On most processors, double-precision operations are somewhat slower than single-precision ones. For example, in a throughput test on an Intel Core i7-6700 using the sqrt function from the C standard library, single precision reached 582 million operations per second (MOPS), while double precision reached only 374 MOPS, about 36% slower.[^5]

For performance-sensitive code, it can be advantageous to use single precision when the extra precision and range of double precision are not needed: halving the memory footprint also doubles how many values fit in caches and SIMD registers, which is why many graphics and gaming applications use float extensively. For most general-purpose code, however, double is usually recommended to avoid subtle overflow, underflow, and precision issues.
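
As a rough illustration (not a rigorous benchmark; it assumes NumPy is installed, and the exact numbers depend entirely on your machine), a quick script like this typically shows float64 square roots over large arrays taking noticeably longer than float32, partly because the double-precision array is twice as large in memory:

import timeit
import numpy as np

a32 = np.random.rand(10_000_000).astype(np.float32)
a64 = a32.astype(np.float64)  # same values at double precision

t32 = timeit.timeit(lambda: np.sqrt(a32), number=20)
t64 = timeit.timeit(lambda: np.sqrt(a64), number=20)
print(f"float32: {t32:.3f}s  float64: {t64:.3f}s")  # results vary by machine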

Real-World Floating-Point Issues

While IEEE 754 floating-point is generally very robust, the limitations of binary representation and rounding can occasionally cause subtle bugs.

A famous example is the Patriot missile defense system failure in Dhahran, Saudi Arabia in 1991 during the Gulf War. The Patriot's tracking software counted time in tenths of a second as an integer, then multiplied by a 24-bit truncated binary constant for 1/10 (a value with no exact binary representation) to convert to seconds. After about 100 hours of continuous operation, the accumulated truncation error reached roughly 0.34 seconds, putting the range gate about 600 meters off target and allowing an Iraqi Scud missile to slip through and hit a US Army barracks, killing 28 soldiers.[^6]
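
We can reproduce the arithmetic behind the failure in a few lines of Python. The per-tick error of 9.5e-8 comes from the GAO report, and the Scud closing speed of roughly 1,676 m/s is from published analyses; this is a back-of-envelope reconstruction, not the Patriot's actual code:

error_per_tenth = 9.5e-8     # truncation error in the 24-bit constant for 1/10
ticks = 100 * 60 * 60 * 10   # tenths of a second in 100 hours of uptime
drift = error_per_tenth * ticks
print(f"clock drift: {drift:.2f} s")         # ~0.34 s
print(f"range error: {drift * 1676:.0f} m")  # ~600 m at ~1676 m/s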

More recently, in 2020, the European Union Emissions Trading System (EU ETS) carbon market was temporarily suspended when prices started going negative because the exchange's trading software rounded very small negative prices to zero.[^7]

Careful numeric analysis and techniques like using decimal representations for money, rounding carefully, and preventing error accumulation are important for writing robust floating-point code. Libraries like NumPy and LAPACK provide extensively-tested routines for stable floating-point algorithms and linear algebra.
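
As a minimal sketch of the decimal-for-money advice, here's Python's built-in decimal module in action (the price and the 8.25% tax rate are invented for illustration):

from decimal import Decimal, ROUND_HALF_UP

# Binary floats cannot represent 0.10 exactly, so repeated addition drifts:
print(sum(0.10 for _ in range(1_000_000)))   # slightly above 100000.0

# Decimal keeps base-10 quantities exact and rounds only where we say so:
price = Decimal("19.99")
tax = (price * Decimal("0.0825")).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(price + tax)                           # 21.64

Because Decimal values carry exact base-10 digits, the rounding step happens once, explicitly, at the quantize call, instead of silently on every arithmetic operation.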

The Floating-Point Future

IEEE 754 floating-point has been incredibly successful and continues to be the dominant real number representation in computing. However, researchers continue to develop alternative schemas that address some of the limitations of traditional floating-point.

One promising alternative is the posit format, which dynamically trades off precision and range in a way that its designers claim provides better overall accuracy.[^8] Experimental posit arithmetic units have been built as extensions to the RISC-V ISA, and research is ongoing into applying posits in domains like weather modeling and deep learning.[^9]

Other approaches like arbitrary-precision floating-point, using decimal representations, or falling back to symbolic computation can be useful in domains like computer algebra systems and high-precision scientific simulations.

While IEEE 754 will likely remain dominant for the foreseeable future, it's an exciting time for computer arithmetic with new representations on the horizon that could improve the accuracy, speed, and robustness of how our machines model the real world.

Conclusion

Floating-point is a fascinating and complex topic that lies at the core of how computers represent real numbers. While the IEEE 754 standard provides a robust and widely-supported framework, floating-point numbers are not without their quirks and pitfalls.

As developers, having a solid understanding of how floating-point works under the hood can help diagnose and avoid subtle numeric bugs. By being mindful of the limitations of binary representations, carefully analyzing and testing numeric code, and leveraging existing floating-point libraries and tools, we can write more accurate and robust software.

At the same time, it's an exciting era for the evolution of computer arithmetic, with ongoing research into enhanced floating-point formats and alternative number representations that could power the next generation of scientific computing and machine learning.

The next time you use a float or double in your code, take a moment to marvel at the decades of research and engineering that have gone into those 32 or 64 bits, and the computational power they place at our fingertips. Happy floating-point hacking!

[^1]: Kahan, W. (1997). IEEE Standard 754 for Binary Floating-Point Arithmetic.

[^2]: IEEE 754-1985 – IEEE Standard for Binary Floating-Point Arithmetic. (1985). IEEE.

[^3]: IEEE 754-2008 – IEEE Standard for Floating-Point Arithmetic. (2008). IEEE.

[^4]: Muller, J. M. (2005). On the definition of ulp (x). ACM Transactions on Mathematical Software (TOMS), 31(3), 329-337.

[^5]: Alder, F., Struckmeier, A., & Reinders, J. (2019). C++ High Performance: Boost and optimize the performance of your C++ 17 code. Packt Publishing Ltd.

[^6]: GAO. (1992). Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia (GAO/IMTEC-92-26). U.S. General Accounting Office.

[^7]: Abnett, K. (2020). Bug causes negative prices in EU carbon market, prompting temporary halt. Reuters.

[^8]: Gustafson, J. L., & Yonemoto, I. T. (2017). Beating floating point at its own game: Posit arithmetic. Supercomputing Frontiers and Innovations, 4(2), 71-86.

[^9]: Klöwer, M., Düben, P. D., & Palmer, T. N. (2020). Posits as an alternative to floats for weather and climate models. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (pp. 1-13). IEEE.
