﻿ Handling Floating Point Numbers | Christoph Rüegg

Did you know, that:

• a double can represent 0xFFDFFFFFFFFFFFFF = 18437736874454810623 distinct numbers?
• there are 450359962736 doubles between 1.0001 and 1.0002?
• 54975581 between 10'000.0001 and 10'000.0002?
• 52 between 10'000'000'000.0001 and 10'000'000'000.0002?
• the smallest difference between neighbor numbers is 4.941E-324 for 0, and the biggest difference 1.996E+292 for double.MaxValue?
• the difference between 1.0 and the biggest number smaller than 1.0 is 1.110E-16
(this value is often called negative epsilon (2 = the (positive) epsilon), but not in .Net!* In the framework, double.Epsilon is defined as 4.941E-324, the number from above!)

It's common knowledge that floating point numbers are somewhat tricky to work with, especially if rounding-off errors and exact values become important. Although this is (hopefully) discussed in any computer science and engineering study course, there has been quite some blogging about it lately:

One of the conclusions is that it's dangerous to compare floating point numbers directly. Of course this is a problem in Math.NET Iridium, so I tried to add some helper functions to Iridium for handling double numbers. You'll find them in the new static Number class directly in the MathNet.Numerics namespace. The most important function, AlmostEquals, is inspired by Bruce Dawson's article (see above). They're propositions, so please let me know if you think they're flawed...

## Integer Representation of Doubles

In order to work with doubles in an exact way, we have to treat them as integers. We can convert them either by doing some unsafe pointer casting, by emulating a union (StructLayout and FieldOffset attributes), or simply by using the BitConverter (which uses the first pointer approach internally; System.Runtime.InteropServices namespace).

Unfortunately, floating point numbers are stored as signed magnitudes, while integers use the two's complement to represent negative numbers. We thus have to convert negative numbers if we need them to be lexicographically ordered (such that the distance from 0 to 1 equals to the distance from -1 to 0, and that -0 = +0):

 1: 2: 3: 4: 5: 6:  public static long SignedMagnitudeToTwosComplementInt64(long value) { return (value >= 0) ? value : (long)(0x8000000000000000 - (ulong)value); } 
 1: 2: 3: 4: 5: 6: 7: 8:  public static ulong ToLexicographicalOrderedUInt64(double value) { long signed64 = BitConverter.DoubleToInt64Bits(value); ulong unsigned64 = (ulong)signed64; return (signed64 >= 0) ? unsigned64 : 0x8000000000000000 - unsigned64; } 

The expected behavior (excerpt from the unit tests):

 1: 2: 3: 4: 5: 6: 7: 8: 9:  ToLexicographicalOrderedUInt64(2 * double.Epsilon) == 2 ToLexicographicalOrderedUInt64(1 * double.Epsilon) == 1 ToLexicographicalOrderedUInt64(0.0) == 0 ToLexicographicalOrderedUInt64(-1 * double.Epsilon) == 0xFFFFFFFFFFFFFFFF SignedMagnitudeToTwosComplementInt64(1) == 1 SignedMagnitudeToTwosComplementInt64(0) == 0 SignedMagnitudeToTwosComplementInt64(-9223372036854775808) == 0 SignedMagnitudeToTwosComplementInt64(-9223372036854775808 + 1) == -1 

## Incrementing Doubles

Thanks to this two's complement integer representation we immediately have the power to increment or decrement floating point numbers, and iterate through all numbers a double can represent in a given range:

  1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16:  public static double Increment(double value) { if(double.IsInfinity(value) || double.IsNaN(value)) return value; long signed64 = BitConverter.DoubleToInt64Bits(value); if(signed64 < 0) signed64--; else signed64++; if(signed64 == -9223372036854775808) // = "-0", make it "+0" return 0; value = BitConverter.Int64BitsToDouble(signed64); return double.IsNaN(value) ? double.NaN : value; } 

Again an excerpt of the expected behavior:

  1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15:  double x = double.Epsilon; x = Decrement(x); --> x == 0 x = Decrement(x); --> x == -double.Epsilon x = Increment(x); --> x == 0 x = Increment(x); x = Increment(x); --> x == 2 * double.Epsilon x = double.MaxValue; x = Decrement(x); --> x < double.MaxValue x = Increment(x); --> x == double.MaxValue 

## Evaluating the Precision: Epsilon

The incrementation step depends on the exponent, so it may differ between different numbers. It is important for example in iterative approximation algorithms to know when to stop. The following function returns the decrementation step near a given number, often referred to as a scaled negative epsilon:

  1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16:  public static double EpsilonOf(double value) { if(double.IsInfinity(value) || double.IsNaN(value)) return double.NaN; long signed64 = BitConverter.DoubleToInt64Bits(value); if(signed64 == 0) { signed64++; return BitConverter.Int64BitsToDouble(signed64) - value; } else if(signed64-- < 0) return BitConverter.Int64BitsToDouble(signed64) - value; else return value - BitConverter.Int64BitsToDouble(signed64); } 

And some examples:

 1: 2: 3: 4: 5:  EpsilonOf(1.0).ToString() == "1.11022302462516E-16" EpsilonOf(0.0).ToString() == "4.94065645841247E-324" EpsilonOf(-1.0e+100).ToString() == "1.94266889222573E+84" EpsilonOf(1.0e-100).ToString() == "1.26897091865782E-116" EpsilonOf(double.MinValue).ToString() == "1.99584030953472E+292" 

## Numbers between two doubles

It may be interesting to know how many numbers can be represented by a double variable, which are all between two given numbers. Using the methods introduced above this is simple:

 1: 2: 3: 4: 5: 6: 7: 8: 9:  public static ulong NumbersBetween(double a, double b) { // left out (see repository): handling NaN and Infinity ulong ua = ToLexicographicalOrderedUInt64(a); ulong ub = ToLexicographicalOrderedUInt64(b); return (a >= b) ? ua - ub : ub - ua; } 

Examples:

 1: 2: 3: 4: 5: 6: 7: 8:  NumbersBetween(1.0, 1.0) == 0 NumbersBetween(0, double.Epsilon) == 1 NumbersBetween(-double.Epsilon, 2 * double.Epsilon) == 3 double test = Math.PI * 1e+150; NumbersBetween(test, test + 10 * Number.EpsilonOf(test) == 10 NumbersBetween(1.0001, 1.0002) == 450359962737 NumbersBetween(10000000000.0001, 10000000000.0002) == 53 NumbersBetween(double.MinValue, double.MaxValue) == 18437736874454810622 

## Compare doubles for (almost) equality

Now we come to the final but most important method, for a safe and more sensible way of comparing two floating point numbers for almost equality. Call this method with two doubles and a parameter specifying how many numbers may be between the two numbers at most (+1):

  1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17:  public static bool AlmostEqual(double a, double b, int maxNumbersBetween) { // left out (see repository): argument checks // NaN's should never equal to anything if(double.IsNaN(a) || double.IsNaN(b)) //(a != a || b != b) return false; if(a == b) return true; // false, if only one of them is infinity or they differ on the infinity sign if(double.IsInfinity(a) || double.IsInfinity(b)) return false; ulong between = NumbersBetween(a, b); return between <= (uint)maxNumbersBetween; } 

Examples:

  1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11:  Convert.ToDouble("3.170404") == 3.170404 --> TRUE Convert.ToDouble("4.170404") == 4.170404 --> FALSE Number.AlmostEqual(Convert.ToDouble("3.170404"), 3.170404, 0) --> TRUE Number.AlmostEqual(Convert.ToDouble("4.170404"), 4.170404, 0) --> FALSE Number.AlmostEqual(Convert.ToDouble("4.170404"), 4.170404, 1) --> TRUE Number.AlmostEqual(0.0, 0.0 + double.Epsilon, 0) --> FALSE Number.AlmostEqual(0.0, 0.0 + double.Epsilon, 1) --> TRUE double max = double.MaxValue; Number.AlmostEqual(max, max - 2 * Number.EpsilonOf(max), 0) --> FALSE Number.AlmostEqual(max, max - 2 * Number.EpsilonOf(max), 1) --> FALSE Number.AlmostEqual(max, max - 2 * Number.EpsilonOf(max), 2) --> TRUE 
Multiple items
val double : value:'T -> double (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.ExtraTopLevelOperators.double

--------------------
type double = System.Double

Full name: Microsoft.FSharp.Core.double
val max : e1:'T -> e2:'T -> 'T (requires comparison)

Full name: Microsoft.FSharp.Core.Operators.max