Handling Floating Point Numbers
Did you know, that:
- a double can represent 0xFFDFFFFFFFFFFFFF = 18437736874454810623 distinct numbers?
- there are 450359962736 doubles between 1.0001 and 1.0002?
- 54975581 between 10'000.0001 and 10'000.0002?
- 52 between 10'000'000'000.0001 and 10'000'000'000.0002?
- the smallest difference between neighbor numbers is 4.941E-324 for 0, and the biggest difference 1.996E+292 for double.MaxValue?
-
the difference between 1.0 and the biggest number smaller than 1.0 is 1.110E-16
(this value is often called negative epsilon (2 = the (positive) epsilon), but not in .Net!* In the framework, double.Epsilon is defined as 4.941E-324, the number from above!)
It's common knowledge that floating point numbers are somewhat tricky to work with, especially if rounding-off errors and exact values become important. Although this is (hopefully) discussed in any computer science and engineering study course, there has been quite some blogging about it lately:
- Floating-Point Numbers (Wesner Moise)
- .999... = 1 (Wesner Moise)
- The trouble with rounding floating point numbers (Dan Clarke)
- Quiz of the Month: Double Trouble (Kathy Kam)
- Why you can't Double.Parse(double.MaxValue.ToStr... (Scott Hanselman)
- Floating Point Arithmetic (Wesner Moise)
- Numbers in .NET (Wesner Moise)
- Binary floating point and .NET
- Comparing floating point numbers (Bruce Dawson)
- What Every Computer Scientist Should Know About Floating-Point Arithmetic (PDF)
One of the conclusions is that it's dangerous to compare floating point numbers directly. Of course this is a problem in Math.NET Iridium, so I tried to add some helper functions to Iridium for handling double numbers. You'll find them in the new static Number class directly in the MathNet.Numerics namespace. The most important function, AlmostEquals, is inspired by Bruce Dawson's article (see above). They're propositions, so please let me know if you think they're flawed...
Integer Representation of Doubles
In order to work with doubles in an exact way, we have to treat them as integers. We can convert them either by doing some unsafe pointer casting, by emulating a union (StructLayout and FieldOffset attributes), or simply by using the BitConverter (which uses the first pointer approach internally; System.Runtime.InteropServices namespace).
Unfortunately, floating point numbers are stored as signed magnitudes, while integers use the two's complement to represent negative numbers. We thus have to convert negative numbers if we need them to be lexicographically ordered (such that the distance from 0 to 1 equals to the distance from -1 to 0, and that -0 = +0):
1: 2: 3: 4: 5: 6: |
|
1: 2: 3: 4: 5: 6: 7: 8: |
|
The expected behavior (excerpt from the unit tests):
1: 2: 3: 4: 5: 6: 7: 8: 9: |
|
Incrementing Doubles
Thanks to this two's complement integer representation we immediately have the power to increment or decrement floating point numbers, and iterate through all numbers a double can represent in a given range:
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: |
|
Again an excerpt of the expected behavior:
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: |
|
Evaluating the Precision: Epsilon
The incrementation step depends on the exponent, so it may differ between different numbers. It is important for example in iterative approximation algorithms to know when to stop. The following function returns the decrementation step near a given number, often referred to as a scaled negative epsilon:
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: |
|
And some examples:
1: 2: 3: 4: 5: |
|
Numbers between two doubles
It may be interesting to know how many numbers can be represented by a double variable, which are all between two given numbers. Using the methods introduced above this is simple:
1: 2: 3: 4: 5: 6: 7: 8: 9: |
|
Examples:
1: 2: 3: 4: 5: 6: 7: 8: |
|
Compare doubles for (almost) equality
Now we come to the final but most important method, for a safe and more sensible way of comparing two floating point numbers for almost equality. Call this method with two doubles and a parameter specifying how many numbers may be between the two numbers at most (+1):
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: |
|
Examples:
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: |
|
val double : value:'T -> double (requires member op_Explicit)
Full name: Microsoft.FSharp.Core.ExtraTopLevelOperators.double
--------------------
type double = System.Double
Full name: Microsoft.FSharp.Core.double
Full name: Microsoft.FSharp.Core.Operators.max