Original URL: https://www.theregister.com/2006/08/12/floating_point_approximation/

The trouble with rounding floating point numbers

So, Prudence, computers do make mistakes…

By Dan Clarke

Posted in Software, 12th August 2006 09:55 GMT

We all know of floating point numbers, so much so that we reach for them each time we write code that does math. But do we ever stop to think what goes on inside that floating point unit and whether we can really trust it?

I hate to cast aspersions on its good name, but when I hear stories of spacecraft crashing, inconsistent information on bank statements and pensioners being short-changed (all of which have happened: see, for example, Risks Digest entries here and here), I start to realise that there is a real danger of misusing floating point numbers. Indeed, anyone with a few years of experience under their belt will probably have either had the pleasure of dealing with a floating-point related bug or watched a colleague slowly go crazy over one.

Often, the underlying cause of such problems falls into a common category: a division by zero or a narrowing conversion that loses information. At other times, however, it's not so evident – sometimes the cause is the futile attempt of a software developer to round a floating-point number.

That's right, one of the most basic operations in math, a thing that we learn to do before we can ride a bike, eludes the combined efforts of the finest engineers over the last 30 years. Of course, this is something that is intuitively nonsensical - why should it be impossible to round a floating-point number reliably?

To understand why, we need to understand a little about how floating point numbers are represented internally under the IEEE-754 specification. The key thing to recognize is that floating point types do not represent most numbers exactly. The representable values do not form a continuous range; instead, each value is stored as a power of two (the exponent) multiplied by a sum of terms chosen from a series of negative powers of two (such as, for example, 1/2^1 + 1/2^2 + 1/2^3 + … + 1/2^23). This means that there are gaps in the range of floating point numbers: the difference between any two adjacent floating point numbers is determined by the smallest element in the series (1/2^23 in the example), scaled by the exponent.
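The gaps can be made visible with a few lines of C. This is only a sketch: it assumes that `float` is an IEEE-754 single-precision type, and `step_is_lost` and `show_gaps` are names invented for the illustration.

```c
#include <float.h>
#include <stdio.h>

/* Returns 1 if adding `step` to `base` leaves the stored float unchanged,
   i.e. the step is too small to reach the next representable value. */
int step_is_lost(float base, float step)
{
    volatile float sum = base + step;   /* volatile: force rounding to float */
    return sum == base;
}

void show_gaps(void)
{
    /* Just above 1.0 the gap between neighbouring floats is FLT_EPSILON. */
    printf("gap just above 1.0: %g\n", FLT_EPSILON);
    printf("1.0 + gap/4 lost?   %d\n", step_is_lost(1.0f, FLT_EPSILON / 4));
    printf("1.0 + gap lost?     %d\n", step_is_lost(1.0f, FLT_EPSILON));
    /* At 2^24 the gap has grown to 2.0, so adding 1.0 changes nothing. */
    printf("16777216 + 1 lost?  %d\n", step_is_lost(16777216.0f, 1.0f));
}
```

Calling `show_gaps()` from `main` should report that a quarter of the gap disappears when added to 1.0, while a full gap survives. The same absolute step that matters near 1.0 is far too small to register near 16777216, because the gap scales with the exponent.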

So what happens if we have a number that falls into such a gap? Well, the system will choose a floating point number that has a close value. For example, the real number ‘.37’ cannot be represented exactly by the series described above so, if you assign this number to a float, the value stored could actually be ‘0.370000004’. This can be seen easily if we write a simple program that prints a floating point value to a lot of decimal places.

 Example 1: showing approximation error.

// some code to print a floating point number to a lot of 
// decimal places
#include <stdio.h>

int main()
{
    float f = .37f;
    printf("%.20f\n", f);
    return 0;
}

We call the difference between the real value of a number and the value actually stored in the floating point variable an approximation error. This error is a fundamental property of floats, and with regard to rounding there are two important points. Firstly, the approximation can be either under or over the exact number; imagine, for example, that .37 is represented as .369999998 rather than .370000004. Secondly, for a given value the approximation error isn't always the same; it depends on how the value was calculated. This means we cannot predict whether the approximation will be above or below the real number.
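To see the second point in action, here is a small sketch that reaches the real number 0.3 by two routes. It uses `double` rather than `float` and assumes IEEE-754 double-precision arithmetic; on such systems the literal 0.3 is stored slightly below the real value, while the sum 0.1 + 0.2 lands slightly above it. The function names are made up for the example.

```c
#include <stdio.h>

/* Returns +1, 0 or -1 according to whether 0.1 + 0.2, computed at run
   time, compares above, equal to or below the literal 0.3. */
int compare_sum_to_literal(void)
{
    volatile double a = 0.1, b = 0.2;   /* volatile: keep the sum a plain double */
    volatile double sum = a + b;
    double literal = 0.3;
    return (sum > literal) - (sum < literal);
}

void show_two_routes(void)
{
    volatile double a = 0.1, b = 0.2;
    volatile double sum = a + b;
    /* Same real number on paper, two different stored approximations. */
    printf("0.1 + 0.2 -> %.20f\n", sum);
    printf("0.3       -> %.20f\n", 0.3);
}
```

Both values are meant to be 0.3, yet they are stored as different doubles, one on each side of the real number. Code that rounds them has no way of telling which side it has been given.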

So now that we've seen this approximation error, we can start to think about how rounding works, and we'll spot the fundamental incompatibility. A typical rounding method works by looking at a given digit: if it's above or equal to a threshold we round up, otherwise we round down. But when the digit concerned is part of a number that contains an approximation error, we have to consider whether the approximation error has changed the value of the digit used for rounding. Unfortunately, there is no easy way to know this, which means that we are applying an algorithm that requires exact digits to detect boundaries to an approximated number. The following example shows the worst case: the approximation error changes the value of the digit used for rounding, and we get a different rounded number as a result.

 Example 2: showing the failure mode of rounding floats.

// doRateCalc is a function that takes a euro price and a percent margin 
// and returns the result of applying margin to the price.  In the call 
// below the result should be “66.375” – Eur59.00 + 12.5% of 59
void test() {
    // As it’s a price we’re interested in 2 decimal places. Our 
    // rounding algorithm will look at third decimal place and round 
    // according to its value.  We expect this to round 66.375 to a 
    // result of Eur66.38
    float price = doRateCalc("59.00", "12.5"); 

    // However: our value is represented approximately and could be 
    // anywhere in the range of, let's say, 66.3749999999999 to
    // 66.37500000000000000001.  If the approximation is at the bottom
    // of this range then we will round to 66.37 otherwise we will round
    // to 66.38
}
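For a version of this failure that can actually be compiled and run, the sketch below swaps in a different amount, 1.005, and uses `double` with one common rounding idiom (scale by 100, add a half, truncate). None of this comes from the original example: `round_to_cents` is a hypothetical helper, it only handles non-negative amounts, and the behaviour described assumes IEEE-754 doubles, where 1.005 is stored slightly below its real value.

```c
#include <stdio.h>

/* A typical "round half up" to whole cents for non-negative amounts:
   scale by 100, add one half, truncate. */
long round_to_cents(double amount)
{
    volatile double scaled = amount * 100.0;  /* volatile: round to double here */
    return (long)(scaled + 0.5);
}

void show_rounding_failure(void)
{
    /* On paper 1.005 rounds to 1.01, i.e. 101 cents. */
    printf("1.005 -> %ld cents\n", round_to_cents(1.005));
    /* 1.125 is exactly representable in binary, so it rounds as expected. */
    printf("1.125 -> %ld cents\n", round_to_cents(1.125));
}
```

Under those assumptions the first call comes back as 100 cents rather than the 101 a pen-and-paper calculation gives, because the scaled value sits just below 100.5. The second call gives 113, since 1.125 carries no approximation error at all.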

So, imagine that your financial or scientific application is out in the real world and a customer phones up wondering why the calculations in your software are yielding marginally different results to the ones he's doing the old-fashioned way (with pen and paper).

Given that scientists are generally quite fussy and that banks are heavily regulated on questions of rounding and precision, you can easily imagine that this customer is going to get progressively less polite as time goes on. To make things more interesting, these problems are not the easiest to debug either, since printing or inspecting the values will generally show you an interpretation that hides the real problem.
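A quick sketch of how an innocent-looking print can hide the problem, reusing the .37 value from Example 1. The helper `prints_as` is invented for the illustration, and the comments assume an IEEE-754 `float`.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if f, formatted to two decimal places, reads as `expected`. */
int prints_as(float f, const char *expected)
{
    char buf[32];
    snprintf(buf, sizeof buf, "%.2f", f);
    return strcmp(buf, expected) == 0;
}

void show_hidden_error(void)
{
    float f = 0.37f;
    /* The short form looks exact... */
    printf("reads as 0.37?         %d\n", prints_as(f, "0.37"));
    /* ...but the long form exposes the stored approximation. */
    printf("long form:             %.20f\n", f);
    printf("equals 0.37 as double? %d\n", (double)f == 0.37);
}
```

The two-decimal output is what you are likely to see in a log or a report, which is why the bug tends to stay hidden until someone prints far more digits than the application ever displays.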

Eventually you'll turn to looking at the value in binary format and after a bit of revision on the floating point representation, you’ll spot the problem. And at this point, the guy next to you will shout, “This is easy! We’ll just add a really small number to make sure that our floating point approximation is always above the real number and never below it”; and this will seem like the most sensible thing in the world for a while…

But the catch is that if the result genuinely has digits at or beyond the decimal place of the ‘tweaker’ that is added, then the problem remains: a value that really does lie just below the rounding boundary can be pushed over it. As results with lots of decimal places are quite common (for example, simply divide 10 by 7), we can see that this isn’t a real solution.

 Example 3: showing “failure mode” when using tweakers.

    void showTweakerProblem()
    {
        // it doesn't really matter what value we choose for tweaker here
        // - typically you see values chosen much smaller than this.
        // The key thing is that we're going to apply it to a floating point
        // number with as many decimal places as the tweaker so if you want to 
        // pick a smaller value, adjust the floating value appropriately to 
        // see the same problem
        float const tweaker = 0.0000000005;
        float val = doCalc();

        // here we imagine that val is an approximation of .7749999995 and we 
        // want to round to 2 decimal places for a result of .77. We add the 
        // tweaker to get either '.774999999<lots of 9s>' or '.775000000<lots of 0s>'
        // and then we round to get either '.77' or '.78'
        val += tweaker;
    }
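The same failure can be reproduced in a runnable form. The following is a sketch under assumptions of my own choosing: it uses `double`, a deliberately large tweaker of 0.000001 and a value of 2.6749999 so that the effect sits well clear of any representation noise. `round_to_cents_tweaked` is a hypothetical helper that only handles non-negative amounts.

```c
#include <stdio.h>

/* Scale-add-half-truncate rounding to whole cents (non-negative amounts
   only), with a tweaker added to the amount first. */
long round_to_cents_tweaked(double amount, double tweaker)
{
    volatile double scaled = (amount + tweaker) * 100.0;
    return (long)(scaled + 0.5);
}

void show_tweaker_failure(void)
{
    double tweaker = 0.000001;   /* hypothetical choice of tweaker */
    double val = 2.6749999;      /* a result that genuinely lies below 2.675 */

    /* Without the tweaker the value rounds down, as it should. */
    printf("no tweaker  : %ld cents\n", round_to_cents_tweaked(val, 0.0));
    /* With the tweaker it is pushed over the boundary and rounds up. */
    printf("with tweaker: %ld cents\n", round_to_cents_tweaked(val, tweaker));
}
```

Here the correct answer is 267 cents, and the tweaked version returns 268. Shrinking the tweaker only moves the problem to values with more decimal places; it does not remove it.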

Then, the real voodoo suggestions start to arrive. People will talk of using doubles instead of floats, truncating values before applying the tweaker, truncating values after applying the tweaker, truncating values both before and after applying the tweaker; none of which actually resolve the above problem.

Fundamentally, if you're using floats you’re using an approximate representation and you’ve lost a small amount of information on the exact value of the number. As rounding is a process that requires exact values, there simply isn’t a silver bullet solution – see the guide “What Every Computer Scientist Should Know About Floating-Point Arithmetic” here, if you want a lot more deeply technical detail on this issue.

So if this is a real problem, “why haven’t people noticed it before?”, I hear you say. Well, at the start of the article we mentioned a story of pensions being miscalculated (see here and here, again, for several examples), so people are noticing it, and it is even embarrassing The Great and Good on occasion.

More important, perhaps, how do we avoid this problem? In a follow-up article I'm going to look at a different way of doing arithmetic (decimal arithmetic), used by lots of real world software, that avoids the approximation error in floats. ®