Floating point numbers - what else can be done? • The Register

Original URL: https://www.theregister.com/2006/09/20/floating_point_numbers_2/

Floating point numbers - what else can be done?

Avoiding errors

Posted in Software, 20th September 2006 15:46 GMT

Column In a recent article here in The Register we saw some of the problems that result when floating point numbers are misused or chosen inappropriately.

Many people wrote in to say they had seen first hand some of the voodoo techniques we decried, so clearly we're in the midst of a numerical calculation crisis and, if we don't do something, there are going to be satellites falling from the skies around us - in itself undesirable, but so much more so when the satellite in question is the one the TV channels depend on.

In this article we're going to look at other ways of handling real numbers, including some upcoming extensions to the C and C++ languages that could well see floats become much less used.

To recap briefly, last time we looked at the approximation error in floating point numbers that results because floats and doubles represent real numbers as a fraction over 2ⁿ. As we humans have 10 fingers, and we reserve the right to lay the foundations of our number system on such anatomical considerations, the values we deal with in software will often be some fraction over 10ⁿ, for example .37 is 37 over 10². Because there is no way to express this number exactly in a base-2 floating point format, there was a small approximation error and we saw that this small approximation error turned into a big error when we tried to round and convert back to a base-10 real number.
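The approximation is easy to see for yourself. The sketch below uses Python and its standard decimal module purely for illustration (neither appears in the original discussion) to reveal the value a binary double really holds when we write 0.37.

```python
from decimal import Decimal

# Decimal(float) converts the binary double exactly, digit for digit,
# so it shows the value that is really stored for the literal 0.37.
stored = Decimal(0.37)

# Built from text, so this one is exactly 37 over 10^2.
intended = Decimal("0.37")

print(stored)               # a long expansion close to, but not equal to, 0.37
print(stored == intended)   # False

# The gap is tiny, which is exactly why it is so easy to overlook.
approximation_error = abs(stored - intended)
print(approximation_error < Decimal("1e-16"))   # True
```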

This time round we're going to see what the methods are for avoiding this type of error. The comments made about the first article suggested many approaches, so we're going to weigh up the pros and cons of each. The main contenders are fixed point numbers, rational numbers, and base-10 floating point numbers.

The idea behind fixed point numbers, also known as scaled integers, is to fix a precision at the outset and use it consistently for all the operations involving a particular type of value. Take working with dollars and cents as an example. Instead of using a floating point number to represent the value '$1.37' we would use an integral number to hold the value '137' and remember that the value has an implicit scaling factor of 10^-2.

The advantage of this approach is its simplicity; we can use native data types and the integral operations built into our hardware so the storage is efficient and the calculations are fast.
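As a rough sketch of the dollars and cents case, the Python below holds every amount as a whole number of cents. The names SCALE, to_cents and format_cents are made up for this illustration, and the code assumes well-formed, non-negative amounts.

```python
SCALE = 100  # hypothetical name: 10^2, so every amount is a count of cents

def to_cents(text: str) -> int:
    """Parse a non-negative amount such as '1.37' into integer cents."""
    dollars, _, cents = text.partition(".")
    cents = (cents + "00")[:2]      # pad or cut to exactly two places
    return int(dollars) * SCALE + int(cents)

def format_cents(amount: int) -> str:
    """Render a non-negative number of cents as dollars and cents."""
    dollars, cents = divmod(amount, SCALE)
    return f"{dollars}.{cents:02d}"

price = to_cents("1.37")                # 137
total = price * 3 + to_cents("0.10")    # plain integer arithmetic throughout
print(price, format_cents(total))       # 137 4.21
```

The scaling factor only matters at the edges, when parsing input and formatting output; everything in between is ordinary integer arithmetic.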

However, the problem with this approach is its inflexibility. The least significant place is chosen early in a project and it's difficult to change afterwards. If calculations result in numbers more precise than the representation then the extra precision is lost by truncation. Such errors accumulate and while steps can be taken to reduce them they are inhibited by encapsulation across function and class boundaries. Because flexibility and extensibility are important in software architecture, this is probably a sufficiently severe shortcoming to render this attractively simple solution unusable in many cases.
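To illustrate the truncation problem, here is a small Python sketch that splits a dollar three ways in integer cents. The function names are invented for the example; the second function shows one common way of limiting the loss, not the only one.

```python
def split_truncating(amount_cents: int, parts: int) -> list[int]:
    """Naive split: integer division throws away the fraction of a cent."""
    return [amount_cents // parts] * parts

def split_with_remainder(amount_cents: int, parts: int) -> list[int]:
    """Hand the leftover cents out one at a time so that nothing is lost."""
    share, leftover = divmod(amount_cents, parts)
    return [share + 1 if i < leftover else share for i in range(parts)]

naive = split_truncating(100, 3)          # [33, 33, 33]
careful = split_with_remainder(100, 3)    # [34, 33, 33]
print(sum(naive), sum(careful))           # 99 100
```

Note that the careful version only works because the caller can see all three shares at once. Hide the division behind a function or class boundary and the missing cent has nowhere to go.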

Rational numbers

How about rational numbers? Most of the numbers we deal with can be expressed as a fraction or ratio, so why not store these numbers as a numerator and a denominator so that we represent them accurately?

We're all familiar with the math for manipulating fractions; for addition and subtraction you rewrite both sides to have a common denominator and for multiplication and division you multiply numerators and denominators. There are some drawbacks with this approach, however. Firstly, as numerator and denominator are stored separately they are calculated separately. This means there will be twice as many operations per calculation as there are with floats.

Secondly, numerators and denominators can get big very quickly as calculations are performed. This means there needs to be an overflow protection that will factorise the numerator and denominator to make their values smaller as necessary. As this factorisation is not always possible, rational numbers can overflow.

Thirdly, any expressions involving addition, subtraction or comparison are going to have to determine lowest common denominators.
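The fraction arithmetic and the reduction step can be sketched in a few lines of Python. The Rational class here is a hypothetical minimal type written for this illustration, not a library API. One caveat: Python integers grow without limit, so the overflow described above would only appear with fixed-width integers; what the sketch does show is how quickly the denominator grows.

```python
from math import gcd

class Rational:
    """Hypothetical minimal rational type, for illustration only."""

    def __init__(self, num: int, den: int = 1):
        if den == 0:
            raise ZeroDivisionError("denominator must not be zero")
        if den < 0:                  # keep the sign on the numerator
            num, den = -num, -den
        g = gcd(num, den)            # cancel the common factors
        self.num, self.den = num // g, den // g

    def __add__(self, other: "Rational") -> "Rational":
        # a/b + c/d = (a*d + c*b) / (b*d); the constructor then reduces
        return Rational(self.num * other.den + other.num * self.den,
                        self.den * other.den)

    def __mul__(self, other: "Rational") -> "Rational":
        return Rational(self.num * other.num, self.den * other.den)

    def __eq__(self, other) -> bool:
        return (self.num, self.den) == (other.num, other.den)

    def __repr__(self) -> str:
        return f"{self.num}/{self.den}"

print(Rational(1, 10) + Rational(2, 10))     # 3/10
print(Rational(37, 100) * Rational(2, 1))    # 37/50

# Add up 1/1 + 1/2 + ... + 1/20 and watch the denominator grow.
total = Rational(0)
for k in range(1, 21):
    total = total + Rational(1, k)
print(total)
```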

A final point - it's my impression that most interfaces use real numbers - when was the last time the store had a can of soda at $37/100? So at least in the presentation there are going to be extra conversions going from the rational format back to the real number. So this is an approach that works, and for the languages that don't incorporate rational number types there are almost certainly libraries available that are easy to understand.

In terms of performance, rational numbers may not compare very well to floating points and in terms of storage they'll be twice the size for a similar range. Rational numbers are also only suitable for fractions and not every number can be represented this way (e.g. √2). That said, if you want to see more, for C++ there is a Boost implementation of rational numbers available (the Boost.Rational library).
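For a feel of what using a ready-made rational type looks like, here is a sketch with Python's standard fractions module, swapped in for the C++ library simply because it needs no installation. The as_decimal_text helper is a made-up name showing the extra conversion step needed for presentation.

```python
from decimal import Decimal
from fractions import Fraction

price = Fraction(37, 100)     # held exactly as 37/100
print(price * 3)              # 111/100
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))   # True

def as_decimal_text(value: Fraction, places: int = 2) -> str:
    """Convert a fraction to a rounded decimal string for display."""
    quantum = Decimal(10) ** -places      # Decimal('0.01') for two places
    exact = Decimal(value.numerator) / Decimal(value.denominator)
    return str(exact.quantize(quantum))

print(as_decimal_text(price * 3))         # 1.11
print(as_decimal_text(Fraction(1, 3)))    # 0.33
```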

Base-10 floating point numbers

Which brings us to base-10 floating point numbers. If we remember from the previous article, the problem of large errors came from the small approximation errors that arose when base-10 real numbers were converted to the form x/2^y. Therefore, it seems that if we could instead represent our number as x/10^y then there would be no conversion to a base-2 format and consequently no approximation error. And because there's no conversion back to a base-10 number there are no large errors arising as described in the last article.
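The x/10^y idea can be tried out with Python's standard decimal module, used here only as a convenient software implementation of base-10 floating point. The sketch shows that .37 is stored as the digits 3 and 7 with an exponent of -2, and that a sum which trips up binary floats comes out exact.

```python
from decimal import Decimal

value = Decimal("0.37")
sign, digits, exponent = value.as_tuple()
print(digits, exponent)     # (3, 7) -2, in other words 37 over 10^2

# With binary floats each literal is approximated, so the sum misses.
print(0.1 + 0.2 == 0.3)     # False

# With decimal floats each literal is held exactly, so the sum hits.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))    # True
```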

The IEEE floating point standard currently undergoing revision allows for base-10 to be used, so this idea has been around for a long time. That said, current floating point hardware tends not to support the base-10 mode. The reason is that using base-10 implies using binary coded decimal, a number representation format about 20 per cent less storage efficient than base-2. This is because in general binary coded decimal uses four bits per decimal digit; in base-2 these four bits can represent 16 distinct values whereas in the same space BCD can represent only, well, 10 distinct values. There are schemes that reduce the amount of redundant space, but not to the efficiency of base-2 and these schemes also render calculations more computationally expensive.
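A short Python sketch of plain binary coded decimal makes the storage cost concrete. The to_bcd and from_bcd names are invented for this example, and the efficiency figure is a simple information estimate (the bits a decimal digit actually needs, divided by the four it gets), which lands in the same region as the rough 20 per cent quoted above.

```python
import math

def to_bcd(n: int) -> int:
    """Pack a non-negative integer into BCD, one decimal digit per 4 bits."""
    result, shift = 0, 0
    while True:
        n, digit = divmod(n, 10)
        result |= digit << shift
        shift += 4
        if n == 0:
            return result

def from_bcd(bcd: int) -> int:
    """Unpack a BCD value back into an ordinary integer."""
    result, place = 0, 1
    while bcd:
        result += (bcd & 0xF) * place
        bcd >>= 4
        place *= 10
    return result

print(hex(to_bcd(137)))     # 0x137: each hex digit holds one decimal digit
print(from_bcd(0x137))      # 137

# A decimal digit carries log2(10) bits of information but occupies 4.
efficiency = math.log2(10) / 4
print(round(efficiency, 2))  # 0.83
```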

So what does using base-10 floating point actually mean? Well, as there effectively isn't a standard governing decimal floats we can't say exactly, but to give us an idea let's stick with the structure of the IEEE base-2 float.

The representation for a 32-bit decimal float is of the form 10^exp * 1.n where n is a sum of decimal terms d1 * 10^-1 + d2 * 10^-2 + … + d5 * 10^-5, each d being a single decimal digit. You may remember that the equivalent binary sequence continued to 2^-23; the difference arises because each decimal term takes four bits, so five terms use 20 bits.

The IEEE representation allows 23 bits for the mantissa and, although there are three bits left over, these aren't adequate to represent a BCD digit so we don't have a 10^-6 term and we can represent six digit numbers in our decimal float. This compares to seven or eight useful digits of a decimal number when we use the 32-bit binary float representation, so the difference in storage efficiency is easily seen.
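The digit counting can be checked with a few lines of arithmetic, written here in Python. The constant names are made up for the example, and the binary figure is the usual rule-of-thumb estimate (24 significant bits times log10 of 2) rather than a guarantee for every value.

```python
import math

MANTISSA_BITS = 23        # mantissa width of the IEEE 32-bit float
BITS_PER_BCD_DIGIT = 4

bcd_digits = MANTISSA_BITS // BITS_PER_BCD_DIGIT    # 5 whole BCD digits fit
spare_bits = MANTISSA_BITS % BITS_PER_BCD_DIGIT     # 3 bits are left over
decimal_float_digits = 1 + bcd_digits               # 6, counting the leading digit

# Binary: 23 stored bits plus the implicit leading 1 gives 24 significant bits.
binary_digits = (MANTISSA_BITS + 1) * math.log10(2)

print(bcd_digits, spare_bits, decimal_float_digits)   # 5 3 6
print(round(binary_digits, 1))                        # 7.2
```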

Six digits may seem too small to be useful - it's only five decimal places after all - but two things should be taken into consideration. Firstly, this is for a 32-bit float; most of the time 64-bit floats are available, which would allow for a more respectable 12-digit number.

Secondly, the format for decimal floating point numbers is probably going to change as part of the revision to the floating point standard. Currently, if you want to start using decimal arithmetic and you're a Java programmer you're in luck, there's the java.math.BigDecimal class in the library.

If you're a C++ programmer there are plenty of libraries out there. The IBM one is in the early stages of development and contains known bugs, but as the IBM people are heavily involved in the associated extensions to C and C++, this library is probably going to resemble what will eventually be seen in the C++ language that bit more than the others.

This isn't all good news, however. Firstly, decimal floating point calculations are going to be mostly done in software until the new floating point standard is ratified and the hardware catches up. This means that for a while using decimal arithmetic is going to imply a performance cost and, depending on the constraints you're working to, that may or may not be acceptable.

Secondly, converting to and from base-2 isn't the only source of error in floating point calculations. It is, however, the one that we've all seen a bit too often and, although we're still going to need to understand about floats and numerical analysis to do serious things with floats, at least it won't be as easy to shoot yourself in the foot with the basics.
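To see that decimal floats still round, here is a brief Python sketch using the standard decimal module at its default precision of 28 digits (set explicitly so the example doesn't depend on the environment). Neither a third nor a very lopsided sum survives intact.

```python
from decimal import Decimal, getcontext

getcontext().prec = 28    # the module's default precision, set explicitly

# One third has no finite base-10 expansion, so it is cut off at 28 digits.
third = Decimal(1) / Decimal(3)
print(third * 3 == Decimal(1))      # False

# Adding a small number to a huge one: the 1 falls off the end of the
# 28 available digits, so subtracting the huge number again leaves zero.
big = Decimal("1e30")
small = Decimal(1)
leftover = (big + small) - big
print(leftover == small)            # False
```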

So let's be thankful that we no longer work in the dark days when storage was at a premium and that we can do things that were unthinkable in the past; such as using four digits to store the year and sacrificing a few bits to make life that bit easier for the poor misunderstood man in the trench writing the code that keeps our satellites in orbit. ®

References

http://www2.hursley.ibm.com/decimal/
<a href="http://www.petebecker.com/js%