Floating point numbers - what else can be done?
Column In a recent article here in The Register we saw some of the problems that result when floating point numbers are misused or chosen inappropriately.
Many people wrote in to say they had seen first hand some of the voodoo techniques we decried, so clearly we're in the midst of a numerical calculation crisis and, if we don't do something, there are going to be satellites falling from the skies around us - in itself undesirable, but so much more so when the satellite in question is the one the TV channels depend on.
In this article we're going to look at other ways of handling real numbers, including some upcoming extensions to the C and C++ languages that could well see floats become much less used.
To recap briefly, last time we looked at the approximation error in floating point numbers that results because floats and doubles represent real numbers as a fraction over 2^n. As we humans have 10 fingers, and we reserve the right to lay the foundations of our number system on such anatomical considerations, the values we deal with in software will often be some fraction over 10^n; for example, .37 is 37 over 10^2. Because there is no way to express this number exactly in a base-2 floating point format, there is a small approximation error, and we saw that this small approximation error turned into a big error when we tried to round and convert back to a base-10 real number.
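As a quick illustration of that recap, here is a minimal sketch in Python (our choice for brevity; the article itself is not tied to any language). It assumes the platform's float is an IEEE 754 double, which is the usual case. The variable names are ours, purely for illustration.

```python
from decimal import Decimal
from fractions import Fraction

# Fraction(float) and Decimal(float) both convert the stored binary value
# exactly, with no further rounding, so they show what is really in memory.
stored = Fraction(0.37)        # exact value of the double nearest to 0.37
intended = Fraction(37, 100)   # the value we actually meant: 37 over 10^2

# The stored value is a fraction over a power of two...
denominator = stored.denominator
is_power_of_two = denominator & (denominator - 1) == 0

# ...so it cannot coincide with 37/100, and a small error remains.
error = stored - intended

print("stored exactly :", Decimal(0.37))
print("power of two?  :", is_power_of_two)
print("error          :", float(error))
```

The error is tiny, well below one part in 10^15, which is exactly why it is so easy to overlook until a rounding step amplifies it.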
This time round we're going to see what the methods are for avoiding this type of error. The comments made about the first article suggested many approaches, so we're going to weigh up the pros and cons of each. The main contenders are fixed point numbers, rational numbers, and base-10 floating point numbers.
The idea behind fixed point numbers, implemented as scaled integers, is to fix a precision at the outset and use it consistently for all the operations involving a particular type of value. Take working with dollars and cents as an example. Instead of using a floating point number to represent the value '$1.37' we would use an integer to hold the value '137' and remember that the value has an implicit scaling factor of 10^-2.
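The dollars-and-cents idea might be sketched as follows. This is only one possible shape for it, not a recommended library: the names SCALE, to_cents and format_cents are hypothetical, and we have assumed that input with more than two decimal places should be rejected rather than silently rounded.

```python
SCALE = 100  # implicit scaling factor of 10^-2: one unit is one cent

def to_cents(text: str) -> int:
    """Parse a decimal string such as '1.37' into an integer count of cents."""
    negative = text.startswith("-")
    whole, _, frac = text.lstrip("-").partition(".")
    if len(frac) > 2:
        # Assumption: refuse input more precise than the chosen scale.
        raise ValueError(f"more than two decimal places: {text!r}")
    value = int(whole or "0") * SCALE + int(frac.ljust(2, "0"))
    return -value if negative else value

def format_cents(value: int) -> str:
    """Render an integer count of cents as a decimal string."""
    sign = "-" if value < 0 else ""
    whole, frac = divmod(abs(value), SCALE)
    return f"{sign}{whole}.{frac:02d}"

price = to_cents("1.37")                 # held as the integer 137
total = price + to_cents("0.10")         # plain integer addition, no rounding
ten_dimes = sum(to_cents("0.10") for _ in range(10))

print(price, format_cents(total), format_cents(ten_dimes))
```

Addition and subtraction are exact because they never leave the integers; the scale only matters when parsing, printing, or multiplying by something that is not a whole number.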
The advantage of this approach is its simplicity; we can use native data types and the integral operations built into our hardware so the storage is efficient and the calculations are fast.
However, the problem with this approach is its inflexibility. The least significant place is chosen early in a project and it's difficult to change afterwards. If a calculation produces a result more precise than the representation allows, the extra precision is lost by truncation. Such errors accumulate, and while steps can be taken to reduce them, those steps are hampered by encapsulation across function and class boundaries. Because flexibility and extensibility are important in software architecture, this is probably a sufficiently severe shortcoming to render this attractively simple solution unusable in many cases.
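To see how truncation accumulates, consider this small sketch. The 7.5 per cent rate, the 99 cent item and the function name tax_truncated are all made up for the example; the point is only that a helper returning whole cents throws away the fraction before the caller has a chance to use it.

```python
RATE_BP = 750  # hypothetical rate of 7.5%, expressed in basis points

def tax_truncated(cents: int) -> int:
    """Return the tax on an amount in cents, truncated to whole cents."""
    return cents * RATE_BP // 10_000

item = 99      # one item at $0.99
count = 100    # a hundred of them

# Tax computed item by item: 7.425 cents is truncated to 7 each time.
per_item_total = sum(tax_truncated(item) for _ in range(count))

# Tax computed once on the $99.00 total: 742.5 cents is truncated to 742.
on_total = tax_truncated(item * count)

lost = on_total - per_item_total
print(per_item_total, on_total, lost)   # prints: 700 742 42
```

Forty-two cents have gone missing purely because of where the truncation happened, and a caller who only ever sees tax_truncated's integer result has no way to recover them.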