Original URL: http://www.theregister.co.uk/2006/08/14/math_managing_defects/

Mathematical approaches to managing defects

Radical new approaches toward software testing needed?

By David Norfolk

Posted in Developer, 14th August 2006 14:49 GMT

Software testing is still a controversial subject – everybody agrees that it is a "good thing", but it is frequently the first bit of the process to get cut when deadlines bite.

After all, those sneaky testers are really responsible for the bugs, aren't they? Our software is just fine until strangers start poking around inside it, trying to stop it going out to our eager users. Mind you, I was taken aback once when I asked some people in a bank why they imposed silly deadlines on the IT group I worked for - and was told that they didn't expect our software to work anyway.

So, they said, they'd rather get it, broken, a year before they needed it, with a year to iron out the bugs before it was deployed; than get it just before they really needed it and risk disrupting operational business systems with broken software.

That was then, and it represented a very expensive approach to testing, but things don't seem to have improved much. Natalia Juristo and Ana M Moreno (Universidad Politécnica de Madrid) and Wolfgang Strigel (QA Labs), the guest editors of the July/August 2006 issue of IEEE Software (featuring Software Testing Practices in Industry), can still say: "Despite...obvious benefits, the state of software testing practice isn't as advanced as software development techniques overall. In fact, testing practices in industry generally aren't very sophisticated or effective. This might be due partly to the perceived higher satisfaction from developing something new...Also, many software engineers consider testers second-class citizens."

This highlights the fact that many of the issues with testing derive from a failure of process and the people carrying out the process, rather than from failures in technology or the supply of tools. After all, it is well known that defects are cheaper to remove the earlier that you find them, and cheapest of all if they're never introduced in the first place. But what chance is there of producing defect-free code if the most enjoyable part of the development process, for many programmers, is hunting bugs?

Unfortunately, if you only ever find some of the bugs in your code (as Myers pointed out in chapter one of The Art of Software Testing, it is "impractical, often impossible, to find all the errors in a program"), then the more bugs you put in (typically by guessing at something and expecting to supply the right answer while debugging the code), the more bugs there will be in the delivered product.

And yet, increasing legal regulation and concern with security issues make defects in delivered systems increasingly unacceptable. It is unlikely that "more of the same" will work any better than it ever has, so perhaps it is time to try radically new approaches to managing defect removal; and mathematically-based approaches might take some of the human issues out of the equation.

Bayesian Analysis and Formal Methods are examples of such approaches. They are established enough to be reasonably mature, but they are not yet widely employed in software development generally. Perhaps they should be.

Bayesian analysis, by Pan Pantziarka

Some of the hardest questions to answer in development are about whether testing is "finished": Have we done enough testing, where should we concentrate testing effort, and when do we release the software?

There are usually countless pressures influencing these decisions – with enormous penalties in terms of loss of prestige as well as financial consequences if the decisions are badly wrong – and yet very often we depend on "gut feel" for an answer.

Even when software is passed through a formal testing process, the question of when to stop testing is not an easy one to answer. Does the fact that a component or module has had a lot of defects picked up (and corrected) during testing, tell us more about the quality of the component or the efficacy of the tests?

Given the reality that we can never get the resources required to test as much as we would want, and, just as importantly, that the testing process is itself imperfect, is there anything better than intuition to help developers gauge when software is ready to roll?

One of the things that would help is an objective model of the quality of a package at any given phase of the development lifecycle. Such a model can then be used to predict accurately the number of defects that remain to be discovered at any stage in the development lifecycle. It then becomes possible to base the "when do we release" decision on something other than gut instinct.

This is precisely what Paul Krause of the University of Surrey set out to do with the Philips Software Centre (PSC), together with Martin Neil and Norman Fenton of Agena Ltd.

Using Bayesian Networks, they have developed a general model of the development processes at PSC, which has been applied to a number of different software projects (see the detailed research paper here, together with the references therein). Similar work has also been done at Motorola Research Labs in Basingstoke and at Qinetiq.

Bayesian Networks, also known as Bayesian Belief Networks or graphical probabilistic models, are ideal for tasks of this kind. They are a technique for representing causal relationships between events and utilising probability theory to reason about these events in the light of available evidence.

Set of nodes

A Bayesian Network consists of a set of nodes which represent the events of interest, and directed arrows which represent the influence of one event on another. Each node may take on a range of values or states – a node which represents a thermostat, for example, may have states corresponding to "hot" or "cold", or it could represent different temperature ranges or even a continuous temperature scale.

Probabilities are assigned to each node corresponding to a belief that the states it represents will take on those values. Where a node is influenced by other nodes, (i.e. it has inputs from other nodes), it is necessary to compute the conditional probability it takes on a given state based on the states of those causal nodes.

Bayes' Theorem is used to simplify the calculation of these conditional probabilities. When a node takes on a given state – for example, a thermostat with only two states reads "hot" – the probability for that state is set to one and the probability for the "cold" state is set to zero. This information is propagated through the network, updating the other nodes to which it is connected and resulting in a new set of "beliefs" about the domain being modelled.

Bayesian Networks can be used in a number of ways. Firstly, the structure of the network and the various probabilities mean that it is possible to use them for predictive purposes. In other words, one can say that given this structure and these facts, event x has y chance of occurring. Alternatively, the same network can be used to explain that event x took place because of the influence of events y and z. Reasoning can move in either direction between causes and effects.
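To make the two directions of reasoning concrete, here is a minimal sketch in Python of the smallest possible network: one cause node ("module is defective") with an arrow to one effect node ("test fails"). The node names, the function names and every probability are invented for illustration; none of them comes from the Philips work.

```python
# Hypothetical two-node network: Defective -> TestFails.
# All probabilities are invented for illustration only.
P_DEFECTIVE = 0.2                        # prior belief that a module is defective
P_FAIL_GIVEN = {True: 0.9, False: 0.1}   # P(test fails | module defective?)


def predict_fail(p_defective=P_DEFECTIVE):
    """Predictive reasoning (cause -> effect): P(test fails)."""
    return (P_FAIL_GIVEN[True] * p_defective
            + P_FAIL_GIVEN[False] * (1.0 - p_defective))


def explain_defective(test_failed, p_defective=P_DEFECTIVE):
    """Diagnostic reasoning (effect -> cause) using Bayes' Theorem:
    P(module defective | observed test outcome)."""
    # Likelihood of the observed outcome under each state of the cause node.
    like_def = P_FAIL_GIVEN[True] if test_failed else 1.0 - P_FAIL_GIVEN[True]
    like_ok = P_FAIL_GIVEN[False] if test_failed else 1.0 - P_FAIL_GIVEN[False]
    joint_def = like_def * p_defective
    joint_ok = like_ok * (1.0 - p_defective)
    # Normalise so the two posterior beliefs sum to one.
    return joint_def / (joint_def + joint_ok)


if __name__ == "__main__":
    print(f"P(test fails)              = {predict_fail():.3f}")
    print(f"P(defective | test failed) = {explain_defective(True):.3f}")
    print(f"P(defective | test passed) = {explain_defective(False):.3f}")
    # Entering evidence: fix the cause node's state by setting its probability to one.
    print(f"P(test fails | defective)  = {predict_fail(1.0):.3f}")
```

With these made-up numbers, a failed test raises the belief that the module is defective from 0.2 to roughly 0.69, while a passed test lowers it to under 0.03. A real network has many nodes and needs a proper propagation algorithm; this sketch only shows the arithmetic for a single arrow.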

Applying these principles to software development at Philips, the team created, and linked, Bayesian Networks for every stage in the lifecycle – from specification through to design and coding, unit test and integration. Using an approach pioneered in previous research projects, the sub-networks for each phase were constructed from a set of templates, leading to an approach that Fenton and Neil dubbed object-oriented Bayesian Networks.

The end-result was called AID (Assess, Improve, Decide). The model takes in data about the type of product (number and scale of components, experience of the developers etc), and other data relevant for each phase of the life-cycle, and is able to deliver an estimate of the number of defects at any point in the process. The network was validated by using historical data for a number of projects and comparing estimated defects with those actually found.

The results have been very encouraging and the AID tool is being further developed so it can be used in a production environment. One other property of Bayesian Networks is that techniques for "theory revision" – or learning from experience – exist, so that data from each project can be used to refine and improve the network.

Many of the lessons learned from the work at Philips – such as dynamic discretisation of probability intervals – have been incorporated into AgenaRisk, a tool which can be used to build software defect risk models. While we are a long way from having such Bayesian models available as Eclipse or Visual Studio add-ins, the work is progressing in the right direction, and once the results start to trickle out from research labs and into the wild, perhaps the answers to those hard questions won't seem so shrouded in doubt after all.

Formal methods, by David Norfolk

A requirements specification is, ideally, a rigorous logical definition of a business process, while code is an unambiguous statement of program logic; so, in principle, you can compare the two mathematically and prove (subject to Gödel and Turing, I suppose) that the code satisfies the corresponding requirements. If the maths is right, there's no need to test against spec.

This use of "formal methods" is usually thought, by the general public, to only work for trivially small pieces of code, but Praxis High Integrity Systems (Praxis-his or, increasingly, just Praxis; a UK consultancy in Bath, specialising in security and safety critical applications) asked the question some years ago: "Is proof more effective than testing" for industrial scale programs?

It came up with the answer that "proof appears to be substantially more efficient at finding faults than the most efficient testing phase". This implies, of course, that you use both proof and testing on the project, where each technique is appropriate (even though proof finds some errors more cost-effectively than testing finds others, proof may not be able to find all errors).

I was impressed some time ago by the way in which Praxis used its pragmatic combination of formal methods and conventional testing on the SHOLIS (Ship Helicopter Operating Limits Information System) for the UK MOD. See Is Proof More Cost-Effective Than Testing? by Steve King, Jonathan Hammond, Rod Chapman and Andy Pryor, IEEE Transactions on Software Engineering vol 26, Number 8, Aug 2000, here.

I recently went back to Praxis to see how this approach has developed. In the world of formal methods, simply remaining in business with an expanding customer-base is a measure of success, which Praxis has certainly achieved.

What Praxis now has is a named, documented process, "Correctness by Construction" (CbyC): build it right in the first place (instead of the more usual "construction by correction", that is, build it wrong and fix the errors afterwards).

This appears to work: at one level, Praxis now seems able to offer a warranty on its software, for any departures from spec; at another level, its programmers don't bother to use code debuggers, because the code is correct as delivered (you still need acceptance testing, but to show that the system works rather than to find errors).

There is technology behind this – a special language, SPARK, that supports formal verification; and smart tools to compare the formal spec with the SPARK code and to verify the code for completeness, logical consistency and so on - but the technology isn’t the main thing.

Praxis chief technical officer (software engineering) Peter Amey points out that Microsoft has superb technology, using similar mathematics to that behind CbyC, to help identify bugs in, say, device drivers, as part of a certification process; but how much more cost-effective to supply formal device driver interface specs and build device drivers correctly in the first place, rather than to certify them after they're built.

The CbyC principles are set out in a paper describing Praxis' latest project for the NSA (published in ISSSE '06, the proceedings of the 1st IEEE International Symposium on Secure Software Engineering, March 2006): Engineering the Tokeneer Enclave Protection Software. Roughly speaking, these are:

Most of this is conventional wisdom – although not always put into practice – so what makes CbyC different? Well, the system specification is written in Z, a formal specification notation based on set theory and first order predicate logic and developed in the seventies by the Programming Research Group at the Oxford University Computing Laboratory (OUCL).

There is a FAQ here and it has a respectable commercial pedigree: in 1992, the OUCL and IBM were jointly awarded the Queen's Award for Industry for the use of the Z notation in the production of IBM's mainframe CICS (Customer Information Control System) products.

Then, the system is written in SPARK, which is a subset of Ada with extra notation ("comments") to support design by contract (pioneered and trademarked by Eiffel), static analysis and program proof.
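For readers who haven't met design by contract, the following is a loose sketch of the idea in Python rather than SPARK. SPARK states its contracts as annotations that tools can check statically, before the program ever runs; this sketch only checks them at run time, which is a much weaker guarantee. The `ContractError`, `require` and `clamp` names are hypothetical and exist purely for illustration.

```python
class ContractError(Exception):
    """Raised when a precondition or postcondition does not hold."""


def require(condition, message):
    """Check one clause of a contract; fail loudly if it is broken."""
    if not condition:
        raise ContractError(message)


def clamp(value, low, high):
    """Return `value` limited to the closed range [low, high]."""
    # Precondition: the caller's obligation before the call.
    require(low <= high, "precondition: low must not exceed high")

    result = max(low, min(value, high))

    # Postconditions: the routine's promise to the caller.
    require(low <= result <= high, "postcondition: result must lie in range")
    require(result == value or not (low <= value <= high),
            "postcondition: an in-range value must be returned unchanged")
    return result


if __name__ == "__main__":
    print(clamp(5, 0, 10))
    print(clamp(-3, 0, 10))
    print(clamp(42, 0, 10))
```

The contract spells out who is responsible for what: a call such as `clamp(1, 5, 0)` is the caller's error and is rejected, while any result outside the range would be the routine's error. A static verifier aims to prove that neither can happen, rather than waiting for a failure at run time.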

Praxis has developed tools that help you automate the verification of the specification and the comparison of the unambiguous spec with the equally unambiguous SPARK code.

If the two don't differ, the only opportunity for defects in your system is that the spec solves the wrong problem (you can verify it for completeness and consistency) – the resources that you no longer need for debugging your code can be devoted to analysing the business domain and ensuring that you're solving the right problem.

This really does work, according to Peter Amey, who has metrics (and that in itself is a sign of a mature process) showing a steady decline in delivered defects over the last decade using CbyC and a steady increase in productivity.

"Of course," he says, "we benefit from Moore's Law, all that unused CPU power can power our verification and proving tools."

He seems to be especially proud of the work Praxis did for the NSA: "The NSA concluded two rather interesting things: (1) the formally-based CbyC development was cheaper than traditional approaches and (2) the software we delivered had zero defects," he claims (see Conclusions in the previously-quoted paper here).

Cultural issues

So, why aren't we all using SPARK? There are cultural issues, which mean that CbyC is easier to introduce in a greenfield site. People are frightened of math and proof – and Ada. People whose status comes from their prowess in writing and debugging C++ are unlikely to recommend CbyC to their managers.

And adopting CbyC is a bit of a leap of faith for people unused to proof and formal methods – suppose it is only suitable for simple safety-critical embedded systems and can't cope with the complexity of your business processes?

That last one can only be answered by you yourself reviewing the published case studies here – but how safety-critical, for your career, are the financial control systems your CEO signs off (on pain of a possible jail sentence) to the regulators?

But what about all the modern innovations such as eXtreme Programming and UML (or, rather, the world of Model Driven Architecture, MDA, as UML is just a modelling language)? Does CbyC mean throwing these out? Not exactly, says Peter Amey.

In Static Verification and Extreme Programming (published in Proceedings of the ACM SIGAda Annual International Conference, available here), he and co-author Rod Chapman say: "We were both surprised and pleased to find out how much XP we already do on high-integrity projects."

And, they consider that coding with a human designer and a static-analysis tool such as SPARK Examiner is logically equivalent to pair programming as described by Kent Beck. They posit that the reason Beck doesn't talk about static analysis in an XP context is that the depth it can offer in conjunction with imprecise languages like Java is very limited; and the inefficiency (lack of speed) of static analysis tools not written in and working on something like SPARK can make it infeasible.

As for UML, Amey considers that SPARK confers precision onto the UML model and makes verification of the generated code easier (see High-Integrity Ada in a UML and C World, Lecture Notes in Computer Science 3063 here).

In fact, he believes that using the UML modelling process in conjunction with SPARK formal verification and auto-generation of C from validated SPARK can deliver more robust C. Writing in 2004, however, he considers that "the semantics [in UML alone] are not rich enough for the rigorous reasoning we require in the production of quality software".

However, I believe that this may no longer be so true for UML 2.0, potentially at least, partly because of its well-thought-through metamodel, which is designed to facilitate UML extension; and partly because of the level of semantic detail that can be supported with the Object Constraint Language.

MDA already supports many of the principles behind CbyC (such as generating new deliverables by automatic transformation of previous deliverables, rather than by duplicating and rewriting them), and perhaps the future of "formal methods" (as used in CbyC) for general software development could lie in their incorporation into MDA processes.

For more detail on SPARK, read John Barnes' book High Integrity Software: The SPARK Approach to Safety and Security.

As for formal methods generally, there is a wealth of information at Professor John Knight's University of Virginia website here.

Summary

It would appear that computer science is ready to move software development to a new level, so that the idea of software development becomes more of an engineering practice than a black art – although we could equally well have said that at any time in the last 20 years or so.

However, there are still numerous practical problems to overcome. Formal methods require a level of expertise that is missing from many development shops. The range of problems that the techniques have been applied to is also limited – embedded systems, by their nature, represent a limited universe compared to a highly distributed environment with a feature-rich user interface. Similarly, the Bayesian approach to testing and defect prediction shows great promise; but to date the work has not been generally applied.

However, without more rigorous, one could say more scientific, approaches, the problems of defective software are unlikely to disappear. And, adoption of these new approaches will need management buy-in to technology risk management.

Research undertaken for HP Services with 10 per cent of the top 250 FTSE companies some years ago has shown that IT risk management is starting to become a board-level concern, perhaps following on from the Y2K debacle; although the IT director still has specific responsibility for this.

The definition of IT risk amongst those sampled is quite discriminating: "We wouldn't really regard it as IT risk...we'd regard it as information security risk or systems development risk," according to one manager.

Nevertheless, although management does now often take responsibility for technology risk management overall, it appears, anecdotally, that the board may not always be fully aware of the risks associated with the lack of adequate testing. This may sometimes limit management support for the radical new approaches that could help address these risks. ®