Original URL: http://www.theregister.co.uk/2013/08/27/data_mining_for_the_ppi_scandal/

Loads of mis-sold PPI, but WHO will claim? This man's paid to find out

Data mining to fathom the depths of banking's balls-up

By Bob Dormon

Posted in Applications, 27th August 2013 11:05 GMT

Feature When a conversation opens with, “I read an interesting number the other day”, it’s fairly safe to assume that you’re talking to someone whose business it is to know about "interesting numbers". Perhaps unsurprisingly, these words were uttered by an economist whose ability to find gold in the numbers is the reason why he’s been working for one of those very naughty High Street banks to figure out just how much the PPI scandal is going to cost it.

These days, the banks are very sensitive to any kind of media exposure, which is why this data miner has asked to remain anonymous; we'll call him Cole.

Diamonds in the data: SAS Enterprise Miner tool in use for banking analytics

Cole, who has a background in analytics, has to dig into some big data: his client has a hefty slice of those policies to work through, contracts running into the millions.

It's estimated that all the High Street banks combined have 20 million PPI policies to deal with (not all of which will have been mis-sold), but not all of the policyholders are going to play PPI bingo, and that’s the catch. If every case were genuine and everybody applied, all the banks concerned would know what it would cost them straight away.

The fact is, not everyone will be bothered to follow up on compensation for the Payment Protection Insurance they were mis-sold. The mis-selling of said PPI has rocked the banking industry since the major rumblings on this massive financial fiasco began back in 2005. And not knowing what it's going to cost is troubling for the banks. So they've been using data mining techniques to ascertain the types of customer likely to seek compensation and so derive more accurate estimates, which is where Cole's expertise comes into play.

Incidentally, the interesting number he was talking about was the statistical claim that, at the moment, 90 per cent of the data stored on servers worldwide was collected within the last two years. In this business, the term "growth industry" appears to be a huge understatement.

Folk record collection

What makes the PPI models rather more involved – as compared to trawling text from tweets and peeking at the contents of your shopping basket – is that the data is historical. It goes back 20 years or more and involves the collation of records that have been migrated from systems long since dead, together with hard copies that have to be scanned in. Lest we forget, he also needs to consider various bank mergers and their seemingly requisite system incompatibilities along the way. This isn’t the neatly packaged analytics of today’s e-commerce; it’s a bit of a mess and needs meticulous handling.

So who has been keeping this information? Where does it all live? Just how do you turn up at your desk one day and begin the task of mining data from 20 million records covering two decades?

Cole offers some background to this accumulation of records and its use today. He sees big data as having developed in several stages.

“In the 1990s and after the millennium, big data was collated in data warehouses as relational databases. Consultancies earned a lot of money in the 1990s from building data warehouses - collating all transactional data, customer data (all sorts of data). After that came a period, over the last five to 10 years, where the focus has been more on the applications to utilise almost all of the data. And I come from the applications angle.

Moving the masses: Data mining information flow chart

"Then the next phase I see, to get to the data-mining part, is the exercise where you get all the data into a format where you can actually start analysing it. Big data, as it is, is not really fit for purpose in terms of getting inside analytics out of it. So analysts tend to build their own data marts on their own computers.”

If you’ve never heard of a data mart before then you’re not alone. Analysts can work from the data warehouse content, but creating a data mart is the way forward: syphoning off a specific range of data and narrowing down the areas you are interested in analysing – for instance, certain companies, time periods or particular regional locations.

There’s direct access to the data on big servers from Teradata, as well as other sources such as Oracle databases and the like. Depending on the task in hand, several types of data mart are built and, needless to say, the work takes place on copies.
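To make the idea concrete, here’s a minimal sketch in R of carving a small mart out of a warehouse extract. Everything in it – the table, the column names, the filter criteria – is hypothetical; in practice the source would be a copy pulled from Teradata or an Oracle database rather than a hand-typed data frame.

```r
# A toy "warehouse extract" -- hypothetical accounts and columns.
warehouse <- data.frame(
  account_id   = 1:6,
  product      = c("PPI", "Mortgage", "PPI", "Loan", "PPI", "PPI"),
  region       = c("North", "South", "North", "North", "South", "North"),
  sold_date    = as.Date(c("2004-03-01", "2006-07-12", "2007-01-20",
                           "2008-05-05", "2009-11-30", "2010-02-14")),
  premium_paid = c(900, 1200, 1450, 800, 760, 1150)
)

# The "syphoning off" step: narrow down to the product, period and region
# of interest, keeping only the columns the analysis needs.
ppi_mart <- subset(
  warehouse,
  product == "PPI" & sold_date >= as.Date("2005-01-01") & region == "North",
  select = c(account_id, sold_date, premium_paid)
)
print(ppi_mart)
```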

Start with a mart

“We have built a data mart at the bank specifically to cover all the PPI analytics,” explains Cole. “It contains all the bank's PPI accounts that have been sold and all the data related to those policies – millions of records. In the data warehouse, there are a lot of data sources – different corporations and all sorts of different data formats coming in. You then collate whatever you need for your particular project or objective. You then build your mart for specific tasks – marts are not permanent.”

The way the data is handled varies between the analytical and operational phases of the work; consequently there is the analytical data store (ADS) and also all sorts of operational data stores (ODS). The information in the latter is acted upon and used for various campaigns and for targeting specific types of customer. Hence, the final phase is about implementing analytical tools that can make good use of the actual data.

“For analytics, what you’re looking to get is maybe not all of it but definitely the full breadth of the data, so you may not need every single record. Then, when you get to the operational side of things, where you deploy your analytics, you may only need a much smaller part but you will need that for every customer.

"So in terms of the storage with our mart we’ve built here is half a terabyte, and I think we’ve used 95 per cent of that space and there is a upgrade underway. As you can see, it quickly adds up. But analytics is not really so much dependent on size and storage, you can do analytics on small pockets of data, it all depends on what you really want to get out of it.”

Called to account

While all these banks seek to utilise big data to both harmonise accounts and clarify their position regarding the PPI payouts, one of the major players has another big data task on its hands. The UK government bailed out Lloyds during the financial crisis, taking a 43.4 per cent stake in the ailing bank in 2009. However, the European Union regarded this as state aid and demanded a sell-off to comply with competition rules.

In a project codenamed Verde, Lloyds set about divesting some 630 branches. Its attempt to sell them off to the Co-operative Bank failed recently, as the potential buyer got cold feet in the current economic climate. Yet Lloyds continues the work unabated and intends to float the ready-made bank, branded TSB, via an IPO instead.

Lloyds has had its own PPI issues to deal with, but this is an entirely different project, and it is nonetheless interesting as it is the reverse of merging – a necessary process in order to select the 630 branches for the sell-off and work out what to do with their customers. Lloyds has even set up its own bank transfer website to explain the situation to its various account-holders.

Cole has his own take on the issues that this task involves. "Customers don’t think of a branch, they think they are a customer of a bank. Now they’re going to be banking with a newly formed company. So there will be cases where you have a joint account, your wife can have an account in another company, but you have a joint mortgage and things like that, it’s massively complicated. From what I can see, it’s an equally difficult exercise to split up the data as it is to merge it.”

Quality control

Regardless of whether you’re separating out the data or bringing it all together, data quality is the biggest issue that needs to be addressed before any major number-crunching begins. Cole also speaks of "holes" in the data, where information is missing – such as home address or date of birth. “You’d be surprised to see how many unknown genders there are. That’s interesting,” says Cole.

Determining different data classifications is another aspect that clarifies the picture that’s being built up around a customer. Cole says he usually distinguishes between two types of data: behavioural and profile. Behavioural data is typically an accumulation of transactions relating to customer activity, such as purchases or website visits. It ends up in a database where it remains unchanged, simply building up over time. According to Cole, it’s probably the most valuable source of data that can be collected.

“You can ask someone how often they shop in that supermarket and they will say once a week or twice a month but behavioural data will show exactly how often they shop and what they buy.”

By contrast, profile data – or research data – is data that can change: marital status, where you live and what you do. Working on filling in these gaps is just one aspect of a data-mining project, as Cole explains.

“One part of the process is to try to make your data better. So where there is missing information, you try to guess. This includes what the gender would be or if you don’t know the income for that person, you make an estimate or you model it based on all the information [you have] on all the other customers.”

This goes beyond just using a postcode and can draw on particular spending patterns. When it comes to filling in the gaps, nothing goes to waste. While there are exceptions, it’s far too time-consuming to laboriously go through every customer profile with missing income details to fathom out a likely figure.
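As a flavour of what "modelling it based on all the other customers" might look like, here’s a minimal sketch in R: fit a simple model on the customers whose income is known, then use it to fill in the blanks. The customer table and the choice of a plain linear model are assumptions made purely for illustration.

```r
# Hypothetical customer records with some incomes missing.
customers <- data.frame(
  age        = c(25, 34, 45, 52, 29, 61, 38, 47),
  years_held = c(1, 5, 12, 20, 3, 25, 8, 15),
  income     = c(21000, 29000, 41000, NA, 24000, NA, 33000, 39000)
)

# Fit a simple model on the customers whose income is known...
known <- subset(customers, !is.na(income))
fit   <- lm(income ~ age + years_held, data = known)

# ...then predict (impute) income for the ones where it is missing.
missing_rows <- is.na(customers$income)
customers$income[missing_rows] <- predict(fit, newdata = customers[missing_rows, ])
print(customers)
```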

“That’s where the data mining comes in,” says Cole. “You would then build algorithms that will use all the data to make that prediction. Alternatively, you can look at the average or examine a certain range of data – there are a lot of different ways to approach it. In your application you are cleaning the data. That means filling out the blanks and simply checking for errors. For example, a phone number typed into the age field, things like that. Looking for outliers. Again in the analytics you’re interested in the breadth, but you’re also interested in what is coming across.”
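The cleaning Cole describes – catching a phone number typed into the age field, hunting for outliers – boils down to sanity checks like the ones sketched below. The records and the thresholds are, again, invented for illustration.

```r
# Hypothetical records: one "age" is clearly a phone number typed into
# the wrong field, and one income is extreme.
records <- data.frame(
  customer_id = 1:5,
  age         = c(34, 52, 2075551234, 41, 28),
  income      = c(28000, 45000, 31000, 900000, 26000)
)

# Range check: implausible ages become NA so they can be imputed later
# rather than poisoning the analysis.
records$age[records$age < 18 | records$age > 110] <- NA

# Flag outliers with a robust z-score: distance from the median measured
# in median absolute deviations.
robust_z <- abs(records$income - median(records$income)) / mad(records$income)
records$income_outlier <- robust_z > 3
print(records)
```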

Meaningful relationships

"Data quality is the biggest issue when you start getting into your task and working with the data. You have a lot of data and you look for relationships, but if you then have something extreme [an outlier] appearing then that could change the whole relationship and create an inaccurate picture. So it’s all about cleaning. Then, by creating these other factors from thousands of data [fields], you’re creating a more manageable amount of factors. You know what you’re looking at in terms of data on the screen.
Coding with the open source Revolution Analytics R Studio


"The big exercise in the data prep is to get to understand the distribution in the data and the variation. You need variation between two things in order to assess if there is a relationship. If there is no variation, you can’t really say anything from that data. So there’s a lot preparation going on and you’re also normalising data – you’re splitting it up – all sorts of statistical things. You have to massage the data to put it in a form that you can use to run your algorithms. All that you’re doing is programming, writing code.

"You then run your algorithms and select your best algorithms. You get statistics on your screen and you make decisions – it’s often a rigid process. The output could be a credit score or a just number. Or it can be a segment which you would then profile after that. You would send that segment to a marketeer who would then come up with a fancy name for it.

"There’s an operation, a commercial aspect and there’s an insight. And you always try to gain insights because that will help you next time you do the same exercise."

Prophecy and loss

For the PPI work, the servers get reloaded every week, but other projects might run daily. If you’re handling historic data, namely decades-old insurance policies, you might ask yourself how fresh data can assist you. Yet for many of the bank's customers, their PPI policy will also have a separate account from the bank attached to it, and this is a rich source of behavioural data. It’s a way of understanding who you are dealing with: are they likely to apply for a PPI refund or will they let it go?

Cole adds: “Recency* is a very important factor when you are analysing data. If you want to figure out what a customer would do in the future, the more recent behaviour is usually a much better indicator of their actions. A lot of the work is about trying to figure out what is going to happen in the future by looking at what happened in the past. That’s a typical domain for data mining and data mining analysts.

"For example, what I’ve also been involved in is to try to figure out if people are likely to default on a loan. So [you] look at a similar group of people, how they’ve behaved in the past and you make your assessment.”

And it is precisely this capacity of big data to reveal the likely actions of vast numbers of customers that the bank has tasked Cole and his colleagues with exploiting to estimate the cost of PPI. If you can determine how certain groups of people are likely to behave, it helps reduce the guesswork involved, so that realistic figures can be delivered that the markets and investors can swallow.

Cole has his own example of how recency has assisted his PPI work. “In this case, we have figured out the more recent the loan, the more likely there is going to be a mis-selling complaint. So that’s an important driver in order to predict whether there would be a complaint or not.”
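A bare-bones version of that kind of model is sketched below: a logistic regression in R that uses how recently the policy was sold, plus a couple of other invented fields, to estimate the probability of a complaint. The data is simulated and the model choice is an assumption for illustration – it is not the bank's actual model.

```r
set.seed(7)
n <- 500

# Hypothetical training data: policies whose outcome is already known.
policies <- data.frame(
  years_since_sale = runif(n, 0, 15),              # how recent the loan is
  loan_amount      = round(runif(n, 1000, 20000)),
  products_held    = sample(1:6, n, replace = TRUE)
)

# Simulate the outcome so that more recent sales complain more often,
# echoing the driver Cole describes.
p <- plogis(1.2 - 0.25 * policies$years_since_sale)
policies$complained <- rbinom(n, 1, p)

# Fit the model and inspect which fields drive the prediction.
model <- glm(complained ~ years_since_sale + loan_amount + products_held,
             data = policies, family = binomial)
summary(model)

# Score a new, hypothetical policyholder: a loan sold two years ago.
predict(model,
        newdata = data.frame(years_since_sale = 2,
                             loan_amount = 8000,
                             products_held = 3),
        type = "response")
```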

But not everyone will complain, so surely the bank can take it in its stride as complaints ebb and flow. Not so: all the banks involved in the PPI scandal have a serious incentive to get these complaints of mis-selling dealt with as quickly as possible, as Cole explains.

“The commercial aspect here is that customers are earning interest on that PPI premium that they’re going to be repaid. So the banks have a vested interest in trying to get these complaints sorted as quickly as possible. They are paying 8 per cent interest.” He adds, jokingly, “If you have been mis-sold, it’s the best savings account you can have.”
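As a rough back-of-the-envelope sketch of why that matters – assuming the 8 per cent is simple annual interest on each refunded premium, and with entirely invented premium figures – the redress on a single policy stacks up quickly:

```r
# Invented premium history: £25 a month for five years, with the last
# payment made five years ago.
premiums   <- rep(25, 60)
months_ago <- seq(from = 119, by = -1, length.out = 60)

# Refund the premiums plus 8 per cent simple annual interest on each one.
interest <- premiums * 0.08 * (months_ago / 12)
redress  <- sum(premiums) + sum(interest)
round(redress, 2)   # roughly £2,395 on £1,500 of premiums
```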

Tools of the trade

SAS even has an iPad app: Mobile BI displays visual analytics

As data mining continues to grow, many recruitment agencies are now specialising in finding personnel with these skillsets. As you can imagine, how highly sought after you become is driven by the applications you can use and what sort of applications the companies have installed. If there's one application to learn to get a start in analytics work, Cole suggests you take a look at SAS.

“SAS is something that they teach at the university that I went to,” says Cole, “and the company is probably the biggest supplier of statistical analytical software. There are other tools also, but for statistical analysis, you should know it. It involves using standard programming tools, as most of the work is done in programming, and you can build applications that run on top of that for other people to use.”

SAS products don’t come cheap and the portfolio covers a huge range of business analytics applications. The site is worth a visit as it features numerous tutorials and the odd demo, but perhaps the best way to get your hands dirty and do some number-crunching is to consider an open-source alternative such as R, which Revolution Analytics packages as Revolution R. Cole is a fan too.

“I’m also teaching myself R. It is more specifically aimed at statistical analysis and given that it’s open source, anyone can download applications or if they’ve developed one, they can upload it for everyone else to use.”

Revolution Analytics R has a plentiful supply of packages

It’s this aspect of R that appeals to Cole, as it has the potential to provide him with a much larger toolset that’s specifically designed for statistical analytics. “In SAS,” he says, “the main tools have a lot of functions, but then you have to build your own applications.”
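For anyone tempted to follow Cole into R, pulling in one of those community-contributed toolkits is a one-liner per package. The package picked here – rpart, for decision trees – is just an example of the sort of thing a data miner might reach for, and the model on R's built-in iris data is only there to prove the round trip works.

```r
# Install a contributed package from CRAN (needed only once)...
install.packages("rpart")

# ...then load it and grow a quick classification tree on a built-in
# data set, just to show the journey from download to model.
library(rpart)
tree <- rpart(Species ~ ., data = iris)
print(tree)
```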

Using Revolution R may well prove to be a useful vehicle for evangelising the benefits of data mining for companies that aren’t permanent members of the FTSE 100, as he explains.

“My initial idea is you would be able to take this type of analytics to smaller companies that cannot afford to invest in the big applications. These businesses have accumulated a lot of data in the last two to 10 years and have their own small big data. Many online companies have a huge amount of behavioural data from customers visiting and shopping on their sites too, but they don’t have the money or the skills to use the data they have collected.”

How these small companies would utilise their data caches remains to be seen but there's no escaping the fact that if you do something that can be logged, then somebody out there will be interested in knowing about it and prepared to pay to find out.

The Community version of Revolution R is freely available for Windows and Red Hat Linux 5 in 32/64-bit flavours. It installed without a hitch on Windows 8 running on an Acer Aspire P3 Ultrabook here at The Reg. At a glance, it looks very much like an application that’s designed for people who are well versed in the dark arts of statistical analytics.

Monetary policy

Needless to say, the banks have been keen to bring these cases to a close and are duty-bound to contact a large proportion of these policyholders and send at least one letter. In theory, they should contact all policyholders but there is an agreement now to target policies from 2005 until 2011, when the selling of PPI products ceased.

“My task now is to look at the complaints we have already had to try to predict who are more likely to complain than others, and then focus on those policyholders to speed up the process, which would be in everyone’s interests.”
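Operationally, that targeting step boils down to scoring every open policy with the fitted model and working down the list from the most likely complainers. In the sketch below the scores are simply simulated numbers standing in for a model's output, and the cut at the top 10 per cent is an arbitrary choice for illustration.

```r
set.seed(42)

# Hypothetical open policies, each with a predicted probability of
# complaint (simulated here in place of a real model's scores).
open_policies <- data.frame(
  policy_id   = sprintf("P%05d", 1:10000),
  p_complaint = rbeta(10000, 2, 8)
)

# Rank from most to least likely to complain...
open_policies <- open_policies[order(-open_policies$p_complaint), ]

# ...and pull out the top 10 per cent to contact first.
top_decile <- head(open_policies, nrow(open_policies) %/% 10)
summary(top_decile$p_complaint)
```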

FCA figures for PPI payouts since January 2011: the total hit £10.5bn in May 2013. Mis-selling may have been rife, but the banks have had to pay out massive compensation, and the claims continue to this day.

Apparently there’s no deadline for these cases either. Even though the British Bankers' Association appealed, the High Court ruling in April 2011 went against the banks. In the various court wranglings, the words "unfair relationship" and "mis-selling" have been used to describe the impact of some PPI policies on unsuspecting customers.

In a way, data mining is being used to calm nerves: by identifying groups of likely claimants, it means board members no longer lose sleep worrying over the losses they would face if every policyholder applied for a refund. After all, the task is not simply recording and reporting the structures you already know about – that’s just a matter of numbers going up and down. By contrast, data mining is about looking for patterns that you were not aware of or wouldn’t have thought of.

“You can look for correlations in all sorts of lifestyles such as the sale of goats' cheese and wine in a supermarket, but that doesn’t necessarily explain anything. One could be the function of the other or vice versa – or both of them could be a function of something completely different."
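Cole's goats'-cheese caution is easy to reproduce: two product lines can look strongly correlated simply because both respond to a third factor, such as overall store footfall. The weekly numbers below are simulated to make exactly that point.

```r
set.seed(3)
weeks    <- 52
footfall <- rnorm(weeks, mean = 10000, sd = 2000)   # the hidden common driver

# Neither product drives the other; both just scale with footfall plus noise.
cheese_sales <- 0.02 * footfall + rnorm(weeks, sd = 20)
wine_sales   <- 0.05 * footfall + rnorm(weeks, sd = 50)

# A strong correlation shows up all the same.
cor(cheese_sales, wine_sales)
```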

Confessions of a cash machine

Cole adds: "What I found is that customers with a deeper relationship with the bank are more likely to complain. For instance, customers that were using ATMs more – checking their account balance regularly – and those who have had more products over a lifetime (more active products), things like that. The ATM connection isn’t something you would have thought about up front but the data mining put the spotlight on this.”

These kinds of insights become apparent as data is handled in different ways to refine the results. “Data analysts tend not to have the greatest discipline when managing the data they use. I think it’s inherent in the way you work, as you need to be open to new ways of looking at things. It’s not called data mining for nothing,” says Cole. “It’s a good term because it illustrates that you don’t always know what you’re looking for.” He describes a typical scenario and how data mining insights are adopted by a company.

“So you have your data and you’ve identified pockets using a statistical algorithm. The next job is to try to profile it: what’s the age, gender, where do they live? If that’s what the marketing requirement is. If it’s a matter of risk then you’ll see what is their income, when did they last default on a loan?

"When we get to the application stage, there are often three areas. One is the direct application of your output, so whether it’s some sort of segmentation or credit score or ranking of customers likely to complain, that would go into the operational parts (and be used operationally by those doing the campaigns).

"Then there’s the commercial aspect of it. What’s the impact on the profit and loss, and the revenue? By how much could the bank reduce its burden of interest payments if it were to push forward a lot of complaints? And then there are the insights of gaining customer understanding.

"In this case, it’s to discover who are the complainers or who are the more risky people. When you get to this stage you need illustrative PowerPoint slides, of course, with a few data points and a few charts.”

The insight track

For Cole, insights that come from data mining are not simply received on the nod. Presentations are made to internal stakeholders and, when you're dealing with a major High Street bank, acceptance of your findings involves numerous steps along the way.

“There’s often a lot of spin work to be done when you have created insights to actually get it into the organisation,” he concedes.

So after all that effort in the search for insights using big data derived from 20 million records – along with analytics and algorithms – at the end of the day, there’s still a person in a suit who decides if it’s actually of any significance to the company or not.

It’s a pity that we don’t have those people in banks any more to decide the merits of individual customer loan applications. Instead, data mining of your past will determine your future and, just as High Street banking has found out with the PPI mis-selling, these things can come back to haunt you. ®

* Our dear, dear grammar gurus: Yes, it's "industry-specific" terminology, which essentially specifies how much further back in time the data capture is compared to the present moment (which is also constantly shifting forward through time)... And while you could say "how recently it was captured" instead of saying "recency", it takes more words to do the former, and statisticians and analysts appreciate the brevity of the latter. Just accept it and move on. We have. – Sub Ed