Number-crunching in the Cloud
Wave bye-bye to old-school data analytics
Back in the mid-nineties every PC in your organisation potentially contained software that could destroy your company overnight. Not a virus, nor a Trojan: it was called the spreadsheet.
The spreadsheet was – and still is – broken by design. The vital raw data it crunches may be exposed to view, concealed behind the cells or parachuting in from somewhere outside the spreadsheet.
Mixed up in the same visible or invisible cells is the business logic used in its calculations. And unless rigorously locked down, each copy of every spreadsheet in circulation can be individually "improved" by its user, intentionally or by accident.
But the output from the spreadsheet, on the screen for the local worker, or in a paper report sent up to the boardroom, looked really nice, really authoritative. Which, of course, was the problem.
Today's tsunami of data input, and the changing requirement to get output directly to the decision makers in the field, has largely put paid to that kind of spreadsheet. Tom Nolle of Cimi Corp remembers the days when vital decision making depended on "ten thousand spreadsheets within the worker population".
Today we try to pull that all together to get a handle on data integrity. The subject is one of Nolle's specialities. Cimi Corp is a strategic consulting company that assesses trends and tries to build a picture of the future of telecommunications, media, and technology (TMT).
The 'future' of business analytics
In a web-based teleconference under the aegis of Internet Evolution, Nolle maps out the changing landscape of business analytics. "There was a time when business intelligence meant sending the information to the boardroom," he says. "Today it means sending the information to everyone in the company that has to make a decision. And they need the data when they make the decision – not some time in anticipation, and not when it's too late."
What data? There's all the traditional stuff your in-house data warehouses have been collecting for decades – raw material and output figures, customer satisfaction scores, employee churn per region, cost of sales... This data is abstracted, digested and projected in ways that probably have to be evolving rapidly as your business changes. But now there's new data to add to the mix: valuable demographic and other market information coming in over the internet.
Dave Suedkamp, head of everything for IBM's market research services, chips in: "Facebook, Twitter, news feeds and other social media, message boards, forums..." This new inundation from the Cloud can't be ignored by businesses trying to make a buck in the 21st century, he says. To make sense of the world you need to digest it all.
Ishan Sehgal tweets, in and out of his job as program director of software as a service for predictive analytics at SPSS, a company IBM acquired in 2009. "The amount of social data out there is increasing beyond measurement," he tells the web conference. "The overall amount of data currently stored in the world is estimated to exceed one zettabyte." Que? Count the number of grains of sand in a thousand worlds like ours, and there's your zettabyte.
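The grains-of-sand comparison is easy to sanity-check. A zettabyte is 10^21 bytes by the SI definition, and popular estimates put the number of sand grains on Earth somewhere around 10^18 to 10^19 — the exact figure below is an assumption, taken at the low end. On those numbers, one byte per grain really does need on the order of a thousand Earths:

```python
# Rough sanity check of the "thousand worlds of sand" claim.
# Assumption: ~10**18 grains of sand on Earth (low end of popular estimates).
ZETTABYTE = 10**21          # bytes in one zettabyte (SI definition)
GRAINS_ON_EARTH = 10**18    # assumed grains of sand on Earth

# How many Earths' worth of sand grains, at one byte per grain?
worlds_needed = ZETTABYTE // GRAINS_ON_EARTH
print(worlds_needed)        # -> 1000
```

Take the higher sand estimates and the figure drops to a hundred-odd worlds, but the order of magnitude of the claim holds up either way.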
If you're wondering about the weight of IBM input here, it's worth noting the Internet Evolution website is "sponsored" by the company. But there's not too much in the way of marketing hype in this particular conference. Inevitably a fog of abstract jargon hangs over the occasion, but there are some possible insights to be gained.
The nicest tie wins
> What do we really know about the provenance of this kind of data?
All this stuff tells us is that nothing much has changed.
We might have zettabytes of stuff flitting around in some cloud, somewhere - so what? There isn't enough time in the world (and certainly not before lunch) to analyse it all, so people fall back on the methods they've been using since the beginning of time (or at least 01-Jan-1970 00:00:00) and judge the person making the presentation and the credibility (read: prettiness) of the slides/powerpoint/OHPs/report/webcast.
People buy from people. Managers make decisions based on the credibility of the person presenting to them. So when all is said and done, forget the accuracy of the spreadsheet - nobody is in a position to question it, or understand it. Just make sure your shirt has been ironed.
You'll never kill off the spreadsheet
The whole drawback of them (any user can "improve") is the reason why they'll never disappear. Management like the fact they can change them all the way up until things go badly wrong. Given they are the decision makers as to what stays or goes, it'll be staying. Indeed the City does run off the back of Excel for quite a few things, with some organisations being more blasé than others in their usage. Believe me, I'm working at a joint with no online or real-time source-of-truth position-keeping system, but rather a series of Excel spreadsheets connected to an Oracle db via SSRS, with compliance performed on a T+2 basis using this same shitty infrastructure. Company concerned is only dealing with $50bn though, so no problems :-(
I'd also disagree that it is broken by design. It works as intended - like a cross between a scrapbook and a calculator. How it is being used is probably wrong.
Almost interesting, except that I didn't understand a word of it.
My local NHS PCT has a problem with spreadsheets. They use them inappropriately, and as a substitute for proper data analysis. I know that you can fool 98% of your users by shipping them vast amounts of unusable data, but I'm pretty sure that they do it simply because they don't know any better.
Anyway, I've written the code to do it properly, with not a spreadsheet in sight. I want to put it in "the cloud" somewhere. I'd appreciate some practical advice. I need lots of CPU cycles, but only occasionally, and more storage than the average web app. I need to keep a database of perhaps a gigabyte in memory without swapping. I need to do the sysadmin myself and I might need to scale up to a second server if anyone actually uses the code, without dicking about with all the DNS stuff myself. Now *that* would be a useful article.