Big data is not just for the big boys
Analytics goes downmarket
It may take a while but eventually any good technology embraced by large enterprises trickles its way down to small and mid-sized businesses in some appropriately modified and re-priced form.
It will be no different for modern business analytics tools. The time could be ripe for mid-range customers to start thinking about modernising their data warehouses or data marts, if they are lucky enough to have any, or coming up with a plan to install a business analytics platform if they don't.
One of the reasons for the success of Microsoft's SQL Server relational database a decade ago is that many of the customers buying the database – as much as a third of all sales by some estimates – wanted the relatively inexpensive SQL Server to set up an online analytical processing (OLAP) server.
That initial OLAP server bundled in SQL Server 2005 opened up a new world of business intelligence.
Fast work
Today’s tools are not only much more sophisticated but are affordably priced for mid-market customers. The level of performance they offer gives these smaller companies what they need to compete in the global marketplace.
Everybody is talking about big data these days, but the term is really a misnomer. Fast data is probably a better term. Companies of all sizes are wrestling with making sense of diverse structured, semi-structured and unstructured data sets to help them make quick decisions.
Dell, which does not usually get into markets if it doesn't think it can make a good profit, particularly from the small and medium businesses that it still peddles a lot of its gear to, is cooking up the Quickstart Data Warehouse Appliance. It is based on Dell’s new PowerEdge 12G servers and Microsoft's SQL Server 2012.
Dell says this will be the first data warehouse appliance out the door running SQL Server 2012, code-named Denali. It will also depend on Dell's Boomi service for integrating transactional systems and other data sources into Quickstart.
Not much else is known about this appliance, except that it is in beta testing and is due to be launched in the second quarter of this year.
Place your bets
Meanwhile IBM is betting heavily on business analytics as a key driver of revenue over the next five years. The company is beginning to see some traction with various products in the mid range, according to Nancy Kopp, programme director for data warehousing and business analytics at IBM.
Depending on the type of data and analytical applications that hit against it, mid-market customers tend to go with one of two IBM machines right now, in the wake of Big Blue's $1.7bn acquisition of Netezza in September 2010.
In July 2009 IBM launched its Smart Analytics Systems, which are clusters of Power or x86 server nodes equipped with operating systems, IBM's General Parallel File System and Tivoli System Automation to manage each node.
Some of the nodes ran Cognos modules, including BI Server, Go Dashboard and BI Samples, and others ran IBM's InfoSphere Warehouse variant of its DB2 database, merging data warehousing and analytics all in one cluster.
IBM gradually fleshed out the boxes and even created an entry machine called the Smart Analytics System 5710, which pairs up an IBM System x server with a DS3500 array and the Cognos and InfoSphere software stack, all for a $50,000 price tag and configured as an appliance for companies to dump data into and chew on it.
"The more bundling and integration you do, the more favours you are doing for the mid-market," says Kopp.
Some mid-range companies have quite large data munching jobs, and for these customers IBM has created the Smart Analytics System 7700. This uses IBM's Power 740 servers, based on the Power7 Risc processors and similar to the nodes used in IBM’s Watson machine, which competed in the TV quiz show Jeopardy! and won.
The server is configured with IBM's AIX Unix variant and InfoSphere Warehouse Enterprise Edition data warehouse plus Cognos business analytics tools for drilling into the data warehouse and extracting reports. The Smartie 7700 uses DS3500 storage arrays to house data.

IBM's Smart Analytics System 7710
There is a variant of this machine called the 7710, designed for data warehouses that are under 10TB in size, which would be particularly useful for mid-range shops. This pairs one Power 740 with three DS3500 arrays and the same InfoSphere and Cognos software stack.
Serious shopping
IBM has not yet bundled SPSS’s predictive analytics tools with the Smart Analytics Systems. These are obvious add-ons and explain why Big Blue paid $1.2bn back in July 2009 to acquire that business intelligence software firm.
And of course, IBM Power Systems shops that prefer the IBM i operating system can get the combination of the DB2 for i database and the DB2 Web Query tool, developed in conjunction with Information Builders, to build data warehouses, execute ad hoc queries and generate reports.
In Europe, a fairly large company might only need an analytics system that would qualify as a mid-range box in the US.
That is why Netezza created a cut-down version of its data warehousing appliance, called Skimmer and sold as the Netezza 100 series.
All Netezza data warehouses are based on IBM's BladeCenter x86 blade servers, but they are goosed for data warehousing and analytics by a special field programmable gate array co-processor.
Netezza created this to speed up the heavily modified PostgreSQL database that runs on top of the iron. (Netezza chose IBM iron long before it was bought by Big Blue.)
The Skimmer machine hit a $125,000 price point for 10TB of user data capacity, which was a bit more than the Smartie 5710 box but considerably less expensive than an entry Netezza 1000 appliance. This has more processing oomph and would cost about $200,000 for a similar configuration.
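To put those list prices side by side, here is a rough back-of-the-envelope sketch. It assumes that "a similar configuration" for the Netezza 1000 means the same 10TB of user data capacity, which the figures above do not state outright; the function and variable names are illustrative only.

```python
# List prices quoted in the article, in US dollars.
SKIMMER_PRICE = 125_000
NETEZZA_1000_PRICE = 200_000

# Skimmer's quoted user data capacity. ASSUMPTION: the Netezza 1000
# "similar configuration" is taken to hold the same 10TB.
CAPACITY_TB = 10


def price_per_tb(price_usd: float, capacity_tb: float) -> float:
    """Return the list price per terabyte of user data."""
    if capacity_tb <= 0:
        raise ValueError("capacity_tb must be positive")
    return price_usd / capacity_tb


skimmer_per_tb = price_per_tb(SKIMMER_PRICE, CAPACITY_TB)
netezza_1000_per_tb = price_per_tb(NETEZZA_1000_PRICE, CAPACITY_TB)

# Fractional premium of the entry Netezza 1000 over Skimmer.
premium = NETEZZA_1000_PRICE / SKIMMER_PRICE - 1

print(f"Skimmer:      ${skimmer_per_tb:,.0f} per TB")
print(f"Netezza 1000: ${netezza_1000_per_tb:,.0f} per TB")
print(f"Premium:      {premium:.0%}")
```

On those assumptions Skimmer works out at $12,500 per terabyte against $20,000 for the bigger box, a premium of 60 per cent for the extra processing oomph.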
There is a possible cloudy future to analytics in the mid range, and IBM could be pointing the way. In February, the vendor completed its $440m cash acquisition of retail analytics software provider DemandTec.
Here's the interesting bit: DemandTec offered its software on private slices of its own cloud, which was backed by Netezza iron, as well as allowing customers who wanted their iron and software on premise, and had the cash to pay for such a luxury, to bring it inside the corporate firewall.
Google has the answer
But mid-market companies that want to do sophisticated analytics may not want to own the iron so much as run the algorithms against their data.
That is certainly what Google thinks will happen for many customers, which is why it has launched BigQuery. It is in beta testing now and available on an invitation-only basis.
Google says the BigQuery engine will be able to scan billions of rows of data in seconds and scale across terabytes of data and trillions of records – and use an SQL-like query language to kick off the data munching.
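For readers who have not met this style of query, the sketch below shows the kind of SQL aggregate an analyst might fire at such a service. It is not BigQuery code: it uses Python's built-in sqlite3 module as a local stand-in, and the `sales` table, its columns and its rows are invented purely for illustration.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for a hosted query engine.
conn = sqlite3.connect(":memory:")

# Hypothetical table and data, made up for this example.
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("EMEA", 120.0), ("EMEA", 80.0), ("APAC", 50.0)],
)

# A typical analytics query: count orders and total revenue per region.
query = """
    SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM sales
    GROUP BY region
    ORDER BY region
"""
rows = conn.execute(query).fetchall()

for region, orders, revenue in rows:
    print(f"{region}: {orders} orders, {revenue:.2f} revenue")

conn.close()
```

The pitch of a hosted service is that the same sort of statement would run unchanged in spirit against billions of rows, with the provider rather than the customer worrying about the iron underneath.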
And if companies don't want to get their hands dirty sorting out BigQuery, then there will be service providers that sit in front of them, masking the complexity.
Bime me up
We Are Cloud, a startup founded in southern France by Rachel Delacour and Nicolas Raspal, has created a front-end for BigQuery called Bime (pronounced "beam"), a business analytics tool that runs on the Amazon Web Services compute cloud and stores data in Google BigQuery.
The Bime service comes in workgroup, enterprise and premium editions, costing a mere $60, $120, or $240 per month, with ten users and varying features on dashboards, connectors, storage and dataset row counts.
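As a quick sanity check on those prices, the sketch below works out the cost per seat and per year. It assumes all ten included users are actually in use and that the monthly rate holds for a full year with no discounts, neither of which the pricing above confirms.

```python
# Monthly edition prices quoted in the article, in US dollars.
MONTHLY_PRICE_USD = {"workgroup": 60, "enterprise": 120, "premium": 240}

# Each edition is quoted with ten users. ASSUMPTION: all ten seats are used.
USERS_PER_PLAN = 10

# Cost per user per month for each edition.
per_user = {
    edition: price / USERS_PER_PLAN
    for edition, price in MONTHLY_PRICE_USD.items()
}

# Cost per year for each edition. ASSUMPTION: twelve months at the
# monthly rate, with no annual discount.
annual = {edition: price * 12 for edition, price in MONTHLY_PRICE_USD.items()}

for edition in MONTHLY_PRICE_USD:
    print(f"{edition}: ${per_user[edition]:.2f} per user per month, "
          f"${annual[edition]:,} per year")
```

Even the premium edition comes to $2,880 a year on these assumptions, a small fraction of the $50,000 entry price quoted earlier for an appliance.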
The company has 200 customers, most of them outside France, and the service is available in French, English, Dutch and Chinese, with other languages in the works. It is designed for sharing data and query results through dashboards and other graphical representations.
"Traditional on-premise business intelligence tools are not inherently collaborative or cost effective," Delacour said, introducing the Bime front-end at the recent Structure Data 2012 conference in New York.
"Cloud solutions are, even though they are not necessarily good at delivering performance on all data sets."
It probably beats trying to do business intelligence in Excel, which is what most mid-market customers are still trying to do. ®
COMMENTS
Big data...
...is apparently defined by volume (how much), variety (what types) and velocity (how fast), or some combination of all three.
The term is in vogue due to the likes of Google, Yahoo and Facebook introducing the world to new analytic paradigms based on the MapReduce framework, open source software (Linux, Hadoop etc), commodity hardware and the notion of 'noSQL'...and also because the IT industry needs new buzzwords du jour. At the moment it's the turn of 'big data' and 'cloud'.
In theory, 'big data' as done by the likes of Google is all about unstructured data. In reality, there's a lot of structured data still out there, and I'd argue that all data has some structure anyway, so 'semi-structured' may be a better term.
Ebay has a multi-petabyte 256 node Teradata system chock full of structured data, in addition to the large Hadoop stack for web analytics, so there's clearly life in the old structured dog yet.
There's nothing new in 'doing analytics' - a lot of companies have regarded analytics as a competitive differentiator for a long time. There are companies out there, even in the lil' ol' UK, that have been using Teradata, which only does analytics, since the 1980's. I started my career at one of them.
For the typical mid-market company, if there is such a thing, all we ever tend to see is SQL Server on top of SAN/NAS. It's cheap, feature-rich, easy to tame and works OK until data volumes increase beyond a few hundred GB or so. The pain threshold is obviously dependent on the hardware, DBA/developer skill, schema and application complexity.
All SMP based databases suffer the same scaling issues, hence Microsoft's attempt to build an MPP version of SQL Server (Madison/PDW), Oracle's Exadata and HP’s NeoView.
IBM in the BI mid-market is not something we see very often. Netezza Skimmer has never been sold as a production system before, as far as I know. IBM's own web site describes it as for 'test and development'. A proprietary IBM blade based system running Postgres on Linux is hardly a good fit for the Windows/SQL Server/SAN/NAS/COTS hardware crowd.
Having said that, we did deploy a pre-IBM Netezza system as far back as 2003 for a small telco with only 100,000 customers, but they did have several billion rows of data and complex queries to support.
@Wonderbar1 - Teradata's competitive advantage consists of several capabilities...performance, scalability, resilience, functionality, maturity, support, 3rd party tool integration (e.g. in-database SAS), ease of use, applications and data models to name a few. It’s a true ‘full service’ offering.
Teradata is the only database built from day 1 (in the 1980's) to support parallel query execution using an MPP architecture across an arbitrary number of SMP nodes all acting in tandem as a single coherent system. That is very, very hard to do - ask Microsoft, Oracle, HP or IBM.
Overall, Teradata 'just works'. All those big name users can't be wrong.
The Teradata secret sauce for me is the scalable 'bynet' inter-node interconnect. This is used for data shipping between SMP nodes in support of join/aggregation/sort processing. The bynet is scalable and resilient and 'just works'. It also performs merge processing for final results preparation.
Other MPP systems typically have a non-scalable interconnect bandwidth consisting of a dumb bit-pipe. Even worse, those that ship intermediate results to a single node for final aggregation/sort/merge processing can hardly claim to be linearly scalable. Some Exadata clusters run tens of TBs of RAM on the master node to address this issue.
Teradata's bynet has processing capability that enables final merge operations to be executed in parallel in the bynet interconnect fabric without landing intermediate results in any single place for collation. Cool eh?
See here for more info: http://it.toolbox.com/wiki/index.php/BYNET
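A toy illustration of the merge idea, for anyone who wants to see it in code. This is only a sketch of the general technique of streaming together results that each node has already sorted, rather than dumping everything on one node and sorting again. It says nothing about how the bynet itself is built, and the node data and function name are made up.

```python
import heapq
from typing import Iterable, Iterator, List


def merge_node_results(node_results: Iterable[List[int]]) -> Iterator[int]:
    """Lazily merge per-node result lists that are each already sorted.

    heapq.merge yields the next smallest value across all inputs without
    first collecting and re-sorting the full result set.
    """
    return heapq.merge(*node_results)


# Hypothetical sorted partial results from three nodes.
node_results = [
    [1, 4, 9],
    [2, 3, 10],
    [5, 6, 7],
]

merged = list(merge_node_results(node_results))
print(merged)

# The naive alternative: ship everything to one place, then sort it all.
collated = sorted(value for node in node_results for value in node)
```

Both routes give the same answer; the difference is that the merge only ever looks at the head of each node's stream, whereas the naive route needs the whole result set in one place first.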
Teradata consists of OEMd Dell servers running SUSE Linux and dedicated storage from LSI or EMC. Teradata was historically regarded, quite rightly, as 'reassuringly expensive', but the launch of the new line of Teradata 'appliances' a few years ago has made Teradata price-competitive with the likes of Netezza, thus eroding Netezza's disruptive pricing model. Competition is a healthy thing etc.
Appliance adoption has been a key feature of Teradata's strong performance over the last few years, as reported several times on El Reg.
Have you ever run an Oracle query across a 20 node system running hundreds of virtual processors all working together? I did a few minutes ago - a 250m row count(*) in under 1 second with no caching, no metadata, no indexes, no tuning, no partitions and no concern for what else is running.
I can't remember when I last submitted a query to Teradata that either didn't finish or caused the system to barf. That happens a lot on Oracle/SQL Server.
The last project I worked on was a 20TB Teradata system that supports a very wide range of applications, including real-time loading of web data and several tables of over a billion rows. Total downtime for the year, including planned maintenance, is measured in single hours.
“But I could do all that with X, Y and Z”, we often hear. Off you go then. If you can get it to work, and that’s a big ‘if’, your boss won’t bet the farm on it. That’s another reason the likes of Teradata win business – it’s a safe bet for the decision makers.
Back to work…
Uhm.
Tim can wax a bit lengthy on what essentially are collections of sales pitches, and at the end I always wonder whether I've wasted my time or not. Unless I've stopped reading halfway through. But I digress. This time, there's a notion that big companies in yurp only count as midsize companies in the yoosah, data wise. 'merkins just munch that much more data? What gives?
The minor lightbulb moment was that the BI crowd tends to stick to spreadsheets, which merely means they haven't brushed up on basic IT skills, opening up a lucrative market for data appliances. But what if you did do your homework and asked the IT dept for an OLAP instance with enough backing store for all that data? Suddenly you can connect with data viewers, visualisers, even statistical and mathematical packages and whatnot else.
Basically FOSS versions of everything can be found around the 'net, too, so all you need then is the hardware, and you let the IT dept handle that. What's wrong with this picture? Is business in general that dysfunctional? Rather give someone else big sacks of dosh for a magic solve-all box than do your own homework? Doesn't sound very intelligent to me. What is the point?
Teradata
Why do people buy Teradata these days? Not slamming Teradata, but a serious question. It seems that IBM and, if you like lock-in, Oracle have mirrored their OLAP appliance and then some. They used to be the only game in town, but I am not sure why you would go with their costly and proprietary appliances today. What is their secret sauce?
