Hub and spoke gives data analytics a new spin

Original URL: https://www.theregister.com/2011/11/28/data_analytics/

More tools in the warehouse

Posted in On-Prem, 28th November 2011 23:30 GMT

The idea behind data warehousing is simple: put historical summary data from back-end transaction processing systems in a machine designed to answer queries fast.

The online transaction processing (OLTP) systems run the business, while the data warehousing systems help managers understand the business and steer it better.

The idea of a central warehouse is, theoretically, a good one. Having a single data warehouse to gather up data and do queries also happens to be impossible.

For one thing, computing has grown more complex in the two decades since data warehousing and analytics were an expensive, exotic type of computing done by governments or financial services and retailing companies.

With commercialisation, the internet has moved on from online brochureware and product peddling to complex marketing and social media systems. This generates information that must be not only saved, but processed in ways that are not natural for the relational databases at the heart of most data warehousing systems or the OLTP systems from which they evolved.

Data in a ferment

Further complicating the situation are different types of structured and unstructured data that need to be correlated to make sure the right product gets pushed to the right person, plus the fact that the sheer amount of data being generated by business is growing like a batch of yeast.

And you have to process all of these different kinds of data faster and faster just to keep your competitive edge.

That is why the established database and data warehousing powerhouses have been snapping up business analytics companies, as well as creating appliances for specific analytical functions. Those appliances are not just a cheaper way to do data analytics, they are an admission that the enterprise data warehouse (EDW) concept as we know it is no longer sufficient.

"The idea of an enterprise data warehouse, in my view, still provides a lot of punch for a lot of companies," says Richard Winter, president of Winter Corp, a database consultancy based in Cambridge, Massachusetts.

"Not everyone is able to get there, but there is a very high payoff to getting as much of your data as possible on that EDW platform. As companies have an EDW, they also have other platforms as complements to a central store, now often on outboard data warehousing appliances or Hadoop clusters.

"This is an evolution beyond EDW to a kind of ecosystem. The real world is a mix of stuff from vendors and across vendors."

Just resting

There is no other option, considering the variety of the data that needs to be chewed on. That is why Teradata created its various appliances as analytic adjuncts to its EDW platform.

It is also why Teradata shelled out $263m in March to buy Aster Data Systems, creator of a hybrid row-and-column database called nCluster. This runs on parallel server clusters, like the Teradata database.

Aster Data also created SQL-MapReduce, which combines normal data warehousing of structured data and big data chewing techniques for unstructured data.

These two products are used to analyse clickstreams, social networking connections, recommendations and personalisation on web sites, as well as to monitor churn and detect fraud.
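To make the "big data chewing" side of this concrete, here is a minimal sketch of the general map/reduce pattern applied to a clickstream. It is plain Python, not Aster Data's SQL-MapReduce syntax, and the sample records and function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical clickstream records: (user_id, page) pairs made up for this sketch.
CLICKS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/product/42"),
    ("alice", "/checkout"),
    ("bob", "/product/7"),
]


def map_phase(records):
    """Emit a (user_id, 1) pair for every click."""
    for user_id, _page in records:
        yield user_id, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key to get the number of clicks per user."""
    return {key: sum(values) for key, values in groups.items()}


clicks_per_user = reduce_phase(shuffle(map_phase(CLICKS)))
print(clicks_per_user)
```

On a real cluster the map and reduce steps would run in parallel across many servers; here they run in one process purely to show the shape of the computation.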

"People are rising above the EDW," says Scott Gnau, president of the Teradata Labs division that designs the company's hardware and software.

"The EDW is not dead. It is growing faster than ever, but it is a part of a logical analytics setup."

Even Oracle, which has made a lot of noise over the past three years about how its Exadata parallel database clusters and Exadata storage servers (with their hybrid columnar database compression) are perfect for both OLTP and data warehousing workloads, has had to concede that not everything can be stored in the Oracle database and processed by Exadata machinery.

In September, Oracle co-founder and chief executive Larry Ellison gave a long lecture about how Oracle doesn't like the idea of storing anything outside the Oracle database, but has no choice.

Oracle speaks

"Oracle has always stored both structured and unstructured data. This is really nothing new. We are constantly adding features to our database to support the storage and searching of unstructured as well as structured data," Ellison said.

He took a jab at rival Hewlett-Packard's $10.3bn acquisition of Autonomy, which makes tools for tracking and sorting unstructured data such as emails and documents.

"We think we are much better off with a couple of smaller acquisitions and continuing to innovate in that area so that the unstructured data and the structured data both find their way into an Oracle database where it's secure, it's scalable, and it runs on Exadata," he said. "We really don't want to have two separate databases."

Well, that is all very interesting, but the Exa line of machines now has three databases: Oracle 11g with RAC clustering extensions for OLTP and data warehousing; Berkeley DB as the key/value store in Oracle's just-announced Big Data Appliance; and TimesTen in the Exalytics BI appliance.

Oracle also has control of the MySQL database but doesn't seem to know what to do with it.

Eyes on the prize

IBM has always had a plethora of systems and flat-file and relational databases for them, each combination aimed at a different type of customer or workload.

For generic data warehousing, there is the InfoSphere overlay for its DB2 databases for mainframes as well as for Unix, Linux, and Windows boxes.

The Smart Analytics Systems are a combination of the InfoSphere data warehousing code plus Cognos business intelligence software, available on x86, Power and mainframe iron.

Big Blue also has its own implementation of the Hadoop MapReduce algorithm and underlying Hadoop Distributed File System, called BigInsights.

And there is System S InfoSphere Streams for gathering up and managing real-time text, video and audio data in ways that can be used to help run a business.

But even with all of this, IBM needed some other capabilities. Seeing Netezza growing like crazy, the company figured it had better pony up a hefty $1.7bn to acquire the upstart data warehousing and analytics appliance maker in September last year before Hewlett-Packard, Dell, NEC or Oracle could get in.

The Netezza machines are based on IBM's own blade servers, but pair up a special field programmable gate array to do some of the data filtering on a heavily customised PostgreSQL database (most of it long since replaced by custom Netezza database code) to speed up queries by two or three orders of magnitude.

"We tend to be very good at the analytical work," says Phil Francisco, vice-president of product management for the Netezza unit, part of IBM's information management database division.

By this Francisco means doing hundreds of complex queries per second, not the thousands of queries per second an operational data warehouse might do on a much larger set of data. While that operational warehouse can handle lots of throughput, it can take hours or days to run complex queries, and that is not acceptable to a business trying to make an immediate decision.

IBM's theory about data warehousing and analytics involves more of a hub-and-spoke approach. The data warehouse used to be the data repository and the machinery that did the queries to help manage the business.

Now, a data warehouse is still a hub for the kinds of operational data that can be stored in a relational database, and many machines feed off it. But web log and social media data that companies want to correlate with operational data does not have to be as clean as what is put into this hub. Moreover, a conventional data warehouse is not designed to process this data anyway.

"Most people agree that they have reached this point where they can't be responsive to their business uses, and this goes back to the success of Netezza," says Nancy Kopp, program director for data warehousing and business analytics at IBM.

"It's all about accelerating analytics. The question of which databases to use and how they differ is very valid, but if you look at the bigger picture, what is important is how companies become more data-driven and how they build this so it is practical, flexible and more agile.

"If your goal is to accelerate analytics, it is harder to do that with a monolithic EDW. You really need to break these workloads down and put them where they are optimised to both accelerate delivery and to manage capital and administration costs."

The EDW is still the repository where you maintain operational data and where you control access to that data, says Kopp, but it becomes more of a data synchronisation hub.

IBM calls the evolving data warehousing architecture a logical data warehouse. Teradata calls it an ecosystem and Oracle just calls it Exa. But it all comes back to the same idea: picking the right tool for the right warehousing and analytics jobs and not trying to do everything from one machine.
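The right-tool-for-the-job idea can be sketched as a simple lookup that sends each kind of workload to a different spoke. This is an illustration only: the workload categories, platform labels and the fall-back to the central warehouse are assumptions made for the sketch, not any vendor's actual scheduler.

```python
# Hypothetical mapping of workload types to platforms, loosely following the
# hub-and-spoke description in the text. All names are illustrative.
HUB = "enterprise data warehouse"

ROUTES = {
    "operational_reporting": HUB,
    "complex_analytics": "analytics appliance",
    "unstructured_batch": "Hadoop cluster",
    "real_time_stream": "stream processing engine",
}


def dispatch(workload_type: str) -> str:
    """Return the platform for a workload type.

    Unknown workload types fall back to the hub; that default is an
    assumption of this sketch, not something the vendors describe.
    """
    return ROUTES.get(workload_type, HUB)


for job in ("complex_analytics", "unstructured_batch", "ad_hoc_query"):
    print(f"{job} -> {dispatch(job)}")
```

A real deployment would also have to move or synchronise data between the hub and the spokes, which is where most of the integration effort sits.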

Pick and choose

Compatibility across these various data warehousing and analytics appliances almost doesn't matter because no one is expecting the databases and data types used in business analytics to all be in the same format. This would be logically impossible because of all the unstructured data, and wouldn’t help answer questions better.

So companies are going to have to manage some tension in their data warehousing and business analytics infrastructure, according to Winter.

"To the extent that you can standardise on one vendor, you maximise the skillsets you need, and reduce the amount of integration you have to do and the amount of finger-pointing that goes on," he says.

"But another principle is the right tool for the right job, and that is always a powerful principle. So you might have Teradata for your EDW but choose Aster Data or Netezza or a Hadoop appliance for the analytics, depending on the application."

The issue facing customers, then, will be which tools can be integrated and managed easily.

Meanwhile Oracle, IBM and Teradata will have to get all of their various warehousing and analytics tools working together as seamlessly as possible, with a single set of management tools that allow jobs to be dispatched to the right boxes at the right time. ®