Big Data tools cost too much, do too little
SHOCKING REVELATION: Fashionable technology is high maintenance
Strata 2013 Hadoop and NoSQL are the technologies of choice among the web cognoscenti, but one developer and technical author says they are being adopted too enthusiastically by some companies when good ol' SQL approaches could work just as well.
Ever since a team at Yahoo! did their turn at being Prometheus and brought Google-magic down to the rest of us via Hadoop, companies have been on a tear to put the technology into play. But the costs are high, the effort is great, and the advantage it grants you can be slight, Tim O'Brien said in a packed session at the O'Reilly Strata conference in Santa Clara on Wednesday.
"There is a feeling afoot that some of the technologies we've been talking about at a conference like this end up having a huge price tag," he said.
Citing huge human costs (you need to hire expensive in-demand people who know how to use Hadoop), pricey implementation (migrate your data into NoSQL or HDFS without it going wonky) and the possibility of unanticipated problems (you may not fully understand what you are using), O'Brien poured water on the fiery enthusiasm with which it's been adopted by the tech world and its dog.
Big data is a necessity at scale: if you're trying to listen to every transatlantic phone call, you need to use MapReduce. ... If you need to search the entire internet in milliseconds, you need to use MapReduce; if you need to run the largest social network in the world, you need to use MapReduce. If you don't, you can probably scale with a database.
The way companies have adopted the gamut of "big data" technologies, ranging from MongoDB to Hadoop or Impala, means that their own stacks have become difficult to maintain and hard to understand, O'Brien said. "The things I'm being asked to support in production. ... I couldn't even tell you how many databases they use."
For a few large-scale companies, "big data" products are a necessity. For others, they could be useful tools, but for some adopters, the use of these technologies could be "pushing solutions on problems where they may not be appropriate," he said.
If you've got 10TB or less of data upon which you want to run analyses, then you can still get by on Postgres or some other typical system, he said. But if you're expecting to be logging a PB of data then you need to make your way to Hadoop or something else soon. "Don't wait," he said.
Eighty per cent of the market is driven by the tip of the tech pyramid, O'Brien said. "I'm not trying to say a [Hadoop-using] startup out there is doing it wrong, but I have worked on projects where I wish they'd use MySQL because they've only had a gigabyte of data."
Even Google, the progenitor of all of this technology via the vaunted BigTable and GFS academic papers, has itself moved away from the techniques popularized by the NoSQL and Hadoop communities with its recent "Spanner" database.
Spanner looks much more like a relational, SQL-style database than anything else, and where Google goes the world follows. This is already happening: other companies, such as TransLattice, are re-implementing Spanner's structure and getting much interest because of it.
Perhaps NoSQL and Hadoop have led some companies down a blind alley? The Register's database desk had many conversations at Strata on Wednesday during which companies bemoaned the diversity of the "big data" ecosystem and wished for consolidation to make life easier for end-users.
Companies and technologies have proliferated, as have marketing budgets, and perhaps, as O'Brien's talk outlines, this has gone too far and bitten some novice adopters. These technologies may be big, but they're only as clever as the company using them. ®
Still life in the old dog yet
We've a large data mart running on SQL Server 2012 and I have to say those new xVelocity ColumnStore indexes are blisteringly fast.
Took 3 minute reports and turned them into sub-second responses.
Still got a few people who ask why we've not gone NoSQL yet. These are, of course, the same people who don't have a clue...
Very insightful article, Jack. One other open source technology to mention is HPCC Systems from LexisNexis, a data-intensive supercomputing platform for processing and solving big data analytical problems. Its open source Machine Learning Library and matrix processing algorithms assist data scientists and developers with business intelligence and predictive analytics. Its integration with Hadoop, R and Pentaho extends its capabilities further, providing a complete solution for data ingestion, processing and delivery. In fact, a webhdfs implementation (the web-based API provided by Hadoop) was recently released.
More at http://hpccsystems.com/h2h
Good perspective.. but...
Cannot agree with you more. At the same time, cannot disagree with you more either. I am a SQL fanatic and love RDBMS. But I have to agree that NoSQL brought in a plethora of database "types" like key-value stores, graph databases, document databases and XML databases. It was a paradigm shift from modeling on data persistence to modeling on data usage. NoSQL databases have very little to do with size of storage and more to do with applications, and even more to do with talent within the organization, openness to a better programming paradigm and, of course, performance requirements.
Having said that, I totally agree that going to Hadoop for a few terabytes of data is overkill. More than the hardware, it's the problem of expecting your existing data analysts to start thinking map-reduce and ending up losing the in-house talent in the process.
And, finally, if you are definitely going big data, I would ask the data analysts/scientists/DBAs who will code on the platform, and eventually maintain it and derive business value out of it, to spend some time on HPCC Systems and the ECL programming language before making a decision. I tried it and I love it, and I find it a good entry to big data with the least change to your mental make-up. I feel it augments your SQL skills instead of killing them.