You're doing Hadoop and Spark wrong and they will probably fail
Developers want the new shiny, users forget integration and then along come the vendors ...
Your attempt at putting Hadoop or Spark to work probably won't work, and you'll be partly to blame for thinking they are magic.
That's the gist of a talk delivered by Gartner research director Nick Heudecker at the firm's Sydney Data & Analytics Summit 2017.
Heudecker opened with the grim prediction that 70 per cent of Hadoop deployments made this year will fail to deliver either the expected cost savings or hoped-for new revenue.
The lack of trained and experienced people will be to blame for those misses and will also mean some face-palm moments once Hadoop is up and running: the analyst said the first question he often hears from new users is how they can actually get data into and out of their shiny new Hadoop cluster. He also felt the need to advise attendees to sort out their data quality and security plans before starting an implementation, as retro-fitting them is common and ill-advised.
According to Heudecker, organisations get into Hadoop and Spark with inflated expectations about what they can do. Neither tool, he said, is a replacement for databases or existing analytics tools.
“One client calls me every seven months and says they are replacing their data warehouse with Hadoop, and I say I hope they have their CV ready,” Heudecker half-joked.
To succeed with either tool, learn what it's good at and give it a job your current analytics tools don't do well. But be stern with developers, too, as they are “always chasing the new shiny thing” with little regard for wider concerns. The bottom line is you may not need either Hadoop or Spark.
Hadoop, for example, is very good at doing extract, transform and load (ETL) operations at speed, but its SQL-handling features are less than stellar. It also chokes on machine learning and other advanced analytics tasks because it is storage-centric. That quality means it is expensive to implement on-premises, where you'll need to acquire memory, compute and storage together. In the cloud, by contrast, you can buy compute and storage separately and save some cash.
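To see why buying compute and storage together can hurt, here is a rough back-of-the-envelope sketch. It was not part of Heudecker's talk: the function name and every figure in it are hypothetical, chosen only to illustrate the point. It assumes an on-premises cluster built from identical nodes, each bundling a fixed amount of storage and a fixed number of cores, so the node count is set by whichever resource you need most of.

```python
import math


def coupled_cluster(storage_tb, cores, tb_per_node, cores_per_node):
    """Size a cluster of identical nodes that bundle storage and compute.

    All inputs are hypothetical planning figures, not vendor data.
    """
    nodes_for_storage = math.ceil(storage_tb / tb_per_node)
    nodes_for_compute = math.ceil(cores / cores_per_node)
    # Storage and compute arrive in the same box, so the larger
    # requirement dictates how many boxes you must buy.
    nodes = max(nodes_for_storage, nodes_for_compute)
    return {
        "nodes": nodes,
        "idle_cores": nodes * cores_per_node - cores,
        "idle_tb": nodes * tb_per_node - storage_tb,
    }


# Hypothetical storage-heavy workload: 500 TB of data, 64 cores of work,
# on nodes that each hold 10 TB and 16 cores.
plan = coupled_cluster(storage_tb=500, cores=64, tb_per_node=10, cores_per_node=16)
print(plan)
```

In this made-up case the storage requirement forces 50 nodes, leaving 736 cores paid for but idle. Where the two resources are bought separately, that surplus need not be purchased at all.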
Heudecker therefore believes the cloud is the natural place to run Hadoop, adding that AWS is probably the world's largest user of the tool by revenue and scale.
The same goes for Spark, which is designed for in-memory processing and therefore makes for pricey hardware. But it's also excellent for machine learning, a workload other analytics tools just weren't designed to handle.
Another factoid to consider is that Spark evolves quickly, with point releases arriving in as little as five weeks. Adopting it can therefore also mean performing frequent upgrades in order to stay secure. Hold your ground and update on your schedule, not your vendor's, Heudecker advises.
One trap for young players that he identified is letting vendors sell you the complete Hadoop or Spark stacks, which comprise multiple packages, not all of which are necessary for basic operations. Paying for just the bits you need is so advisable that leading distributions of both tools now include pared-back bundles.
There's another risk there, he said, because Red Hat remains the only pure-play open source business to have cracked the billion-dollar revenue mark. Volatility is therefore to be expected in the Hadoop and Spark caper.
But once you train your own people, find a worthy project, get on top of cloud vs. on-prem costs, master security and data quality, get your developers to be sensible and work out a sensible relationship with a stable vendor, you have a decent chance of succeeding.
Who likes those odds? Anyone? ®