Machine learning climbs atop Hadoop

Pattern hoists machine-learning models onto HDFS

Hadoop whisperer Concurrent has released a free tool for porting machine-learning models over to Hadoop.

The Pattern tool lets you run machine-learning models on top of the Hadoop compute and storage framework via either exported Predictive Model Markup Language (PMML) files or a Pattern Java API.

Designing machine-learning models requires a precise set of skills. The technology can bring great efficiencies by automating tasks such as scoring query results by relevance, but machine-learning experts – a subcategory of the data scientist breed of tech bod – are rarely also familiar with the vagaries of MapReduce jobs.

Rather, many data scientists work within the confines of mathematical or machine-learning programs such as R or MicroStrategy – and it can be a tall order for these people to learn HDFS and MapReduce well enough to re-implement their algorithms on large HDFS-stored datasets.

With Pattern, Concurrent has created a free technology that can take machine-learning models exported into PMML files and run them atop Hadoop. "You should be able to export from your favorite tools your PMML docs and get into production at least at scale," Concurrent founder Chris Wensel says. "The goal with Pattern is to be able to apply a [machine-learning] scoring model and run it at scale."
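For readers who have never opened a PMML file, the sketch below gives a rough idea of what "applying a scoring model" involves. It is not Pattern's API and makes no claim about how Concurrent implements scoring: the PMML fragment is hand-written and heavily trimmed (a real export would also carry a header, data dictionary and mining schema), and the field names `clicks` and `dwell_seconds`, the coefficients, and the helper functions are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed PMML fragment describing a linear regression.
# Field names and coefficients are made up for this example.
PMML_DOC = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="clicks" coefficient="0.2"/>
      <NumericPredictor name="dwell_seconds" coefficient="0.01"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"pmml": "http://www.dmg.org/PMML-4_1"}


def load_linear_model(pmml_text):
    """Read the intercept and per-field coefficients out of the PMML text."""
    root = ET.fromstring(pmml_text)
    table = root.find(".//pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(model, record):
    """Apply the linear model to one record (a dict of field name -> value)."""
    intercept, coefficients = model
    return intercept + sum(
        coefficient * record[name] for name, coefficient in coefficients.items()
    )


if __name__ == "__main__":
    model = load_linear_model(PMML_DOC)
    print(score(model, {"clicks": 10, "dwell_seconds": 100}))
```

The point of the format is that the model travels as data: whatever tool trained it, the scoring side needs only to read the coefficients and apply them to each record.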

Pattern is the third prong in Concurrent's pitchfork for getting useful data in and out of Hadoop without having to learn the vagaries of the application. It sits alongside the company's Java API for Hadoop and its Lingual add-on for making SQL queries on Hadoop easy.

The tool is designed for data scientists who are unfamiliar with Hadoop but want to use the technology to run machine-learning models against large pools of data. It works with any program capable of exporting a model as a PMML file – R, MicroStrategy, SAS, and so on.

"We've used the Cascading APIs and implemented the scoring aspect of these models against the cascading APIs," Wensel says. "It'll generalize itself thanks to the facilities Hadoop provides. If you export the model from R into PMML and run [it] across Hadoop, it'll parallelize itself appropriately."

Pattern is part of Concurrent's overall strategy of shifting Cascading into an all-purpose translation layer for people who want to access the inherent scalability of Hadoop without having to invest time in learning its peculiarities.
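Wensel's remark that an exported model will "parallelize itself appropriately" is easier to picture once you notice that scoring one record does not depend on any other record. The sketch below illustrates that property only: threads stand in for Hadoop map tasks, and the model, field names and helper functions are hypothetical rather than anything taken from Cascading or Pattern.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical linear model; in practice these values would come from a PMML file.
INTERCEPT = 1.5
COEFFICIENTS = {"clicks": 0.2, "dwell_seconds": 0.01}


def score(record):
    """Score a single record; the result depends on that record alone."""
    return INTERCEPT + sum(
        coefficient * record[name] for name, coefficient in COEFFICIENTS.items()
    )


def partition(records, parts):
    """Split the records into `parts` slices, much as input splits divide a dataset."""
    return [records[i::parts] for i in range(parts)]


def score_partition(records):
    """Score every record in one slice; this is the unit of work for one worker."""
    return [(record["id"], score(record)) for record in records]


def score_in_parallel(records, workers=3):
    """Score each slice on its own worker thread and merge the results by record id."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(score_partition, partition(records, workers))
        return dict(pair for part in results for pair in part)


if __name__ == "__main__":
    data = [{"id": i, "clicks": i, "dwell_seconds": 10 * i} for i in range(10)]
    serial = {record["id"]: score(record) for record in data}
    print(score_in_parallel(data) == serial)
```

Because no worker needs to see another worker's records, the partitioned run produces the same scores as a serial pass, which is what makes this kind of job a natural fit for a map-style framework.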

Its closest counterpart would be the open-source Apache Mahout project. However, Mahout is more a selection of HDFS-compatible machine-learning algorithms than anything else, so it lacks the flexibility and tooling of software like R.

"Mahout is a set of standalone and independent applications that have to be orchestrated with other applications to do their job, each using different file formats," Wensel says. "This is fundamentally very brittle and adds lots of latency to the applications."

The company expects existing Cascading users such as Airbnb to start experimenting with the Pattern tool imminently. It is already in use by AgileOne.

Over time, Concurrent hopes to build an ecosystem of complementary tools for Hadoop around the Cascading data analysis software. This announcement comes after the company took $4m from VCs to give it time to follow through on Wensel's ambition to "build a sustainable business around Cascading." ®
