Greenplum opens up Big Data control freak: Chorus for all of us
Ties up with Kaggle to head hunt algorithm geeks
Hadoop World As promised, the Greenplum Big Data subsidiary of IT conglomerate EMC is opening up the Chorus control freak that it created to span the Greenplum data warehousing database and its two implementations of the Hadoop Big Data muncher.
At the Hadoop World extravaganza in New York, Greenplum is taking the wraps off the OpenChorus project, which is open-sourcing the Chorus control freak as Chorus Community Edition. Greenplum had promised to open up the Chorus code when Chorus 2.0 was announced in March of this year. That was also when Greenplum acquired Pivotal Labs, a hot-shot mercenary coding outfit that Greenplum hired to help it port Chorus from Java to Ruby and get the project back on track after it was delayed. Greenplum liked the results so much that it bought the company for an undisclosed amount.
At the time, Greenplum did not divulge what licensing model it would use, but hinted that it would lean towards open licenses like Apache and away from more restrictive licenses like GPL. And, as it turns out, OpenChorus tapped the Apache 2.0 license for the freebie code. The open-source version is based on Chorus 2.1, and the OpenChorus project says that it is in the late stages of development for Chorus 2.2 at this time. The code is available on GitHub.
Greenplum is very honest about what it intends for OpenChorus and said back in the spring that it did not expect a lot of developers to step up and contribute, as happens with the underlying Hadoop project and related tools, for instance. Rather, OpenChorus is emulating Android, where one vendor, in this case Google, does most of the work and the open sourcing is about making companies comfortable investing in the technology, not about getting them to code. Nothing will prevent Greenplum's competitors in Hadoop – Hortonworks, Cloudera, Teradata, and IBM – from snagging the code and using it or elbowing their way into the project, of course.
Greenplum will obviously continue to distribute a supported version of the tool, now to be known as Chorus Enterprise Edition, according to Josh Klahr, vice president of product management at Greenplum. The Chorus Community Edition will be distributed freely, but it will not have either updating features or tech support.
In addition to opening up the Chorus tool, Greenplum announced a series of partnerships with Kaggle, GNIP, and Tableau, which all have niches in the Big Data space.
Kaggle hosts data science competitions where some 57,000 algo freaks compete to try to solve problems for money. (It turns your job into a game show of sorts, but the problems involve big data and it is definitely not like a steady job.) The Chorus 2.0 tool allows data warehouse and Hadoop admins to cordon off a chunk of a machine and sandbox it for algorithm writers to test their code against a subset of real data on real iron. In the long haul, Greenplum and Kaggle hope to integrate algorithm contests with Chorus so you can publish contests directly to Kaggle from the Chorus interface and dispatch work from data scientists who are tapped by Kaggle to run their algorithms. At the moment, the integration is a bit looser and more manual, allowing Chorus admins to package up the job around which they want to create a contest – the job description, the data types, and so on – and send invitations to Kaggle for people to take a whack at solving the problem.
Greenplum is also working with GNIP, which dices, slices, and packages the full-on Twitter feed and resells it, so customers who have a GNIP account can suck in JSON-formatted datasets, drop them into Hadoop, and automatically see them pop up inside of Chorus for use in data munching. Eventually GNIP will provide access to raw feeds from YouTube, Flickr, Facebook, Google+, Tumblr, WordPress, and other social media sites so you can get pre-chewed versions of their feeds for munching.
Greenplum is also integrating Chorus with the multidimensional data visualization tools from Tableau Software. With the links between the two programs, Chorus will be able to grab data from Hadoop file systems and Greenplum databases and spit it out into Tableau workbooks, and to tag and annotate Tableau assets as well.
The Chorus 2.3 update that features the Kaggle, GNIP, and Tableau integrations will be available in November. ®