EMC wants to be the Linux of big data

Opens up Chorus tool, borgs agile coders Pivotal Labs

To broaden its reach in the big-data arena, disk-array maker EMC's Greenplum division, which peddles data warehousing and Hadoop appliances and software, announced that it will open source its Chorus management and collaboration tools. EMC also has acquired Pivotal Labs, experts in agile programming, to help it build better big-data software and, equally importantly, help others do so.

EMC has always been serious about data, but in case you haven't noticed, the company is now very serious about big data and the software that is used to chew it up and regurgitate useful bits of information.

"Having database-kernel developers doing a UI was not working out really well," conceded Luke Lonergan, CTO at the Greenplum division, to El Reg in an interview after EMC made its announcements in a webcast presentation hosted in San Francisco and New York.

About a year ago, Greenplum hired Pivotal Labs, which was founded in 1989 and which has a couple hundred code-slingers that could teach the database programmers some new tricks. They got the Chorus product back on track, and then EMC pulled a Victor Kiam and liked the company so much it bought it today for an undisclosed sum.

Greenplum previewed the new Chorus 2.0 tool in December 2011 as a central feature of its Unified Analytics Platform. The idea is to take data warehouses running the Greenplum variant of PostgreSQL and Hadoop clusters running either Greenplum HD (the open source distro) or Greenplum MR (the open-core version from MapR Technologies that EMC resells) and mash them up and glue them together using the Chorus collaboration environment.

Gelsinger: Open source Chorus 'is a big step for us'

Chorus 2.0 has a Facebook-style collaboration interface to data sets and analytics tools so people can share data. It also has a full metadata search so researchers can do data exploration in either structured or unstructured data.

Equally importantly, Chorus 2.0 can spin up a sandbox inside a data warehouse or Hadoop cluster, or spin up a data mart inside of a VMware virtual machine, so different "data scientists" can chew on different parts of the data and not create physically separate data silos running on other machines.

The current Chorus 1.2 does not know how to talk to Hadoop, and it can't spin up a personal sandbox for an analyst. Chorus 2.0 will also have integrated data visualization tools to help analysts and other big-data users get a feel for the shape of the data so they know where they might need to drill down more to try to understand some aspect of their business better.

Chorus 2.0 has been in beta testing for the past four months, says Lonergan, and during a tour of the Pivotal Labs facility in San Francisco that was part of the webcast, one of the code-slingers said that the product was in release-candidate phase right now. Lonergan later confirmed to El Reg that Chorus 2.0 will ship on March 23.

During that tour of Pivotal Labs – the company also has offices in New York and had an office in Singapore for a while – it was shown how the company has teams of a dozen or so people coding away on projects with pairs of programmers coding together on parts of the code.

Musical chairs

Every day or so, the programmers play musical chairs, and over the course of a week or so, everyone has been teamed up with everyone else on that development team – the Chorus team, for example, has ten people on it.

The idea is that both coders in a pair do some programming, and no one programmer becomes a subject-matter expert on any piece of the code. Everyone gets to know all of the code this way – not by studying it, but by working on it.
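
For the curious, the musical-chairs rotation can be modelled as a round-robin schedule. The Python sketch below is an illustration only, not Pivotal Labs' actual process: it uses the standard circle method, and the function name `pair_rotation` and the `dev0` to `dev9` labels are made up for the example. For a ten-person team it yields nine rounds of five pairs, with every possible pairing occurring exactly once, which squares with a daily swap covering the whole team in a week or so.

```python
def pair_rotation(team):
    """Return a list of rounds; each round is a list of (a, b) pairs.

    Circle method: hold the first member in place and rotate everyone
    else by one seat per round. With an even team of n people this
    gives n - 1 rounds in which every pair works together exactly once.
    """
    members = list(team)
    if len(members) % 2:
        # Odd head count: whoever is paired with None works solo that round.
        members.append(None)
    n = len(members)
    rounds = []
    for _ in range(n - 1):
        # Pair seat i with the seat directly "across" from it.
        rounds.append([(members[i], members[n - 1 - i]) for i in range(n // 2)])
        # Rotate every seat except the first one.
        members = [members[0], members[-1]] + members[1:-1]
    return rounds


if __name__ == "__main__":
    # Hypothetical ten-person team, matching the size quoted for Chorus.
    schedule = pair_rotation([f"dev{i}" for i in range(10)])
    for day, pairs in enumerate(schedule, start=1):
        print(day, pairs)
```

A real team would also have to cope with holidays, meetings, and people who simply work well together, so treat the schedule as a starting point rather than a rule.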

Every time the code changes, a fresh build is run. If it fails any tests, it is immediately flagged as failing and everyone on the team can see the issue – there is tremendous peer pressure to get the code fixed. You make iterative changes in the code, and you fix things as you go along rather than waiting until the end of a protracted development process.

EMC did not disclose the price it paid to acquire Pivotal Labs, but said that the company would remain an independent unit, much as Greenplum, VMware, RSA Security, and others have been left reasonably untouched by the EMC mothership after being acquired.

Pivotal Labs is privately held and sells a tool called Pivotal Tracker, a scheduling system for agile programming that forces developers to break their work down into small chunks, called stories, that they work on in teams. There are 240,000 developers using the Pivotal Tracker tool today, and EMC said in a statement that it was committed to investing in this tool and letting Pivotal Labs do what it does.

Pivotal Labs is big on Ruby on Rails. In fact, according to Lonergan, it has been instrumental in getting Greenplum to port the Chorus tool from the Java back-end used with the 1.2 release to Ruby on Rails with the 2.0 release.

Scott Yara, senior vice president of products at the Greenplum unit, said that as Greenplum got exposed to the coders at Pivotal Labs and the new techniques, its own programmers started thinking outside of the box about Chorus, social media, open source, and what the product could be.

As far as bringing social media to the Chorus tool, which the company started mulling four years ago, before EMC even came a-calling, Yara said that this "seemed like a stretch."

But as time went by, "people kept pushing us," said Yara, and they started thinking about the big platforms that have established themselves in the past couple of years – Linux, Java, Hadoop, and Android, just to name a few – and they all have one thing in common: they are open source. And thus the idea was born to take the Chorus tool open source and position it as a platform for integrating big-data applications.

"This is a big step for EMC," explained Pat Gelsinger, president and COO of EMC's Information Infrastructure Products group, which includes Greenplum and a bunch of other products. "We've helped open source, but we have never been open source."

EMC did not provide a lot of details about the OpenChorus project, but the company said that it planned to have the code open sometime in the second half of this year.

Unlike Hadoop and other big-data projects, where the open sourcing was done to solicit help with actually completing the code and ruggedizing it for commercial use, EMC said that it was taking the Java and Android models, where the development work would be done largely by the sponsoring company.

The opening up of the Chorus source code is about making companies comfortable in investing in Chorus – they know it can survive any vendor – and getting developers to code applications that work through it and bring extensions to the tool itself. EMC is not looking for help on coding Chorus per se, but it sounds like it could have used some.

Lonergan would not reveal whether EMC has decided which license the Chorus tool will be distributed under, but he hinted that the kind of "open" licenses used by Apache projects were appealing and the more restrictive GNU General Public License was not. "Our objective is to have a license that makes this partner-friendly and community-building," Lonergan said.

It will be interesting to see how other big-data players – IBM, Oracle, Teradata, and a slew of other smaller players such as Cloudera, Hortonworks, and so on – will participate in the OpenChorus community and link their products into the tools. Maybe they will play, and maybe they won't. ®
