Feeds

EMC cranks Greenplum database to 4.2

Goosing Hadoop links and warehouse backups

5 things you didn’t know about cloud backup

Big data is not just a problem because it is big, but because it keeps swelling. That goes as much for traditional data warehouses as it does for more modern Hadoop MapReduce data munchers. And with the latest update of its eponymous database, the Greenplum division of IT conglomerate EMC has made some tweaks to its homegrown database to make wrestling with big data a bit easier.

The Greenplum Database is available in two forms, just like its predecessor. One runs on Greenplum's own hardware appliance (which is based on hardware from an unspecified server OEM partner), and the other is a software-only distribution that customers can run on any x86 server machine that supports Red Hat Enterprise Linux, Oracle Solaris, or Apple's OS X.

The Greenplum database is a parallelized and heavily customized version of the open source PostgreSQL database, and has been optimized for ad hoc queries instead of transaction processing. It is a massively parallel, shared-nothing database and has "polymorphic data storage" to allow database administrators to carve up a range of data in a database table and choice the row or column orientation a query should use as well as what storage, execution, or compression settings that should apply to this segment of data.

Like other data warehouse engines, the Greenplum Database is a heavy user of data compression to speed up queries and reduce disk storage capacity needs.

Greenplum's Hadoop distributions are similarly available on the same hardware appliances – with some tweaks – as well as a software-only product that can run on any Linux-based x86 server.

Back in December, Greenplum unveiled its long-range plan to mashup its data warehousing and Hadoop stacks to create a giant data muncher called the Unified Analytics Platform.

Greenplum Database 4.2

The building blocks of the Greenplum Database

With Greenplum Database 4.2, the EMC unit is doing a few different things. First, on as it promised back in December, Greenplum has tweaked its parallel data warehouse loading technology, called gNET, so it can import and export data in parallel from a warehouse to a Hadoop cluster.

Equally significant is that the gNET feature in the 4.2 release of the relational database actually allows for gNET to reach into the Hadoop cluster and query data right where it is sitting, using some of the Hadoop cluster resources instead of burdening the iron running the data warehouse.

"This used to be a read-only tool," explains Mike Maxey, senior director of product marketing at the Greenplum. "Now it leaves more data in Hadoop and does more processing inside Hadoop."

Greenplum Database 4.2 also includes a new management console called Command Center, which replaces an older tool called PerfMon that database admins have been using up until now. Maxey says Command Center, unlike PerfMon, is a Web-based tool and has more functions that database admins have been looking for, such as the ability to start, stop and initialize databases on the fly, recover and rebalance database mirrors, and search, prioritize, or cancel any query on the system.

Command Center is also able to reach out across the network into a Greenplum HD or MR Hadoop cluster and check the state of the cluster from inside this console. "Over time, Command Center will evolve to have broader and deeper coverage of the database and Hadoop platforms," says Maxey.

The initial release of Command Center is available initially with the Data Computing Appliance 1.2 system, and will eventually be available in the software-only distribution.

The 4.2 release of the database has the requisite performance tweaks, including dynamic partition elimination and query memory optimization. The database also has a new package manager that does automatic installation and updating of extensions to the database on a running system with multiple nodes and different features running hither and yon.

Finally, EMC has integrated its Data Domain Boost data de-duplication backup software with Greenplum Database 4.2. In benchmark tests, EMC was able to back up a 173TB data warehouse in under less than eight hours. This was achieved by spreading parts of the Data Domain de-duping operations over the data warehouse nodes in an appliance, thus parallelizing the massive job and making the backup run faster because the de-duping was faster.

At the Strata Conference today in Santa Clara, California, in addition to launching the new database release, Greenplum is also talking up its ability to run Greenplum MR Hadoop atop Cisco Systems' C-Series rack-based servers. El Reg already told you all about that two weeks ago. ®

Secure remote control for conventional and virtual desktops

More from The Register

next story
BBC: We're going to slip CODING into kids' TV
Pureed-carrot-in-ice cream C++ surprise
China: You, Microsoft. Office-Windows 'compatibility'. You have 20 days to explain
Told to cough up more details as antitrust probe goes deeper
Linux turns 23 and Linus Torvalds celebrates as only he can
No, not with swearing, but by controlling the release cycle
Scratched PC-dispatch patch patched, hatched in batch rematch
Windows security update fixed after triggering blue screens (and screams) of death
Windows 7 settles as Windows XP use finally starts to slip … a bit
And at the back of the field, Windows 8.1 is sprinting away from Windows 8
This is how I set about making a fortune with my own startup
Would you leave your well-paid job to chase your dream?
prev story

Whitepapers

Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Endpoint data privacy in the cloud is easier than you think
Innovations in encryption and storage resolve issues of data privacy and key requirements for companies to look for in a solution.
Why cloud backup?
Combining the latest advancements in disk-based backup with secure, integrated, cloud technologies offer organizations fast and assured recovery of their critical enterprise data.
Consolidation: The Foundation for IT Business Transformation
In this whitepaper learn how effective consolidation of IT and business resources can enable multiple, meaningful business benefits.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?