Evergrid ships 'little blue pill' for clusters

Don't let CD get you down

SC06 A small software start-up thinks it might have "the little blue pill" necessary to keep massive clusters up and running at all times.

Evergrid this week unveiled something called the Availability Management Suite, but you might as well call it cluster Viagra. The company’s software works to make sure that minor – or major – system failures don't hamper the overall completion of large computing jobs. In total, Evergrid helps you keep your cluster up, sturdy and strong for hours and even days at a time.

Okay, we'll stop the bad "jokes" for a moment and get to the point.

Evergrid is working off the premise that the high performance computing industry lacks the proper tools for restarting jobs on large clusters. Systems made up of thousands of boxes will have failures – plenty of them – and need a way to get jobs restarted automatically.

Today, administrators must keep a close eye on their clusters, break jobs down into different chunks or try running jobs on smaller systems in order to avoid the pains of system failures. Evergrid tries to sidestep all this mess by running continuous "checkpoints" that capture the state of servers and their applications. If a failure occurs, the system can roll back to the last known state and then get cranking away on jobs once again.
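For readers who want the idea in concrete terms, here is a minimal sketch of checkpoint-and-rollback in Python. It is not Evergrid's code: the company says its layer captures state underneath unmodified applications, whereas this toy job saves its own state. The names save_checkpoint, load_checkpoint and run_job, and the fail_at parameter used to simulate a crash, are all made up for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint.

    The rename means a crash mid-write never leaves a half-written checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as handle:
        return pickle.load(handle)


def run_job(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers below total_steps, checkpointing at fixed intervals.

    fail_at is a hypothetical knob that simulates a node failure at that step.
    """
    # Resume from the last known state if there is one, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If the job dies between checkpoints, only the work done since the last checkpoint is lost; calling run_job again with the same path picks up from the saved step rather than from scratch. Checkpointing more often shrinks that loss but costs more overhead.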

Some companies and labs have developed their own checkpointing systems over the years, but Evergrid believes that the industry is begging for a standard here.

Thankfully, customers will not need to rewrite any of their own software to make Evergrid's code work. The Evergrid "abstraction layer" slides in between an OS kernel and its applications. The checkpoints can then gather information on the state of memory, file I/O and the network at desired intervals.

Evergrid CEO Dave Anderson, speaking here at the Supercomputing conference, told us that the monitoring software could chew through as much as 5 per cent of a system's resources. He pitched that, however, as an Armageddon-type scenario and claimed that more often than not administrators will not notice the Evergrid code.

Come January, Evergrid plans to GA its Availability Services software and Resource Manager. Together, these products form the "flagship" Availability Management Suite.

The Availability package performs the checkpoint operations, while the Resource Manager handles a broad array of tasks such as making sure certain jobs get a set amount of processing power and the right priority level.

We caught a demo of the Suite, and it worked as billed. An Evergrid administrator had three jobs spread across an eight-server cluster. The software detected idle systems, threw them at jobs as needed and caught a failure when we ripped out one server. It stopped the one job that was affected and then got it up and running again in a couple of seconds on the available hardware.
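The behaviour in the demo can be sketched roughly as follows. This is a toy model under our own assumptions, not Evergrid's Resource Manager: the Cluster class, its start and fail methods and the node names are hypothetical, and a real system would also restore the job from its last checkpoint rather than simply swapping in a node.

```python
class Cluster:
    """Toy model: track idle nodes and which nodes each job is running on."""

    def __init__(self, nodes):
        self.idle = list(nodes)   # nodes not assigned to any job
        self.assignments = {}     # job name -> list of node names

    def start(self, job, count):
        """Assign the first `count` idle nodes to a job."""
        if count > len(self.idle):
            raise RuntimeError("not enough idle nodes")
        self.assignments[job] = self.idle[:count]
        self.idle = self.idle[count:]

    def fail(self, node):
        """Handle a dead node; return the job that was affected, if any.

        Only the job that was using the node is touched. It gets a
        replacement from the idle pool; other jobs keep running as before.
        """
        if node in self.idle:
            self.idle.remove(node)
            return None
        for job, nodes in self.assignments.items():
            if node in nodes:
                if not self.idle:
                    raise RuntimeError("no idle node available for " + job)
                nodes.remove(node)
                nodes.append(self.idle.pop(0))
                return job
        return None


# Three jobs on an eight-server cluster, as in the demo.
cluster = Cluster(["n%d" % i for i in range(8)])
for name in ("a", "b", "c"):
    cluster.start(name, 2)
affected = cluster.fail("n2")  # rip out one of job b's servers
```

After the failure, job b runs on n3 plus a spare drawn from the idle pool, while jobs a and c are untouched.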

By the second half of next year, Evergrid hopes to extend out of the HPC market and target smaller business clusters. It's looking, in particular, at the database and application server markets. Such customers might flock to Evergrid's tools for stopping jobs and then restarting them on new servers. This would allow a company to juggle different jobs with more flexibility than they have today.

For now, however, Evergrid will center on the HPC crowd that has enormous clusters built out of thousands of machines. Many HPC users deal with jobs that take days, weeks and sometimes months to process. They often have to restart these jobs from scratch due to system failures.

Evergrid has one patent and a handful more pending that it thinks will protect its IP from overzealous Linux coders who might come up with something similar. Some of you will be familiar with the company's CTO, Dr. Srinidhi Varadarajan, who built the massive G5 cluster at Virginia Tech and is credited with banging out much of the special "checking" sauce.

Evergrid claims two major customers at this point – an unnamed financial services company and the University of Oklahoma. The company has yet to set official pricing for its software, although Anderson guessed it will come in around "$250 per node with large volume discounts."

There's more information available here. ®
