Bright Computing revs up cluster manager
Fast provisioning, new Linuxes, CUDA, vSMP, and Python scripts
Bright Computing, which started from scratch several years ago to create a new, integrated cluster management tool, continues to build out the capabilities of the Bright Cluster Manager with the 5.2 release of the software, announced today at the International Supercomputing Conference (ISC'11) in Hamburg, Germany.
Matthijs van Leeuwen, CEO at Bright Computing, says the company continues to have an advantage over rivals in the open source and proprietary cluster management arena. Rather than taking system management and monitoring tools like Nagios and Ganglia and trying to stretch them out to cover clusters instead of individual systems, Bright created a brand new cluster manager, which launched in 2009 after several years of development and which can hook into other workload managers and grid schedulers to babysit them and make them behave.
Interestingly, Bright has no interest in creating such tools itself. "There is such a wide choice of workload managers already," van Leeuwen tells El Reg. "Writing a workload manager is a lot of work. We want to focus on cluster management and then expand into other areas, such as clouds."
The company's strategy is to borg as many different workload managers, schedulers, and other bits of software as HPC shops want to use on their clusters, and to be the most inclusive and most integrated cluster manager out there. Bright Cluster Manager 5.2 wraps around more tools than its predecessors.
The release now supports Nvidia's CUDA 4.0 parallel programming environment and can extract metrics from Nvidia's Tesla GPUs, showing what the GPUs out there on the cluster are doing through the cluster manager's GUI console.
BCM 5.2 also supports SLURM, which is short for Simple Linux Utility for Resource Management, an increasingly popular open source workload manager for Linux clusters that was created by the US Department of Energy supercomputing labs for their own use.
"Now that Grid Engine's future is uncertain, SLURM is becoming a default for people who want a free workload manager," says van Leeuwen, referring to the workload manager that Sun acquired several years ago before itself getting eaten by Oracle.
In addition to supporting SLURM in terms of integration with BCM 5.2, Bright will also provide tech support for SLURM if customers want that. (You can't exactly ring up Lawrence Livermore National Lab and ask for a bug fix.) The updated cluster manager from Bright also adds support for sometime rival Platform Computing's Load Sharing Facility (LSF) workload manager. BCM already supported PBS Professional, Torque/Moab, Torque/Maui, and Grid Engine as workload managers.
The 5.2 release from Bright also adds Python interfaces to its SOAP API set, allowing Python script kiddies to get at all the features in the cluster manager programmatically. BCM is itself written in C++, and its initial SOAP API stack provided C++ interfaces; over time, the company has added Perl and PHP interfaces to the SOAP stack.
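Because the management interface is plain SOAP, any language with a SOAP client – or just an XML library – can drive it. As a rough illustration, here is a minimal sketch of building a SOAP 1.1 request envelope in Python using only the standard library; the namespace, the method name (`getNodeStatus`), and the `hostname` parameter are hypothetical stand-ins for illustration, not Bright's actual API.

```python
# Minimal sketch: constructing a SOAP 1.1 request envelope with only
# the Python standard library. The service namespace and the method
# name "getNodeStatus" are hypothetical, NOT Bright's real API.
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"

def build_soap_request(method, params, ns="urn:example-cluster-api"):
    """Return a SOAP 1.1 request envelope as UTF-8 bytes."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{ns}}}{method}")
    for name, value in params.items():
        arg = ET.SubElement(call, name)       # one child element per argument
        arg.text = str(value)
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)

# Ask a (hypothetical) cluster manager about one compute node
request = build_soap_request("getNodeStatus", {"hostname": "node001"})
print(request.decode("utf-8"))
```

A real client would then POST this envelope to the cluster manager's SOAP endpoint over HTTP and parse the response body with the same XML tooling – which is essentially the plumbing the new Python bindings hide from the user.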
The new Bright Cluster Manager Web portal is written in PHP, and the PHP APIs allow BCM users to easily customize that portal. This portal, which is intended for users who want to see how their jobs are doing as they run on the cluster, keeps users from mucking about with the more powerful GUI that cluster administrators use to manage the clusters. Up until now, admins and users had access to all the same features, and that was, er, not too bright.
The new release of Bright Cluster Manager also knows how to wrap around ScaleMP's vSMP systems software, which is used to create a virtual symmetric multiprocessor out of Linux server nodes and which is increasingly used by some HPC shops to create fat memory nodes on the fly.
"We make vSMP really easy now," says van Leeuwen. "With a few mouse clicks, you can set up a shared memory system inside your cluster." BCM 5.2 provisions, monitors, and manages these vSMPs and can tell workload managers when to schedule jobs on them.
Red Hat Enterprise Linux 6.0, Scientific Linux 6, and SUSE Linux Enterprise Server 11 SP1 are all supported on BCM 5.2, which can provision nodes based on these recent Linuxes as well as RHEL 4 and 5, CentOS 4 and 5, and SLES 9 and 10. BCM does not, as yet, do Windows, or rather, Windows HPC Server 2008 R2, but it might if Windows takes off more in HPC data centers.
Big user drives requirements
Bright Computing is a spinout of ClusterVision, a supercomputer reseller based in Amsterdam, the Netherlands, which built the tool to manage the clusters it peddles to customers in Europe. ClusterVision decided to take BCM to a broader set of machines and set up shop for Bright Computing in San Jose.
Van Leeuwen is a bit cagey about how many customers Bright Computing has, but he says it is above 100 and below 1,000. Sandia National Lab and Lawrence Berkeley National Lab each run one of their big clusters with BCM, and the Texas Advanced Computing Center has five clusters managed by BCM. Dell is a reseller, and so is Cray, although Cray is using it to manage storage, oddly enough. The company has added six resellers across China and Japan and will be coming to India soon, says van Leeuwen.
One of Bright Computing's customers, an unnamed manufacturer in the United States, runs BCM on a 500-node cluster and drove a lot of the requirements in the 5.2 release, including the ability to provision and run nodes from memory-based RAMdisks instead of disk drives. Node provisioning is the first bottleneck, of course, and that big manufacturer wanted to be able to provision a node in 15 seconds and have a workload scheduler dispatch work to it in under 15 minutes – which Bright Computing says it can do with BCM 5.2.
Bright Cluster Manager has been tested managing as many as 5,000 nodes in the labs, and van Leeuwen says that he is confident that it could easily scale to 10,000 nodes and further. "There are very few meaningful applications that scale to that size," he says.
BCM is a closed-source program and comes in two versions. The Standard Edition is aimed at competing with Platform Computing's Platform Cluster Manager, and sells for $150 per node. The Advanced Edition sports more features, including redundant head nodes, cluster health management, and offloadable provisioning to distribute management work while keeping it synchronized across multiple cluster management daemons (CMDaemons in the Bright lingo); it costs $195 per node. ®