Blades carve up Top 500 supercomputers list
Climbing the totem pole
Comment With much-anticipated fanfare, the Top500 supercomputer list was announced last week at the International Supercomputing Conference in Dresden, Germany.
For the fourth time, the BlueGene/L system developed by IBM and the DOE's National Nuclear Security Administration, and installed at DOE's Lawrence Livermore National Laboratory, claimed the number one spot with a Linpack benchmark performance of 280.6 TFlop/s.
The second and third-placed systems also exceeded 100 TFlop/s: the upgraded Cray XT4/XT3 at DOE's Oak Ridge National Laboratory, with a benchmark performance of 101.7 TFlop/s, and Sandia National Laboratories' Cray Red Storm system, rated at 101.4 TFlop/s.
Dell's Abe system at NCSA, built on 1,200 PowerEdge blades, ranked number eight overall, making it the highest-performing blade solution. IBM's MareNostrum supercomputer, built on IBM BladeCenter JS21 blade servers, ranked number nine on the overall list to retain its position as the most powerful supercomputer in Europe.
The performance level required to make it onto the list increased to 4.005 TFlop/s on the Linpack benchmark, compared with 2.737 TFlop/s six months ago. The system ranked number 500 on the current list would have held position 216 just six months earlier, the largest turnover between lists in the Top500 project's 15-year history.
A total of 289 systems (~58 per cent) use Intel processors, 105 systems (21 per cent) feature AMD's Opteron family, and 85 systems (17 per cent) are based on IBM Power processors. Dual-core processors are the dominant chip architecture. Intel's Woodcrest dual-core chip showed the most growth, with 205 systems using this chip compared to 31 six months ago. Another 90 systems use dual-core Opteron processors, up from 75 six months ago.
Clusters remain the most common architecture in the Top500 list. The list has 373 clusters, representing just under three-quarters of all systems. HP Cluster Platform 3000BL and 4000BL systems, based on HP BladeSystem c-Class servers, accounted for 152 entries on the list. HP has the highest number of systems on the list with 40.6 per cent, IBM now has 38.4 per cent, and Dell holds 4.8 per cent, with no other manufacturer having more than five per cent of the systems listed.
For overall total performance represented, IBM has 41.9 per cent of installed performance, HP follows with 24.5 per cent, next is Dell at nine per cent, Cray with 7.3 per cent, and SGI at 5.7 per cent. The average age of a system in the Top500 list is one year and two months with 60 per cent having been installed or upgraded this year and 23 per cent having been installed or upgraded last year.
Not surprisingly, the sheer amount of computational horsepower listed on the Top500 has broken another record, and the table stakes required to play in this game have jumped by about 50 per cent compared with six months ago. The industry has proven remarkably resilient in shattering past assumptions about performance limits, which perhaps serves notice that we should simply stop trying to define where the limits of computing lie.
Nevertheless, what we find most interesting in the latest list is how prevalent, and how far up the totem pole, blade-based solutions have become. The Dell Abe system installed at NCSA ranked number eight overall, IBM's system in Barcelona followed in ninth place, and many of the entries from HP were based upon the BladeSystem c-Class.
These are remarkable achievements for an architecture that for many is becoming the de facto IT platform of the future. From an overall efficiency and cost perspective, this bodes well for the architecture and raises the question: in the world of HPC, does the blade server offer the best price per TFlop?
Granted, HPC systems are often highly customised, but the inherent reduction in cabling, power supplies, external switches, etc., would seem to translate even at HPC scale. It would be interesting to see just how the efficiency of the blade architecture scales in the highest-performance environments.
Overall, we are impressed with the continued growth in HPC capability, and in particular with the solutions from IBM, HP, and Dell that are based upon technologies commonly available to mere mortals albeit in much smaller footprints. The value of clustering, blades, and virtualisation technologies is very evident in the Top500.
To our way of thinking, this should help keep vendors very interested in pursuing these technologies in their general market offerings, which, after all, is where most vendors make the lion's share of their revenue. While HPC solutions of the highest scale typically do not deliver a profit margin that would keep most vendors afloat, the R&D investment has a substantial payback as vendors bring the innovations developed in solving some of the most computationally difficult tasks down to the general-purpose offerings on which the industry and organisations thrive.
Copyright © 2007, The Sageza Group
Francis should write the next top 500 article.
Increase the network bandwidth, increase the teraflops
But Steve, remember what's in the likes of the monolithic supercomputers? Lots of deskside chips, all on a bus. Now that network bandwidth has increased dramatically, the blade systems become a better alternative. It doesn't take that much bandwidth to farm out the work packets and bring back the results. And what do you do with all of that equipment when you upgrade? You can part it out quite easily instead of trying to interest someone in a 1-ton beast that requires 5 tons of cooling.
Misses the point
There are some significant points being missed in the analysis of supercomputer architectures. Beyond the ubiquity of x86 processors and the number of blade-based systems, there are deeper questions to ask.
For many purchasers of research supercomputers it is all about research for the dollar. This then becomes a self-reinforcing paradigm. If you are a researcher whose problem is a good fit for a commodity cluster, you get great research productivity, publish more, generally do well, and of course you either directly or indirectly influence the purchase of the next round of hardware. If you are a researcher whose problem is not a happy fit for a commodity cluster you do badly, get poor research productivity, and get squeezed out. Eventually you will move to a non-computational area of research. Thus for almost any university-based supercomputer a cluster is a given.
However, if you have a problem for which the money for a supercomputer is not the issue, but rather the simple need to get the job done, you buy the right machine for the problem. Here you see the big research labs, those with specific mandates for results in a particular area: atomic physics, weather and climate, protein folding, and so on. Look at those. You don't see a dominance of commodity clusters. You see a range of architectures, many without x86 processors, and often sporting exotic memory and interconnect architectures. They are bought because they are the right machine.
Many of the machines in the Top-500 will never be used as a single machine to solve a problem. Rather, they are task farms, or are used as a large number of small clusters serving many concurrent users. The one and only time they run a single job is when benchmarked to get a listing in the Top-500. Further, the Top-500 is seriously biased in favour of clusters. It uses a single, very simple benchmark based on Linpack, which is a very poor indicator of performance on real-world problems (other than Linpack itself). If you have enough memory on each node it is almost totally insensitive to interconnect latency, which is ludicrous. Careful tuning of the parameters (which the Top-500 rules allow) means you can tune the blocking factors to match the cache sizes, and pretty much tune out memory bandwidth as a limiting factor. Further, it makes no measure of IO speed, nor of problems that cannot be easily blocked for good cache performance. In short, the benchmark is, by an unfortunate quirk of history, almost perfectly matched to show a simple, cheap cluster in the best light.
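The blocking-factor tuning the comment refers to can be illustrated with a toy blocked matrix multiply, where the block size `nb` plays the role of the tunable parameter HPL submitters match to their cache size. A minimal Python sketch (purely illustrative; the real HPL code is far more elaborate):

```python
def blocked_matmul(A, B, n, nb):
    # Multiply two n x n matrices (lists of lists) in nb x nb blocks.
    # In Linpack/HPL terms, nb is the blocking factor: pick it so a
    # block of the working set fits in cache, and main-memory bandwidth
    # largely stops being the limiting factor.
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for jj in range(0, n, nb):
            for kk in range(0, n, nb):
                # Work entirely within one block triple before moving on,
                # so the touched rows of A, B, and C stay cache-resident.
                for i in range(ii, min(ii + nb, n)):
                    row_c = C[i]
                    for k in range(kk, min(kk + nb, n)):
                        a = A[i][k]
                        row_b = B[k]
                        for j in range(jj, min(jj + nb, n)):
                            row_c[j] += a * row_b[j]
    return C

if __name__ == "__main__":
    # Multiplying by the identity returns the original matrix,
    # whatever blocking factor is chosen.
    n = 4
    I = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [[float(i * n + j) for j in range(n)] for i in range(n)]
    assert blocked_matmul(I, M, n, 2) == M
```

The result is identical for any `nb`; only the memory access pattern changes, which is exactly why the benchmark rewards this tuning without measuring anything new about the machine.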
So those architectures that pay great attention to these issues, the real HPC issues, at significant cost are heavily disadvantaged. Almost everyone in the HPC arena will acknowledge that the Top-500 listing is mostly bogus, with a very poor relationship to real performance on real problems, but everyone wants to see their new machine in there, so no-one has managed to overturn it, despite some good work on better benchmarks. Years ago a colleague dubbed this behaviour "Gigaflop harlotry", one of the most apt phrases I have ever heard applied.
(As a closer, one notes that virtualisation is cited as a significant player in these HPC systems, which it most certainly is not. There is no role for virtualisation whatsoever. We do not want fractions of a processor shared across lots of problems; we want lots of full processors unified to solve one problem. In any HPC system, one of the first things one does with a hyperthread-capable Xeon is turn hyper-threading off, because leaving it on costs a small but noticeable amount of performance. Adding a hypervisor onto the base system would be insanity.)