What should replace Linpack for ranking supercomputers?

Our Big Iron man has a few ideas following Top500

ISC 2013 The Linpack Fortran benchmark, which has been used to gauge the relative performance of workstations and supercomputers for more than three decades, is looking a little long in the tooth. So, some of the people who love Linpack and know it best are proposing a new benchmark - with the mind-numbing name of High Performance Conjugate Gradient, or HPCG.

All system benchmark tests run their courses, and the most successful ones usually overstay their welcome and stop doing what they were designed to do: allow for the comparative ranking of systems in terms of performance across different architectures and software stacks in a meaningful way, and - if we are lucky - provide some insight into bang vs buck for different systems so that companies can weigh the relative economic merits of systems that can run a particular application.

Tests run on supercomputers mostly do the former, and rarely do the latter. That is a big problem that Michael Heroux, of Sandia National Laboratories, and Jack Dongarra, of both the University of Tennessee and Oak Ridge National Laboratory, tackled in their HPCG benchmark proposal paper, which you can read here (PDF).

Earlier this week, your friendly neighborhood El Reg server hack defended the Linpack test as the Top500 supercomputer rankings for June 2013 were announced at the International Supercomputing Conference in Leipzig, Germany. Linpack is important because it is fun to see the rankings twice a year, and it is an easy enough test to run that people actually do it with some enthusiasm on their machines.

Erich Strohmaier, one of the administrators of the Top500 who hails from Lawrence Berkeley National Laboratory, tells El Reg that there were around 800 submissions of Linpack reports for the June Top500 list, and a bunch of them were tossed out because there was something funky in them. Making it to the Top500 list is a big deal for a lot of supercomputer labs, and every national and district politician around the world loves to see a new machine come online in their sphere of influence for the photo-op to prove they are doing their jobs to protect our future. And moving down in the rankings is also a big deal, because if you are not moving ahead you are falling behind. Jysoo Lee, director of the KISTI supercomputing center in Korea, found this out when Korea's relative rankings slipped in the June list. You get phone calls from people in power who are not pleased.

So Linpack matters and it doesn't matter at the same time. It is a bit like miles-per-gallon fuel efficiency ratings for cars, or, if you live in New York, the letter-grade ratings for restaurants that the city mandates.

But when it comes to using Linpack as a relative performance metric, there are some issues. If you really want to hurt your head, delve into the HPCG paper put out by Heroux and Dongarra. The basic gist, as this server hack understands it, is that the kinds of codes that were initially deployed on parallel clusters fifteen years ago bore a much closer relationship to High Performance Linpack, or HPL, than many of today's codes do. (HPL is the parallel implementation of Linpack; earlier versions could only run on a single machine like a PC at the low end or a federated SMP server with shared memory or a vector processor at the high end.) This is one of the reasons why the University of Illinois has refused to run Linpack on the "Blue Waters" ceepie-geepie built by Cray. (Remember, Linpack is voluntary. It is not a ranking of the Top500 supercomputers as much as it is a ranking of the Top500 supercomputers from organizations that want to brag.)

Here's the central problem as Heroux and Dongarra see it:

"At the same time HPL rankings of computer systems are no longer so strongly correlated to real application performance, especially for the broad set of HPC applications governed by differential equations, which tend to have much stronger needs for high bandwidth and low latency, and tend to access data using irregular patterns. In fact, we have reached a point where designing a system for good HPL performance can actually lead to design choices that are wrong for the real application mix, or add unnecessary components or complexity to the system."

There is a lot of deep technical description in the paper, but here is the really simplified version. Linpack solves a dense system of linear equations - with matrix multiplication doing the bulk of the work - and scales up the size of the arrays to try to choke the machine and force it to reach its peak capacity across all of the computing elements in the cluster. The original Linpack from the dawn of time solved a dense 100 x 100 system of linear equations, then it moved to 1000 x 1000 as machines got more powerful, and then the Linpackers took the hood off and let the problem size scale, tweaking the HPL code to run across clusters and take advantage of modern networks and interconnects as well as coprocessors like Intel Xeon Phi and Nvidia Tesla cards.
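
To make that concrete, here is a toy, single-node sketch in Python using NumPy - our choice of illustration, not anything out of the paper or the real benchmark - that solves a dense Ax = b by LU factorization and converts the wall-clock time into gigaflops using the conventional 2n³/3 + 2n² operation count that Linpack uses:

    # Toy Linpack-style measurement: solve a dense Ax = b and report GFLOP/s.
    # Illustration only - the real HPL benchmark is a distributed, blocked C/MPI code.
    import time
    import numpy as np

    def toy_linpack(n=4096, seed=0):
        rng = np.random.default_rng(seed)
        A = rng.standard_normal((n, n))
        b = rng.standard_normal(n)

        start = time.perf_counter()
        x = np.linalg.solve(A, b)                 # LU factorization plus triangular solves
        elapsed = time.perf_counter() - start

        flops = (2.0 / 3.0) * n**3 + 2.0 * n**2   # conventional Linpack operation count
        residual = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
        print(f"n={n}: {elapsed:.2f} s, {flops / elapsed / 1e9:.1f} GFLOP/s, "
              f"residual {residual:.1e}")

    if __name__ == "__main__":
        toy_linpack()

HPL does the same thing at vastly larger scale, picking n so that the matrix nearly fills the aggregate memory of the machine and spreading the factorization across every node in the cluster.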

The kind of calculations and data access patterns in the Linpack test are what Heroux and Dongarra refer to as Type 1. The machine is loaded up with lots of floating point work, and the data can be organized so that the right data gets to the right FP unit at the right time, most of the time. With Type 2 patterns - which are more reflective of the kinds of differential equations increasingly used in simulations - the data access is less regular and the calculations are finer-grained and recursive. Those shiny new Xeon Phi and Tesla accelerators are really good at Type 1 calculations and not so hot (yet) on Type 2 calculations.
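
For a rough feel of the difference, here is a sketch - again in Python with NumPy and SciPy, and emphatically not the HPCG reference code - that puts the two patterns side by side: Type 1 is a dense matrix multiply with regular, predictable access, while Type 2 is a conjugate gradient loop over a sparse 2D Laplacian (a stand-in for the kind of operator a discretized differential equation produces), where each iteration is dominated by a bandwidth-hungry sparse matrix-vector product.

    # Type 1 vs Type 2 in miniature. Illustration only - not the HPCG code.
    import numpy as np
    import scipy.sparse as sp

    def type1_dense_gemm(n=2048, seed=0):
        # Regular, flop-heavy access pattern: a dense matrix multiply (HPL-friendly).
        rng = np.random.default_rng(seed)
        return rng.standard_normal((n, n)) @ rng.standard_normal((n, n))

    def laplacian_2d(m):
        # Standard 5-point Laplacian on an m x m grid, built via Kronecker products.
        T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
        I = sp.identity(m)
        return (sp.kron(I, T) + sp.kron(T, I)).tocsr()

    def type2_conjugate_gradient(A, b, tol=1e-8, max_iter=2000):
        # Irregular, bandwidth-bound pattern: each iteration is dominated by a
        # sparse matrix-vector product plus a handful of vector updates.
        x = np.zeros_like(b)
        r = b - A @ x
        p = r.copy()
        rs_old = r @ r
        for i in range(max_iter):
            Ap = A @ p                            # sparse mat-vec: scattered memory access
            alpha = rs_old / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol * np.linalg.norm(b):
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x, i + 1

    if __name__ == "__main__":
        m = 100
        A = laplacian_2d(m)
        b = np.ones(m * m)
        x, iters = type2_conjugate_gradient(A, b)
        print(f"CG finished after {iters} iterations, "
              f"residual {np.linalg.norm(b - A @ x):.2e}")

The dense multiply spends nearly all of its time doing arithmetic on data that streams through the caches in order; the CG loop spends most of its time chasing the sparse matrix's index arrays around memory, which is exactly why accelerators tuned for the former can look a lot less impressive on the latter.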

The upshot, say Heroux and Dongarra, is that designing a system to reach one exaflops on Linpack might result in a machine that is absolutely terrible at running real-world applications, thus defeating one of the primary purposes of a benchmark test. Benchmarks are a feedback loop for system designers, and are absolutely necessary. Here is an example of the divergence cited by Heroux and Dongarra on the "Titan" Opteron-Tesla ceepie-geepie down at Oak Ridge National Laboratory, with which they are intimately familiar.

Read this carefully:

"The Titan system at Oak Ridge National Laboratory has 18,688 nodes, each with a 16-core, 32 GB AMD Opteron processor and a 6GB Nvidia K20 GPU. Titan was the top-ranked system in November 2012 using HPL. However, in obtaining the HPL result on Titan, the Opteron processors played only a supporting role in the result. All floating-point computation and all data were resident on the GPUs. In contrast, real applications, when initially ported to Titan, will typically run solely on the CPUs and selectively off-load computations to the GPU for acceleration."
