HPE: Exafloppers need to be 'memory-centric' as world cannot afford internode data slinging
Got to keep purse-strings tight on that power budget, innit?
Analysis HPE Advanced Technologies veep Mike Vildibill has told El Reg that power budget limitations will end the current supercomputer model of having multiple independent nodes passing chunks of data between them.
HPE's exascale supercomputer thinking appears to be based on its Machine-like shared memory approach.
All the major supercomputer powers – China, Japan, the USA, and Europe – want to build exascale mega-machines capable of delivering 1 exaflop – that's one quintillion (10^18) floating-point math operations every second, a billion billion flops.
Today's supercomputers are petaflops (10^15) machines, with performance measured in units of a thousand million million flops. The exascale systems will be a thousand times faster.
The Intel-Cray Aurora system at the Argonne National Laboratory was going to run at 180 petaflops, with 50,000 separate x64 and third-generation Xeon Phi nodes, and an overall 13MW power consumption. Each node delivers, or would have delivered, 3.6 teraflops. However, Intel ran into a brick wall with the development of the Xeon Phi family, and has gone back to the drawing board. Thus, Aurora has been delayed from going live in 2018 to somewhere around 2021 or 2022 as Intel starts over.
The US Department of Energy (DoE) has a PathForward program to bankroll its Exascale Computing Project, and it awarded contracts to six vendors in June last year to fund exascale hardware, software, and application R&D. They were: AMD, Cray, HPE, IBM, Intel and Nvidia.
The PathForward program has the intention of delivering at least one exascale-capable system to American boffins by 2021. In March this year, a DoE budget request revealed three systems were being envisaged:
- Aurora – Intel/Cray-based to be delivered in 2021 at Argonne National Laboratory,
- Frontier – for 2021-2022 delivery to Oak Ridge National Laboratory,
- El Capitan – to be delivered to the Lawrence Livermore National Laboratory (LLNL) around 2023.
There are no suppliers pencilled in to deliver Frontier and El Capitan systems, and the six suppliers suggest three partnerships – Intel and Cray for Aurora being one, which leaves system suppliers HPE and IBM and CPU/GPU slingers AMD and Nvidia playing musical chairs.
The 2021 Aurora system has a 20 to 40MW power budget, and a nominal 50,000 node count, being a development of the original Aurora system.
From node-centric to memory-centric systems
HPE’s Vildibill said today's blueprints of tens of thousands of nodes, each with their own caches and memories and IO functions, won't reasonably scale to exascale.
You can’t take a 180 petaflops, 50,000 node, 13MW system and simply multiply it up to an exaflops beast. That would mean 5.6 times the number of nodes, 280,000, and, notionally, a 73MW power budget.
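For the curious, the back-of-the-envelope sum above can be sketched in a few lines of Python. This is a naive linear extrapolation, assuming per-node performance and per-node power draw stay fixed; the function and constant names are ours, purely for illustration. The 280,000 and 73MW figures fall out when the scale factor is rounded to 5.6 first; unrounded, it comes to a little under that.

```python
import math

EXAFLOP = 1e18           # one exaflop, in flops
AURORA_FLOPS = 180e15    # original Aurora target: 180 petaflops
AURORA_NODES = 50_000    # nominal node count
AURORA_POWER_MW = 13     # overall power consumption, in megawatts


def scale_to_exaflop(flops, nodes, power_mw, target=EXAFLOP):
    """Naive linear scaling: assumes each node keeps the same flops and watts."""
    factor = target / flops
    return {
        "factor": factor,
        "per_node_teraflops": flops / nodes / 1e12,
        "nodes": math.ceil(nodes * factor),
        "power_mw": power_mw * factor,
    }


result = scale_to_exaflop(AURORA_FLOPS, AURORA_NODES, AURORA_POWER_MW)
print(f"per node:     {result['per_node_teraflops']:.1f} teraflops")
print(f"scale factor: {result['factor']:.2f}x")
print(f"nodes needed: {result['nodes']:,}")
print(f"power needed: {result['power_mw']:.1f} MW")
```

Either way, the power figure lands well above the 20 to 40MW budget quoted for the 2021 Aurora system.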
Something has to give to meet the DoE's exascale power budget limits.
A problem comes when software on one node needs data that isn’t in its memory but is in another node’s memory. That other node has to receive a request for the chunk of data, package it up, and send it to the requesting node. This takes time, CPU cycles, IO channel work, and power – to wit: time and electricity.
In HPE's designs for exascale systems, processing nodes still have their own local RAM. However, it is built directly on top of or next to the CPU cores, within the same processor package, and not on an external bus like DIMM memory. These nodes are then interconnected using photonics – using pulses of light, in other words, to directly transfer information in and out of the chips.
The combination of RAM and light-speed networking closely and physically coupled to the compute cores should, in theory, allow computers to scale to exascale levels without devouring too much electricity. Photonics connections need less power than copper wires, after all. HPE fellow and VP Paolo Faraboschi has envisaged 10 teraflop processors with 1TB/sec links to this local memory.
Smartphones and other compact devices tend to stick their RAM on top of their CPU cores these days, so it's not impossible to imagine by any stretch.
In the mind of HPE's eggheads, tens of thousands, if not hundreds of thousands, of nodes and their vast pool of memory would be connected via a HyperX fabric [PDF] and photonics-based routers with high bi-directional port counts. HPE Labs researchers are developing all-silicon photonics devices that can switch a laser's light between many optical channels, delivering data at high speed across a network of nodes.
HyperX features an extension of hypercube and flattened butterfly topologies, and a DAL adaptive routing algorithm. The topology involves grouping processors into so-called dimensions. Each processor in a dimension has a direct connection to every other processor in the dimension.
Think of a dimension as being like a butterfly's wing. Every dimension has a direct connection to every other dimension. The in-dimension hop count is 1, and this number increases between a processor in one dimension and a processor in another – but still, HPE said, less than alternative fabric topologies.
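To put rough numbers on the hop-count claim, here is a minimal Python sketch. It assumes the HyperX definition from the published topology work rather than anything specific to HPE's exascale design: switches sit on a multi-dimensional grid, and each one is wired directly to every switch that differs from it in exactly one coordinate. Under that assumption, the minimal hop count between two switches is simply the number of coordinates in which they differ. The function names and the example grid shapes are ours.

```python
from itertools import product


def hyperx_min_hops(a, b):
    """Minimal hops between two switches: the number of differing coordinates."""
    if len(a) != len(b):
        raise ValueError("coordinates must have the same number of dimensions")
    return sum(1 for x, y in zip(a, b) if x != y)


def hyperx_diameter(shape):
    """Worst-case minimal hop count over every pair of switches in the grid."""
    switches = list(product(*(range(size) for size in shape)))
    return max(hyperx_min_hops(a, b) for a in switches for b in switches)


# Two switches that share all but one coordinate are a single hop apart.
print(hyperx_min_hops((0, 0, 0), (0, 3, 0)))   # 1

# 64 switches arranged as a 4x4x4 HyperX ...
print(hyperx_diameter((4, 4, 4)))              # 3

# ... versus 64 switches as a hypercube (a HyperX with two switches per dimension).
print(hyperx_diameter((2, 2, 2, 2, 2, 2)))     # 6
```

The trade-off is that fewer, fatter dimensions need more ports per switch, which is where the high port-count photonic routers come in.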
Code running on the processors therefore accesses the shared memory pool using the HyperX photonics fabric and load-store semantics. HPE envisages the Gen-Z memory-semantic protocol being used, with AMD being a Gen-Z consortium member. Memory-maker Micron is also a member, so perhaps it could provide the silicon-connected memory to be used with AMD CPUs in the co-packaged device.
In HPE's exascale scheme, software running on node processors uses and shares memory pointers to access data whether it's stored locally in RAM or in another node. If the interconnect links are fast enough – shifting several hundred gigabytes per second – information can be loaded, computed, and stored transparently and atomically across nodes efficiently enough to avoid bottlenecking the system.
This approach means blocks of data do not have to be explicitly copied between nodes, processed, and committed back to memory using locks and other mechanisms to prevent data races. Instead, it should all happen quickly, transparently and atomically at the photonics-based fabric layer, which should, ideally, bring down power usage and latencies.
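As a toy illustration of the pointer-based idea, and nothing more, the Python sketch below models a pool of memory spread across nodes where any node can load or store through a single global address. Everything here is hypothetical: the `MemoryPool` class, its address layout, and its counter are our inventions, and the sketch does not model Gen-Z, atomic operations, caching, or the photonics fabric.

```python
class MemoryPool:
    """Toy pooled address space: global address = node * words_per_node + offset."""

    def __init__(self, num_nodes, words_per_node):
        self.words_per_node = words_per_node
        self.nodes = [[0] * words_per_node for _ in range(num_nodes)]
        self.remote_accesses = 0  # counts accesses that cross to another node

    def _locate(self, addr, from_node):
        # Split the global address into (owning node, offset within that node).
        node, offset = divmod(addr, self.words_per_node)
        if node != from_node:
            self.remote_accesses += 1
        return node, offset

    def load(self, from_node, addr):
        node, offset = self._locate(addr, from_node)
        return self.nodes[node][offset]

    def store(self, from_node, addr, value):
        node, offset = self._locate(addr, from_node)
        self.nodes[node][offset] = value


pool = MemoryPool(num_nodes=4, words_per_node=1024)
ptr = 3 * 1024 + 7            # a "pointer" to word 7 in node 3's memory

pool.store(0, ptr, 42)        # node 0 writes through the pointer (remote)
print(pool.load(2, ptr))      # node 2 reads the same pointer (remote)
print(pool.load(3, ptr))      # node 3 reads it too (local)
print(pool.remote_accesses)   # two of the three accesses crossed nodes
```

The calling code never asks the owning node to package anything up; it just dereferences the address. Whether the remote case can be made cheap enough in real hardware is the open question.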
Essentially, HPE has to show that accessing RAM in a separate node is as fast as accessing it in the same node, when using its technology. After that, the CPU caches take over, and computations happen at high speed on the cached working set.
What all the above suggests is that HPE intends to dust off the remnants of its shared memory tech from its mothballed Machine research, knock it into shape, and deploy it in a DoE exascale project. We can imagine HPE picking AMD as its CPU partner and Micron as a memory supplier, as it throws its hat into the ring for either the Frontier or El Capitan system. ®