Original URL: http://www.theregister.co.uk/2012/06/19/sgi_uv_2000_xeon_super/

UV 2: RETURN of the 'Big Brain'. This time, it's affordable

Hefty loads bursting out of your box? Try this

By Timothy Prickett Morgan

Posted in Data Warehousing, 19th June 2012 07:21 GMT

Silicon Graphics is betting big on Intel's latest Xeon E5-4600 processor and its own revved up NUMAlink 6 shared memory interconnect, creating a "big brain computer" that can gang up to 4,096 cores into a single system image to run massive Linux workloads and fairly large Windows jobs, too. The new UV 2 is exactly the kind of box, says SGI, that customers with big data warehouse, big database, big data, and traditional HPC workloads have always wanted – and in many cases could never have afforded.

But the shift from Itanium and then Xeon E7 chips to new packaging and lower-cost Xeon E5 processors from Intel has made the shared memory systems from SGI more broadly accessible, at just the same time that many workloads seem to be busting out of general-purpose four-socket boxes. This is good news for SGI, which has had its share of financial woes as it chases the capricious and fiercely competitive HPC and hyperscale data center markets.

SGI will also be pleased to note that Intel has not yet got interconnect fabrics woven into its Xeon processors and chipsets, although it is clearly working on that with its acquisition of Cray's family of HPC interconnects back in April, its purchase of the InfiniBand chip and switch business from QLogic in January, and its purchase of the Ethernet switch chip business of Fulcrum Microsystems back in July 2011.

However, SGI still has a good window in which to capitalize on its NUMAlink interconnect before Intel does whatever it's going to do to integrate interconnects with its CPUs and chipsets. It would not be surprising to see SGI sell the NUMAlink biz to Intel for a big chunk of change, or maybe even an acquisitive Advanced Micro Devices or Hewlett-Packard. In fact, it would not be surprising at all if HP just upped and bought SGI to get out of its Itanium conundrum with Oracle. But so far, SGI seems content to go it alone and to peddle rack and shared memory systems all by its lonesome.

A rack's worth of SGI's UV 2000 supercomputer

SGI put out a bit of a preview on the UV 2 lineup when Intel launched the Xeon E5-4600 processors a little more than a month ago. At the time, the company said that it was switching away from the Xeon 7500 and E7 and their multiple QuickPath Interconnect (QPI) ports. SGI had also said it was moving away from the "Boxboro" 7500 chipset that it had used to interface with the NUMAlink 5 interconnect for lashing nodes tightly together in a memory-coherent fashion. The UV 1000 high-end machines were based on a two-socket blade.

The Xeon 7500 and E7 chips have four QPI ports coming off each socket. The original UV 1000 design used two of those ports to cross-link the two sockets on a blade, with one of the remaining ports going to the Boxboro chipset (which controls access to main memory and local I/O slots on the blade) and the other linking out to the NUMAlink 5 hub, which in turn has four links out to the NUMAlink 5 router. That router implements an 8x8 (paired node) 2D torus that can deliver up to 16TB of shared memory space across those 256 sockets.
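
For those keeping score at home, here's a minimal sketch (in Python, and purely illustrative) of how that UV 1000 port budget and memory ceiling shake out; the port split is the one described above, and the per-socket figure is simply 16TB divided across 256 sockets.

    # UV 1000 QPI port budget per socket, as described above, plus the
    # shared memory that 16TB across 256 sockets works out to per socket.
    qpi_ports_per_socket = {
        "cross-links to the sibling socket": 2,
        "Boxboro chipset (memory and local I/O)": 1,
        "NUMAlink 5 hub": 1,
    }
    assert sum(qpi_ports_per_socket.values()) == 4  # all four QPI ports accounted for

    shared_memory_tb = 16
    sockets = 256
    print(f"Shared memory per socket: {shared_memory_tb * 1024 // sockets} GB")  # 64 GB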

While SGI let it be known a month ago that it was ditching the Xeon E7s for the E5-4600s in the next-generation UV 2000 shared memory supers, the company did not say exactly how it was going to build these machines. (SGI had to save a little something to talk about at the International Super Computing conference in Hamburg, Germany this week, after all.) El Reg speculated that there would be a goosed interconnect and that SGI would stick to two-socket blades. We were right on the first count, but wrong on the second: because the Xeon E5-4600 has two fewer QPI ports than the Xeon 7500 and E7, a two-socket blade would have significantly diminished the bandwidth out to the interconnect. It was easier and cleaner to make what is in effect a microserver and use the QPI ports to double up out to the new NUMAlink 6 interconnect hub, and that is what SGI has done.

SGI would have no doubt preferred to build the original UV 1000 machines, which debuted in November 2009 and which spanned 128 blades and 256 sockets in a shared memory configuration, using cheaper Xeon 5500 and 5600 processors. But these chips have only one QPI port coming off their sockets and their on-chip memory controllers cannot address as much memory as the Xeon 7500s and E7s, so SGI had no choice but to use the fat Xeons in 2009 and await the less expensive E5-4600s here in 2012.

The memory expansion on the E5-4600 chip is the key to the rejiggered UV 2000 machine, since each processor socket can currently hold a dozen memory slots and address up to 384GB of memory without any external memory buffers or funky chipsets. But the real secret sauce in the UV 2000 is the NUMAlink 6 interconnect, which is a substantial re-engineering of the NUMAlink 5 interconnect that offers about 2.5 times the bandwidth and a much simpler system design as well.
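
As a back-of-the-envelope check, here's that per-socket ceiling worked out in Python; the 32GB DIMM size is our assumption, being the capacity that makes a dozen slots come out to 384GB.

    # Per-socket memory ceiling on the Xeon E5-4600, no buffers or funky chipsets.
    dimm_slots_per_socket = 12
    dimm_capacity_gb = 32  # assumed DIMM size that fills 12 slots out to 384GB

    print(f"Memory per socket: {dimm_slots_per_socket * dimm_capacity_gb} GB")  # 384 GB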

Jill Matzke, director of server marketing at SGI, says that with NUMAlink 6, a bunch of different things happened all at once. First, SGI's chip partner, Avago Technologies, did a process shrink, allowing more stuff to be crammed onto the chip. (Avago, which is a spinout of Agilent Technologies, itself a spinout from Hewlett-Packard, doesn't actually make the NUMAlink chips; a fab in Taiwan does.) So SGI could take two of the NUMAlink hubs and put them onto a single chip, and it could bring the NUMAlink router onto the ASIC for the first time. Equally important, some of the functions that the NUMAlink hub and router had to handle for the Xeon 7500 and E7 chips are now done by the Xeon E5s themselves – the PCI-Express controllers that have moved onto the processor, for instance. The upshot is a much simpler set of NUMAlink ASICs. (And you can see now why Intel wants to control the interconnects.)

With the UV 1000 design, there was a node controller in the blade chassis – which the nodes in the chassis shared – and a NUMAlink router at the top of the rack. With the UV 2000, more of the router functionality is contained in that NUMAlink hub/node controller that is on the system board and the node controllers are doubled up for bandwidth. You can scale across two racks of UV 2000 machines without using an external top-of-rack router.

But, says Matzke, if you want to add extra bandwidth across those E5-4600 unisocket blades, you can add NUMAlink 6 routers at the top of the racks, too. This allows customers to dial up the CPU and bandwidth scalability independently of each other with the UV 2000, something you could not do with the UV 1000. The NUMAlink 6 interconnect provides 6.7GB/sec of bi-directional bandwidth.

A blade server from the UV 2 super

The basic node on the UV 2000 has two single-socket servers with a vertical extender card sandwiched between the two stacked motherboards, linking them together with a NUMAlink 6 hub chip. This packaging is similar, in concept, to the "Gemini" blade used in the ICE X Xeon E5-2600 clusters that were previewed last November at SC11 and that started shipping in March of this year. A 10U chassis holds eight of these half-width nodes, with up to 128 cores and 4TB of memory. A single rack has four of these chassis, for up to 512 cores and 16TB of memory, and a fully loaded UV 2000 has eight racks for a total of 4,096 cores and 64TB of global shared memory. If Intel had switched on one more bit in the E5-4600 memory controller, SGI could have pushed the memory up to the full 128TB that it is physically possible to put in the 512 sockets of a fully loaded UV 2000 machine. But it didn't, so you can't.
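
Here's a quick sketch of that scaling arithmetic, assuming eight-core E5-4600 parts and 256GB per socket (which is what the 4TB-per-chassis figure implies); treat it as a sanity check rather than a spec sheet.

    # UV 2000 scaling arithmetic: node -> chassis -> rack -> full machine.
    cores_per_socket = 8        # assumes eight-core Xeon E5-4600 parts
    sockets_per_node = 2        # two single-socket boards per node sandwich
    nodes_per_chassis = 8
    chassis_per_rack = 4
    racks = 8
    mem_per_socket_gb = 256     # implied by 4TB per 16-socket chassis

    sockets = sockets_per_node * nodes_per_chassis * chassis_per_rack * racks
    cores = sockets * cores_per_socket
    physical_memory_tb = sockets * mem_per_socket_gb // 1024

    print(f"{sockets} sockets, {cores} cores, {physical_memory_tb} TB installable")
    # -> 512 sockets, 4096 cores, 128 TB installable, though the memory
    #    controller's addressing caps a single shared-memory image at 64TB.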

Counting up the cores and threads

The UV 1000 could span as many as 2,560 cores and 5,120 threads using the ten-core E7 chips, but the Linux kernel tops out at 4,096 threads at the moment, so that was as far as the thread count could be pushed. That limit has not changed in the Linux kernel, so a full-on 4,096-core UV 2000, if SGI ever built one, would top out at 4,096 cores and would not be able to take advantage of HyperThreading, which provides two virtual threads per core.
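
A minimal sketch of that thread-count arithmetic, assuming two hardware threads per core and the 4,096 logical CPU ceiling in the shipping kernels:

    # Why HyperThreading goes unused at full scale: cores alone hit the
    # kernel's logical CPU ceiling on the biggest configurations.
    kernel_cpu_limit = 4096

    configs = {
        "UV 1000": 256 * 10,    # 256 sockets of ten-core Xeon E7s
        "UV 2000": 512 * 8,     # 512 sockets of eight-core Xeon E5-4600s
    }
    for name, cores in configs.items():
        threads = cores * 2     # two virtual threads per core with HyperThreading
        usable = min(threads, kernel_cpu_limit)
        print(f"{name}: {cores} cores, {threads} threads, {usable} usable by Linux")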

Bill Mannel, vice president of product marketing at SGI, tells El Reg that the Linux kernel usually gets all of the features needed to support a future NUMAlink interconnect between 9 and 12 months ahead of launch, so the current Linux distros can already run on the new machine. Red Hat Enterprise Linux 6 and SUSE Linux Enterprise Server 11 are supported right out of the box (er, right from the download), and presumably SGI will soon offer certification on Windows Server 2008 R2 and is working on support for the forthcoming Windows Server 2012, due later this year or early next. SGI has been making a big deal in the past year that it can push Windows Server 2008 R2 to its limit of 256 cores without breaking a sweat. It will be interesting to see if Microsoft and SGI will patch Windows to do a better job of scaling across UV 2000s and at least try to compete with Linux on this machine.

The UV 2000 has basically twice the cores of the UV 1000 it replaces and supports four times the global shared memory, and the local read latency on a node is 80 nanoseconds on the UV 2000, compared to 130 nanoseconds on the UV 1000. The full read latency from distant nodes is under 1 microsecond for both machines – you have more than double the bandwidth, but you also have more nodes to span. The UV 1000 could deliver around 6 teraflops per rack, and the UV 2000 delivers 11 teraflops. (This assumes top bin parts in both cases, presumably.) The machine has an aggregate of 4TB/sec of I/O bandwidth across its PCI-Express 3.0 slots, too.
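
For the curious, the generational ratios implied by those figures (computed straight from the numbers quoted above, nothing more):

    # UV 1000 versus UV 2000, per-rack peak and local read latency.
    uv1000 = {"tflops_per_rack": 6.0, "local_read_ns": 130}
    uv2000 = {"tflops_per_rack": 11.0, "local_read_ns": 80}

    print(f"Per-rack peak: {uv2000['tflops_per_rack'] / uv1000['tflops_per_rack']:.1f}x")         # ~1.8x
    print(f"Local read latency: {uv1000['local_read_ns'] / uv2000['local_read_ns']:.1f}x lower")  # ~1.6x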

Both the UV 1000 and UV 2000 machines can scale out beyond their 128 and 256 node limits (that's the production scalability on the machine). What you do is use InfiniBand to link a bunch of UV blades together in a fat tree configuration, and then use the NUMAlink interconnect to lash those clusters together into a larger cluster that has globally addressable memory but not tightly coupled shared memory. You can do a maximum of 16,384 sockets across 128 racks with such a monster configuration, which would give you around 1.41 petaflops of number-crunching power and 8PB of addressable memory.
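
That petaflops figure falls straight out of the per-rack number quoted earlier, as a quick check shows:

    # 128 racks at the UV 2000's quoted 11 teraflops per rack.
    racks = 128
    tflops_per_rack = 11

    print(f"Peak: {racks * tflops_per_rack / 1000:.2f} petaflops")  # ~1.41 PF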

Price point

What has SGI excited about the UV 2000 is not just the increased processing and memory scalability, but the lower cost of the machines compared to the UV 1000. A base UV 1000 node with two eight-core Xeon E7s and 32GB of memory cost $50,000, but Mannel says that SGI can put a UV 2000 node with two Xeon E5-4600s and 32GB into the field for $30,000. That's a 40 per cent price cut, and it will go a long way toward expanding the addressable market of the UV machines if all of the other parts of the machine (extra routers and so on) don't add too much to the cost of a configured system.
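
The arithmetic on that price cut, from the two node prices Mannel quotes:

    # Base node pricing, UV 1000 versus UV 2000.
    uv1000_node = 50_000    # two eight-core Xeon E7s, 32GB
    uv2000_node = 30_000    # two Xeon E5-4600s, 32GB

    print(f"Price cut per node: {1 - uv2000_node / uv1000_node:.0%}")  # 40%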

SGI will be stressing the memory bandwidth and capacity of the UV 2000 compared to big SMP servers and to flash memory arrays. SGI says that an eight-socket ProLiant DL980 from Hewlett-Packard with 80 Xeon E7 cores running at 2.26GHz and 1TB of main memory will cost $93,000 and get you around 7.5 gigaflops for every thousand bucks you spend on the machine. A 128-core UV 2000 with 2.6GHz Xeon E5-4600s and 1TB of memory will cost you $98,000, but you will get over 14 gigaflops for every grand you spend. As for the comparison with flash, SGI says that two Dell rack servers with 1.2TB of high-end Fusion-io flash memory will give you read/write bandwidth in the range of 2.5GB/sec to 3GB/sec at a latency of between 15 and 47 microseconds, but if you instead use four UV 2000 node enclosures (that's eight CPU sockets) equipped with 1TB of memory, you get read/write bandwidth of around 236GB/sec and a latency of between 100 and 500 nanoseconds. That's 100 times the performance and 35 times better bang for the buck for the UV 2000 nodes, says Mannel. That is also another way of saying that the UV 2000 nodes are three times as expensive, but you'd expect that when comparing main memory and flash devices.
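
Working the flash comparison back from the quoted ranges (using midpoints, which is our own rough assumption), the ratios land in the same ballpark as SGI's round numbers:

    # Fusion-io flash versus UV 2000 main memory, from the figures quoted above.
    flash_bw_gbs = (2.5 + 3.0) / 2      # GB/sec, midpoint of the quoted range
    flash_latency_us = (15 + 47) / 2    # microseconds
    uv_bw_gbs = 236
    uv_latency_us = (0.1 + 0.5) / 2     # 100 to 500 nanoseconds

    print(f"Bandwidth advantage: ~{uv_bw_gbs / flash_bw_gbs:.0f}x")        # ~86x
    print(f"Latency advantage: ~{flash_latency_us / uv_latency_us:.0f}x")  # ~103x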

The other tack SGI's sales pitch will take is to convince customers that they can run multiple jobs inside a single UV 2000 machine, supporting multiple users doing related work.

Running multiple jobs on the UV 2000

For instance, you can put a block of Nvidia Quadro graphics cards in a chassis in the UV 2000 rack, along with some Infinite Storage arrays and the UV 2000 enclosures, and create a machine that can handle all aspects of virtual product design inside the complex. This includes preprocessing, mesh generation, and model decomposition for the design; running solver programs to do the design; and then post-processing and visualization to actually see the design. SGI is allowing customers to plug Intel's forthcoming "Knights Corner" MIC x86 coprocessors into the chassis, as well as Nvidia's Tesla family of GPU coprocessors, but has not gone so far as to link one Xeon socket to one GPU as other system designs are doing.

SGI is already shipping one UV 2000 system to a customer in the United States (a "well known auto company") and is working on manufacturing a couple more that will soon ship to customers in Europe, including the UK Computational Cosmology Consortium at Cambridge University where physicist Stephen Hawking gets his paycheck.

In addition to the UV 2000 high-end machine, SGI is also kicking out a four-socket, 2U rack server called the UV 20 that is intended to be a development machine for the UV 2000. ®