Intel bolts bonus gubbins onto Skylake cores, bungs dozens into Purley Xeon chips
Inside Chipzilla's new security measures
Posted in Servers, 12th July 2017 03:19 GMT
Deep dive Intel has taken its Skylake cores, attached some extra cache and vector processing stuff, thrown in various other bits and pieces, and packaged them up as Xeon CPUs codenamed Purley.
In an attempt to simplify its server chip family, Chipzilla has decided to rebrand the components as Xeon Scalable Processors, assigning each a color depending on the sort of tasks they're good for. It's like fan club membership tiers. There's Platinum for big beefy parts to handle virtualization and mission-critical stuff; Gold for general compute; and Silver and Bronze for moderate and light workloads.
Before we get stuck in, here's a summary of the base specifications, compared to last year's Broadwell-based Xeon E5 v4 gang, plus the system architecture and socket topology. You'll notice not only is there an uptick in core count to boost overall performance, there's a mild leap in power consumption, too...
And here's Intel's slide laying out the main changes between the Skylake desktop cores and the Skylake cores in the Scalable Processor packages – AVX-512 vector processing, and more L2 cache, plus some other bits, basically. The ports in the diagram below refer to ports from the out-of-order instruction scheduler that feeds instructions into the core's various processing units – see the microarchitecture diagram further down for a wider context.
So let's look at what you can now order, or at least enquire about, from today:
Xeon Platinum 81xx processors
Up to 28 cores and 56 hardware threads, can slot into one, two, four or eight sockets, can clock up to 3.6GHz, and each has 48 PCIe 3.0 lanes, six memory channels handling 2666MHz DDR4 and up to 1.5TB of RAM, up to 38.5MB of L3 cache, three UPI interconnects, AVX-512 vector processing with two fused multiply-add units (FMAs) per core. The power really depends on the part, going all the way up to about 200W.
Xeon Gold 61xx processors
Up to 22 cores and 44 hardware threads, can slot into one, two or four sockets, can clock up to 3.4GHz, and each has 48 PCIe 3.0 lanes, six memory channels handling 2666MHz DDR4 and up to 768GB of RAM, up to 30.25MB of L3 cache, three UPI interconnects, AVX-512 vector processing with two FMAs per core. The power really depends on the part, going all the way up to about 200W.
Xeon Gold 51xx processors
Up to 14 cores and 28 hardware threads, can slot into one, two or four sockets, can clock up to 3.7GHz, and each has 48 PCIe 3.0 lanes, six memory channels handling 2400MHz DDR4 and up to 768GB of RAM, up to 19.25MB of L3 cache, two UPI interconnects, AVX-512 vector processing with a single FMA per core. The power really depends on the part, going up to about 100W.
Xeon Silver 41xx processors
Up to 12 cores and 24 hardware threads, can slot into one, two or four sockets, can clock up to 2.2GHz, and each has 48 PCIe 3.0 lanes, six memory channels handling 2400MHz DDR4 and up to 768GB of RAM, up to 16.5MB of L3, two UPI interconnects, AVX-512 vector processing with a single FMA per core. The power really depends on the part, going up to about 85W.
Xeon Bronze 31xx processors
Up to eight cores and eight hardware threads, can slot into one or two sockets, can clock up to 1.7GHz, and each has 48 PCIe 3.0 lanes, six memory channels handling 2133MHz DDR4 and up to 768GB of RAM, up to 11MB of L3 cache, two UPI interconnects, AVX-512 vector processing with a single FMA per core. The power really depends on the part, going up to about 85W.
Intel has made a handy decoder chart for the part numbers. We note that the old Xeon E5 and E7 families map to the Gold 5xxx group.
So what's actually new? What makes these server-grade Skylakes as opposed to the Skylakes in desktops and workstations? The big change is Intel's new mesh design. Previously, Chipzilla arranged its Xeon cores in a ring structure, spreading the shared L3 cache across all the cores. If a core needed to access data stored in an L3 cache slice attached to another core, it would request this information over this ring interconnect.
This has been replaced with a mesh design – not unheard of in CPU design – that links up a grid of cores and their L3 slices, as seen in the Xeon Phi family. This basically needed to happen in order to support more cores in an efficient manner. The ring approach only worked well up to a point, and that point is now: if you want to add more cores and still get good bandwidth and low latency when accessing the shared L3 pool, a mesh – while more complex than a ring – is the way forward.
That's a fine mesh you've got me into ... The red lines represent bidirectional transfer paths and the yellow squares are switches at mesh intersections
A core accessing an adjacent core's L3 cache, horizontally or vertically, takes one interconnect clock cycle, unless it has to hop over an intersection, in which case it takes three cycles. The mesh is clocked somewhere between 1.8 and 2.4GHz depending on the part and whether or not turbo mode is engaged. So in the diagram above, a core in the bottom right corner accessing the L3 cache of the core to its immediate left takes one cycle, and four cycles to reach the next cache along on the left (one cycle for the first hop, then three for the second).
Speaking of caches, the shared L3 blob has been reduced from 2.5MB per core to 1.375MB per core, while the per-core private L2 has been increased from 256KB to a fat 1MB. That makes the L2 the primary cache with the L3 as an overflow. The L3 has also switched from inclusive to non-inclusive, meaning lines of data in the L2 may not exist in the L3. In other words, data fetched from RAM directly fills the core's L2 rather than both the L2 and the L3.
This is supposed to be a tune-up to match patterns in data center application workloads, particularly virtualization where a larger private L2 is more useful than a fat shared L3 cache.
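For the arithmetically inclined, the 1.375MB-per-core figure lines up with the maximum L3 sizes quoted in the family rundown earlier. Here's a quick Python sketch that multiplies it out, using only the core counts listed in this article; the names are ours, purely for illustration, and cut-down parts within each family may well differ.

```python
# Per-core L3 slice size quoted in the article, in MB.
L3_PER_CORE_MB = 1.375

# Maximum core counts per family, as listed in the article.
MAX_CORES = {
    "Platinum 81xx": 28,
    "Gold 61xx": 22,
    "Gold 51xx": 14,
    "Silver 41xx": 12,
    "Bronze 31xx": 8,
}


def max_l3_mb(cores, per_core_mb=L3_PER_CORE_MB):
    """Total shared L3 if every core brings one full slice."""
    return cores * per_core_mb


l3_totals = {family: max_l3_mb(cores) for family, cores in MAX_CORES.items()}

for family, total in l3_totals.items():
    print(f"{family}: up to {total}MB of L3")
```

Run it and the totals match the "up to" L3 numbers in the list above, from 38.5MB for the 28-core Platinum down to 11MB for the eight-core Bronze.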
You can also carve up a die into sub-NUMA clusters, a system that supersedes the previous generation's cluster-on-die design. This – as well as the mesh architecture, various new power usage levels, and the new inter-socket UPI interconnect – is discussed in detail, and mostly spin free, by Intel's David Mulnix here. UPI is, for what it's worth, a coherent link between processors that replaces QPI.
There's also an interesting new feature called VMD aka Intel's volume management device: this consolidates PCIe-connected SSDs into virtual storage domains. To the operating system, you just have one or more chunks of flash whereas underneath there are various directly connected NVMe devices. This technology can be used to replace third-party RAID cards, and it is configured at the BIOS level. The Purley family also boasts improvements to the previous generation's memory reliability features for catching bit errors.
While these new Xeons share many features present in desktop Skylake cores, there's another new thing called mode-based execute (MBE) control. This is supposed to stop malicious changes to a guest kernel during virtualization. It splits the single execute-enable bit in extended page table entries in two, so a hypervisor can separately allow execution of a page in user mode and in supervisor (aka kernel) mode. By ensuring executable kernel pages cannot be writeable, and that nothing else is executable in kernel mode, a hypervisor can prevent guest kernels from being tampered with and hijacked by miscreants exploiting security bugs. This is detailed in section 3.1.1 in this Intel datasheet.
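To make the idea concrete, here's a toy Python model of how a hypervisor might reason about page permissions with MBE switched on. It's a sketch, not Intel's hardware logic: the bit positions follow our reading of the EPT entry format in Intel's software developer manual (bit 2 for supervisor-mode execute, bit 10 for user-mode execute), and the function names are made up for illustration.

```python
# EPT permission bits (assumed positions, per our reading of Intel's SDM).
EPT_READ = 1 << 0
EPT_WRITE = 1 << 1
EPT_EXEC_SUPERVISOR = 1 << 2   # plain "execute" bit when MBE is off
EPT_EXEC_USER = 1 << 10        # only consulted when MBE is on


def can_execute(entry, supervisor_mode, mbe_enabled=True):
    """Would an instruction fetch from this page be allowed?"""
    if not mbe_enabled:
        # Without MBE, one bit covers both user and kernel execution.
        return bool(entry & EPT_EXEC_SUPERVISOR)
    bit = EPT_EXEC_SUPERVISOR if supervisor_mode else EPT_EXEC_USER
    return bool(entry & bit)


def kernel_page_is_tamperable(entry):
    """True if a page is both writeable and kernel-executable."""
    return bool(entry & EPT_WRITE) and bool(entry & EPT_EXEC_SUPERVISOR)


# Hypothetical guest pages set up by a cautious hypervisor.
kernel_code = EPT_READ | EPT_EXEC_SUPERVISOR
user_code = EPT_READ | EPT_EXEC_USER
data_page = EPT_READ | EPT_WRITE

print(can_execute(kernel_code, supervisor_mode=True))   # True
print(can_execute(user_code, supervisor_mode=True))     # False
print(kernel_page_is_tamperable(kernel_code))           # False
```

The point of the split is the middle line of output: a page the guest can execute in user mode, or scribble over as data, never becomes runnable in kernel mode unless the hypervisor says so.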
Under the hood
That MBE protection mechanism isn't present in client flavors of Skylake processors, we're told. The Purley chips do have other Skylake security features, such as page protection keys. These are described in section 4.6.2 of volume 3a in Intel's software developer manual. Basically, each page in a process's virtual memory space is tagged, in its page table entry, with one of 16 protection keys, or domains. Each thread then uses the RDPKRU and WRPKRU instructions to read and write its PKRU register, a 32-bit word holding two bits per domain that control the current thread's access permissions. In this 32-bit word, each domain can be individually marked read-write, read-only, or inaccessible. So, for example, one domain could be marked read-only and all other domains marked as inaccessible. If the running thread, in this case, tries to access a page in a domain that's locked out, or tries to write to any page at all, it will trigger an exception. It can only read from the pages belonging to the domain marked read-only.
This is on top of the usual per-page read-write permissions. It effectively allows you to, for instance, ensure that consumer threads only have read access to areas of the virtual address space containing certain blobs of data, and producer threads have read-write access. This application hardening requires support from the compiler toolchain, operating system, and hypervisor if necessary, to work properly.
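Here's roughly what that 32-bit PKRU word looks like, sketched in Python. The layout is the one in Intel's manual as we understand it: each of the 16 domains gets two bits, an access-disable bit at position 2i and a write-disable bit at position 2i+1. The helper names are ours, and real code would hand the value to WRPKRU, or to whatever wrapper the operating system provides, rather than just printing it.

```python
NUM_KEYS = 16  # 16 domains, two PKRU bits each

READ_WRITE = "read-write"
READ_ONLY = "read-only"
NO_ACCESS = "inaccessible"


def pkru_value(permissions, default=NO_ACCESS):
    """Build a 32-bit PKRU word from a {key: permission} mapping."""
    value = 0
    for key in range(NUM_KEYS):
        perm = permissions.get(key, default)
        if perm == NO_ACCESS:
            value |= 1 << (2 * key)       # AD: access disable
        elif perm == READ_ONLY:
            value |= 1 << (2 * key + 1)   # WD: write disable
        elif perm != READ_WRITE:
            raise ValueError(f"unknown permission: {perm!r}")
    return value


def decode(value, key):
    """Return the permission a PKRU word grants for one key."""
    access_disabled = (value >> (2 * key)) & 1
    write_disabled = (value >> (2 * key + 1)) & 1
    if access_disabled:
        return NO_ACCESS
    return READ_ONLY if write_disabled else READ_WRITE


# The article's example: one domain (key 5, picked arbitrarily) is
# read-only and the other 15 are locked out.
example = pkru_value({5: READ_ONLY})
print(hex(example))  # 0x55555955
```

A consumer thread would load a word like this one, while a producer thread would mark the same domain read-write, giving the split described above without touching the page tables.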
Finally, there's also Intel's Memory Protection Extensions (MPX), documented here, which allow threads to define upper and lower bounds for memory accesses and check they are not exceeded when working on buffers, which may help prevent buffer overflow and underflow attacks. This requires a lot of setup by the toolchain, introducing some overhead, and is expected to be used during app development rather than in production.
Next, there's better support for time stamp counters in virtualization, providing a friendlier way for virtual machines to access these counters when migrating across platforms, trimming a small overhead. It's documented again in volume 3a of the developer's manuals, in sections 24.6.5, 5.3, and 220.127.116.11. Diving deeper, there's the wonderfully CISCy CLFLUSHOPT instruction for flushing a cache line with a lower latency, and CLWB for writing a cache line to RAM without invalidating it.
The Purley gang supports AVX-512 as found in the high-end Xeon Phi family. This opens up floating-point and integer calculations using 512-bit vectors.
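What does that buy you on paper? A back-of-the-envelope Python sketch: a 512-bit vector holds eight 64-bit doubles, a fused multiply-add counts as two floating-point operations per element, and a core has one or two FMA units depending on the family. The 2.0GHz clock in the usage line is a made-up placeholder, not an Intel figure, and sustained clocks under heavy AVX-512 load tend to sit below the headline numbers, so treat the result as a theoretical ceiling.

```python
def peak_flops_per_cycle(fma_units, vector_bits=512, element_bits=64):
    """Theoretical floating-point ops per cycle for one core."""
    lanes = vector_bits // element_bits   # elements per vector
    return lanes * 2 * fma_units          # multiply + add per lane


def peak_gflops(cores, ghz, fma_units, element_bits=64):
    """Theoretical chip-wide peak at an assumed sustained clock."""
    return cores * ghz * peak_flops_per_cycle(fma_units, element_bits=element_bits)


print(peak_flops_per_cycle(fma_units=2))   # 32 (double precision)
print(peak_flops_per_cycle(fma_units=1))   # 16 (double precision)

# Hypothetical: 28 cores, two FMAs each, at an assumed 2.0GHz.
print(peak_gflops(cores=28, ghz=2.0, fma_units=2))   # 1792.0
```

In other words, the single-FMA Gold 51xx, Silver and Bronze parts top out at half the per-cycle vector throughput of the Platinum and Gold 61xx chips, before clocks and core counts even enter into it.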
This is a good time to note that the 48 lanes of PCIe 3.0 are split into three independent pipelines. You can have one VMD domain per x16 PCIe.
The Lewisburg C62x chipset that accompanies Purley has an integrated X722 Ethernet controller that can handle up to four 10Gbps ports, features Intel's QuickAssist tech, offers up to 14 SATA 3 interfaces, up to 10 USB 3.0 as well as 14 USB 2.0 ports, and packs in TPM 2.0, vPro, AMT, Node Manager 4.0, NVM Express support, and RSTe.
We're told the QuickAssist acceleration can churn through up to 100Gbps of IPSec and SSL-encrypted data, and perform up to 100,000 2048-bit RSA key decryptions per second, 100Gbps of Deflate compression, and 150Gbps of AES-128 CBC ciphering in 4KB blocks. Your mileage may vary on these speeds, but in any case, it means servers handling encrypted network traffic and data can offload this work to the hardware, which is nice for everyone's security.
The chipset includes not only Intel's usual Management Engine, but also the annoyingly titled Innovation Engine: this is a tiny computer within the server that can be used to monitor and remotely manage the machine. It is powered by a 412-DMIPS Intel Quark x86 CPU with 1.4MB of RAM and the usual ME interfaces plus a standard serial interface.
Crucially, system builders are expected to provide the IE's firmware, not Intel, to customize their products. In other words, server box makers can bundle extra firmware that runs in the IE below any running operating systems and hypervisors. You don't just have to worry about Intel's buggy and potentially insecure ME code running out of sight; there may be vendor-bundled software, too.
We're told these specifications are not set in stone – so take with a pinch of salt
† Four ports total: ports 0 & 1 can run up to 10GbE, while ports 2 & 3 are limited to 1GbE
And for chip design nerds like your humble hack, here's a peek at the microarchitecture changes within the server Skylakes:
Again, Intel is keen to highlight the fat 1MB private L2 cache per core, the AVX-512 with one or two FMAs per core, a heftier branch predictor, and various other tweaks to accommodate demanding programs.
Finally, for comparison, Intel reckons its 28-core 165W Xeon Platinum 8176 scores 41 per cent better than a 2016 22-core 145W Broadwell E5-2699 v4 in the SPECINT_RATE_BASE2006 compute benchmark, 53 per cent better in SPECFP_RATE_BASE2006, 64 per cent in Stream Triad memory access tests, and 98 per cent in the Linpack round. The 28-core 205W Platinum 8180 does even better: 53 per cent in SPECINT_RATE_BASE2006, 64 per cent in SPECFP_RATE_BASE2006, also 64 per cent in Triad, and 125 per cent in Linpack.
It's not too surprising to see that, by increasing the core count and opening up the power throttle, you get a faster family of chips, once you've tuned and optimized the design.
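To see how much of that uplift comes from simply throwing more cores and watts at the problem, here's a rough Python sketch that divides Intel's claimed gains by the increase in core count and in rated power. It uses only the figures quoted above; the wattages are the rated numbers, not measured draw, the benchmark labels are our shorthand, and the whole thing is napkin math rather than a proper efficiency analysis.

```python
# 2016 baseline: Broadwell Xeon E5-2699 v4.
BASELINE = {"cores": 22, "watts": 145}

# Intel's claimed gains over that baseline, as fractions.
SKYLAKE = {
    "Platinum 8176": {
        "cores": 28, "watts": 165,
        "gains": {"specint_rate": 0.41, "specfp_rate": 0.53,
                  "stream_triad": 0.64, "linpack": 0.98},
    },
    "Platinum 8180": {
        "cores": 28, "watts": 205,
        "gains": {"specint_rate": 0.53, "specfp_rate": 0.64,
                  "stream_triad": 0.64, "linpack": 1.25},
    },
}


def uplift_per_unit(chip, benchmark, unit):
    """Speedup divided by the growth in cores or watts."""
    part = SKYLAKE[chip]
    speedup = 1.0 + part["gains"][benchmark]
    resource_ratio = part[unit] / BASELINE[unit]
    return speedup / resource_ratio


for chip in SKYLAKE:
    for unit in ("cores", "watts"):
        ratio = uplift_per_unit(chip, "specint_rate", unit)
        print(f"{chip}: {ratio:.2f}x SPECint rate per unit of {unit}")
```

By this crude measure, the 8176's 41 per cent integer gain shrinks to roughly 11 per cent per core, while Linpack still comes out well ahead on a per-core basis, which is about what you'd expect from a benchmark that leans on the new AVX-512 units.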
Intel is pitching this stuff not just at traditional data center applications but also supercomputing and AI, of course, and software-defined networking at telcos and other comms carriers. AT&T, for one, has been using these new processors since March in production.
Don't forget to check out Timothy Prickett Morgan's analysis of today's launch, along with a table summarizing the available options and dollar-per-clock comparisons, over on our sister site The Next Platform. We'll come back to the Xeons' architecture soon in our dive into AMD's Epyc and Ryzen designs. ®