Intel bolts bonus gubbins onto Skylake cores, bungs dozens into Purley Xeon chips
Inside Chipzilla's new security measures
Under the hood
That MBE protection mechanism isn't present in client flavors of Skylake processors, we're told. The Purley chips do have other Skylake security features, such as page protection keys. These are described in section 4.6.2 of volume 3a of Intel's software developer manual. Basically, each page of a thread's virtual memory space is tagged with one of 16 protection keys via its page-table entry, and the thread uses the RDPKRU and WRPKRU instructions to read and write its PKRU register: a 32-bit word that controls the running thread's access to pages tagged with each key. In this 32-bit word, each key can be individually marked read-write, read-only, or inaccessible. So, for example, pages tagged with one key could be marked read-only and pages tagged with all the other keys marked inaccessible. If the running thread, in this case, tries to touch a locked-out page, or write to the read-only one, it will trigger an exception. It can only read from the memory tagged with the accessible key.
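To make that 32-bit word concrete, here's a sketch of how a PKRU value could be assembled, based on the layout in Intel's manual: each key gets two bits, an access-disable bit and a write-disable bit. The helper below is illustrative, not Intel code.

```python
# Sketch of the PKRU register layout from Intel's SDM: each of the
# 16 protection keys gets two bits in the 32-bit word,
# bit 2k = AD (access disable), bit 2k+1 = WD (write disable).
AD, WD = 0b01, 0b10

def pkru_word(perms):
    """Build a PKRU value from 16 per-key permissions:
    'rw' (full access), 'ro' (read-only) or 'none' (inaccessible)."""
    bits = {"rw": 0, "ro": WD, "none": AD | WD}
    word = 0
    for key, perm in enumerate(perms):
        word |= bits[perm] << (2 * key)
    return word

# The example from the text: one key read-only, the rest locked out.
# Key 0, which tags ordinary untagged memory, is left read-write here.
perms = ["rw"] + ["none"] * 15
perms[5] = "ro"
print(hex(pkru_word(perms)))  # the word a thread would load via WRPKRU
```

On Linux, a thread wouldn't write PKRU by hand; glibc's `pkey_alloc()`, `pkey_mprotect()` and `pkey_set()` wrappers handle the tagging and the register update.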
This is on top of the usual per-page read/write permissions. It effectively allows you to, for instance, ensure that consumer threads have only read access to areas of the virtual address space containing certain blobs of data, while producer threads have read-write access. This application hardening requires support from the compiler toolchain, operating system, and, if necessary, the hypervisor to work properly.
Finally, there's also Intel's Memory Protection Extensions (MPX), documented in Intel's developer manuals, that allow threads to define upper and lower bounds for memory accesses, and check they are not exceeded when working on buffers, which may help prevent buffer overflow and underflow attacks. This requires a lot of setup by the toolchain, introducing some overhead, and is expected to be used during app development rather than in production.
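MPX needs compiler-generated instrumentation to be useful, but the check itself is simple. Here's a conceptual Python sketch (not real MPX code) of what the hardware does: BNDMK records a buffer's bounds in a bounds register, and BNDCL/BNDCU trap any pointer that strays outside them.

```python
# Conceptual sketch of MPX's bounds checking: BNDMK records a
# buffer's lower/upper bounds, BNDCL/BNDCU raise a #BR exception
# if a pointer falls outside them.
class Bounds:
    def __init__(self, base, size):
        self.lower = base              # what BNDCL checks against
        self.upper = base + size - 1   # what BNDCU checks against

    def check(self, addr):
        if addr < self.lower or addr > self.upper:
            raise MemoryError("#BR: bound range exceeded")
        return addr

buf = Bounds(base=0x1000, size=64)
buf.check(0x1000)   # first byte: fine
buf.check(0x103F)   # last byte: fine
# buf.check(0x1040) would raise, like MPX's #BR exception
```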
Next, there's better support for timestamp counters in virtualization, providing a friendlier way for virtual machines to access these counters when migrating across platforms, shaving off a small overhead. It's documented, again, in volume 3a of the developer's manuals (section 24.6.5, among others). Diving deeper, there's the wonderfully CISCy CLFLUSHOPT instruction for flushing a cache line with lower overhead than plain CLFLUSH, and CLWB for writing a cache line back to RAM without invalidating it.
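Both of those instructions operate on one 64-byte cache line at a time, so code that flushes a buffer — persistent-memory libraries, typically — walks it in line-sized strides. A sketch of the address arithmetic, assuming the usual 64-byte x86 line:

```python
LINE = 64  # x86 cache line size in bytes

def cache_lines(addr, length):
    """Return the line-aligned addresses a CLWB/CLFLUSHOPT loop
    would visit to cover the buffer [addr, addr + length)."""
    start = addr & ~(LINE - 1)   # round down to a line boundary
    end = addr + length
    return list(range(start, end, LINE))

# A 100-byte buffer starting mid-line still touches whole lines:
print([hex(a) for a in cache_lines(0x1010, 100)])
```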
The Purley gang supports AVX-512 as found in the high-end Xeon Phi family. This opens up floating-point and integer calculations using 512-bit vectors.
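For a sense of scale: a 512-bit register holds sixteen single-precision or eight double-precision values, and a fused multiply-add counts as two floating-point operations. Back-of-the-envelope per-core peak throughput, assuming the two-FMA configuration Intel quotes for the top bins:

```python
VECTOR_BITS = 512
FMA_UNITS = 2     # top-bin Skylake-SP cores carry two AVX-512 FMA units
OPS_PER_FMA = 2   # a fused multiply-add is a multiply plus an add

def peak_flops_per_cycle(precision_bits):
    """Peak FLOPs per cycle per core for the given element width."""
    lanes = VECTOR_BITS // precision_bits
    return lanes * FMA_UNITS * OPS_PER_FMA

print(peak_flops_per_cycle(64))  # double precision
print(peak_flops_per_cycle(32))  # single precision
```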
This is a good time to note that the 48 lanes of PCIe 3.0 are split into three independent x16 controllers, and you can have one VMD domain per x16 PCIe. The Lewisburg C62x chipset that accompanies Purley has an integrated X722 Ethernet controller that can handle up to four 10Gbps ports, and features Intel's QuickAssist tech. It also offers up to 14 SATA 3 interfaces, up to 10 USB 3.0 as well as 14 USB 2.0 ports, and packs in TPM 2.0, vPro, AMT, Node Manager 4.0, NVM Express support, and RSTe. We're told the QuickAssist acceleration can churn through up to 100Gbps of IPsec and SSL-encrypted data, perform up to 100,000 2048-bit RSA key decryptions per second, 100Gbps of Deflate compression, and 150Gbps of AES-128 CBC ciphering in 4KB blocks. Your mileage may vary on these speeds but, in any case, it means servers handling encrypted network traffic and data can offload this work to the hardware, which is nice for everyone's security.
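Those QuickAssist numbers are easier to grasp per operation. Simple arithmetic on the quoted figures — the 1,500-byte packet size below is our hypothetical choice for scale, not Intel's:

```python
def blocks_per_second(throughput_gbps, block_bytes):
    """Operations per second implied by a bits-per-second figure."""
    return int(throughput_gbps * 1e9 // (block_bytes * 8))

# 150Gbps of AES-128 CBC in the quoted 4KB blocks:
print(blocks_per_second(150, 4096))
# 100Gbps of IPsec/SSL traffic, assuming 1,500-byte packets:
print(blocks_per_second(100, 1500))
```

In other words, millions of cipher blocks a second — the sort of rate you'd want off the main cores.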
The chipset includes not only Intel's usual Management Engine, but also the annoyingly titled Innovation Engine: this is a tiny computer within the server that can be used to monitor and remotely manage the machine. It is powered by a 412-DMIPS Intel Quark x86 CPU with 1.4MB of RAM and the usual ME interfaces plus a standard serial interface.
Crucially, system builders, not Intel, are expected to provide the IE's firmware, to customize their products. In other words, server box makers can bundle extra firmware that runs in the IE below any running operating systems and hypervisors. So you don't just have to worry about Intel's buggy and potentially insecure ME code running out of sight: there may be vendor-bundled software lurking, too.
We're told these specifications are not set in stone – so take with a pinch of salt
† Four ports total: ports 0 & 1 can run up to 10GbE, while ports 2 & 3 are limited to 1GbE
And for chip design nerds like your humble hack, here's a peek at the microarchitecture changes within the server Skylakes:
Again, Intel is keen to highlight the fat 1MB private L2 cache per core, the AVX-512 with one or two FMAs per core, a heftier branch predictor, and various other tweaks to accommodate demanding programs.
Finally, for comparison, Intel reckons its 28-core 165W Xeon Platinum 8176 scores 41 per cent better than a 2016 22-core 145W Broadwell E5-2699 v4 in the SPECINT_RATE_BASE2006 compute benchmark, 53 per cent better in SPECFP_RATE_BASE2006, 64 per cent in Stream Triad memory access tests, and 98 per cent in the Linpack round. The 28-core 205W Platinum 8180 does even better: 53 per cent in SPECINT_RATE_BASE2006, 64 per cent in SPECFP_RATE_BASE2006, also 64 per cent in Triad, and 125 per cent in Linpack.
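Raw speedup isn't the whole story, since the TDPs climbed too. Dividing the quoted gains by the power increase gives a rough performance-per-watt picture — arithmetic on the article's numbers only, ignoring platform power:

```python
def perf_per_watt_gain(speedup_pct, new_tdp, old_tdp=145):
    """Relative perf/watt versus the 145W Broadwell E5-2699 v4."""
    speedup = 1 + speedup_pct / 100
    return speedup / (new_tdp / old_tdp)

# Xeon Platinum 8176 (165W), 41 per cent SPECINT_RATE gain:
print(round(perf_per_watt_gain(41, 165), 2))
# Xeon Platinum 8180 (205W), 53 per cent SPECINT_RATE gain:
print(round(perf_per_watt_gain(53, 205), 2))
```

So on integer throughput, at least, the efficiency gain is more modest than the headline speedup.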
It's not too surprising to see that, by increasing the core count and opening up the power throttle, you get a faster family of chips, once you've tuned and optimized the design.
Intel is pitching this stuff not just at traditional data center applications but also supercomputing and AI, of course, and software-defined networking at telcos and other comms carriers. AT&T, for one, has been using these new processors since March in production.
Don't forget to check out Timothy Prickett Morgan's analysis of today's launch, along with a table summarizing the available options and dollar-per-clock comparisons, over on our sister site The Next Platform. We'll come back to the Xeons' architecture soon in our dive into AMD's Epyc and Ryzen designs. ®