Original URL: https://www.theregister.com/2009/10/26/tilera_third_gen_mesh_chips/

Tilera pushes to 100 cores with mesh processor

Super-efficient Linux engine

By Timothy Prickett Morgan

Posted in Channel, 26th October 2009 05:02 GMT

Upstart massively multicore chip designer Tilera has divulged the details on its upcoming third generation of Tile processors, which will sport from 16 to 100 cores on a single die.

That will give Tilera bragging rights for cramming the most general purpose computing cores onto a die, although graphics chip makers are already jamming hundreds of cores onto a die. For instance, the future Fermi graphics chips from Nvidia, which will appear in GeForce graphics cards as well as in Tesla graphics co-processors, have 512 cores on a single chip.

Intel is cooking up its own 64-core Larrabee graphics processor, which will reportedly be built from simplified x64 cores, each paired with a vector coprocessor, plus a shared, coherent memory structure that makes the L2 cache on each core accessible to all the other cores. Intel could be planning to ship Larrabee chips with 8, 16, 32, 48, and 64 cores using 45 nanometer high-k metal gate processes, and probably as many as 128 cores with a shrink to 32 nanometer processes.

There are a lot of differences between the Tile family of chips and these graphics processors, but the key one is that the Tile multicore processors run Linux directly (albeit a homegrown one) and are being designed not just for digital signal and networking processing, but to run that standard LAMP stack - Linux, Apache, MySQL, and PHP.

The architecture of the original Tile64 and second-generation TilePro36 and TilePro64 chips has not changed dramatically with the Tile-Gx series of chips, which are being previewed today even though they will not start shipping until the end of 2010. Tilera, says Bob Doud, director of marketing, is talking about its roadmaps with its partners, and rather than have the information just leak out all over the place, it is telling the world what it is up to more or less at the same time. (Which it doesn't have to do, being a privately held company and all).

When Tilera came out of stealth mode in August 2007, it had a 64-core RISC-like chip in an 8x8 grid with a sophisticated mesh interconnect, now called iMesh, that links all of the cores and their L2 caches into a single, coherent system using five different mesh interconnects all linked together by a switch implemented in silicon. This chip was implemented in a 90 nanometer process from Tilera's fab partner, Taiwan Semiconductor Manufacturing Corp, and it was put on a PCI-Express card for potential customers to play with to see how they might put it in network security, video streaming, and other products.

These original Tile64 processors implemented 32-bit cores (which I strongly suspect but do not know are based on a heavily streamlined and tweaked MIPS RISC design, since Tilera has not disclosed the instruction set of the core). Each Tile64 chip could run at 700 MHz or 866 MHz, and each had 8 KB L1 data and instruction caches and 64 KB of L2 cache that was meshed together into a shared 5 MB coherent cache that functioned like an L3 cache for all the cores. The Tile64 had four DDR2 main memory controllers, two Gigabit Ethernet ports, two PCI-Express controllers, two 10 Gb/sec XAUI interfaces, and two flexible I/O ports for supporting flash or disk drives.

With the second generation of Tilera chips, the TilePros, announced last September, the L1 caches were boosted to 16 KB, the L2/L3 cache was boosted to 5.6 MB, and the company put out a 36-core variant as well as the 64-core version.

These chips also included a new feature called "hash for home," probably code-named Amsterdam, that automagically spread data in the caches across the processors so a set of cores and caches didn't end up being hot spots. Tilera also added SIMD instructions to assist with video and audio handling plus some new memory manipulation instructions. The Tile chips were also given a memory striping feature, akin to RAID data striping on disk drives, and another feature to allow a core to have direct access to cache memory without having to go through main memory to get a snippet of data.
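To make the hot-spot idea concrete, here is a minimal sketch of how hashing a cache line's address can pick the tile whose L2 acts as that line's home. The article does not disclose Tilera's actual hash function, so the 64-byte line size, the XOR-folding step, and the names `home_tile` and `spread` are all illustrative assumptions, not Tilera's design.

```python
# Illustrative sketch only: the line size and the hash below are assumptions,
# not Tilera's published "hash for home" implementation.
LINE_BYTES = 64   # assumed cache-line size
NUM_TILES = 64    # the TilePro64 has 64 cores, each with its own L2


def home_tile(address: int, num_tiles: int = NUM_TILES) -> int:
    """Map a physical address to the tile whose L2 is the line's home."""
    line = address // LINE_BYTES
    # Fold higher address bits into the low bits so that regular access
    # patterns do not all land on the same few tiles (assumed mixing step).
    mixed = line ^ (line >> 6) ^ (line >> 12)
    return mixed % num_tiles


def spread(addresses) -> list:
    """Count how many cache lines end up homed on each tile."""
    counts = [0] * NUM_TILES
    for address in addresses:
        counts[home_tile(address)] += 1
    return counts


# Every cache line of a contiguous 256 KB buffer.
buffer_lines = range(0, 256 * 1024, LINE_BYTES)
counts = spread(buffer_lines)
print(min(counts), max(counts))
```

With this particular hash, the buffer's lines land evenly across the tiles instead of piling up on one cache, which is the behavior the feature is described as delivering.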

Even though the clock speed didn't change for the TilePro36 and TilePro64 chips, the performance boost was somewhere between 1.5 and 2.5 times with only a 5 per cent increase in power consumption. And this was all accomplished by staying at the 90 nanometer processes, which are basically dirt cheap. The TilePro64 consumed 19 to 23 watts running real workloads, and the TilePro36, which only clocked at 500 MHz and which had only 3.2 MB of L2/L3 cache, ranged from 10 to 16 watts under load.

The proverbial next level

Now, after getting some design wins with network intrusion appliance vendors, streaming media and video conferencing suppliers, and wireless equipment suppliers, Doud says Tilera is ready to take it to the next level. And with the Tile-Gx series of chips, the company is skipping from TSMC's 90 nanometer bulk processes to its cutting-edge 40 nanometer tech. The massive shrink is allowing Tilera to crank up the clocks as well as add more features to the chip.

(FYI: Last year at this time, Tilera said that it was working on a 120-core Tile chip that was due in late 2008 or early 2009, but this clearly did not happen).

The Tile-Gx family of chips is designed to offer clock speeds up to 1.5 GHz, and the 100-core version of the chip is expected to offer about four times the performance of the TilePro64 and somewhere between seven and eight times the performance of the original Tile64 chip that debuted two years ago. That is well above and beyond the Moore's Law curve, provided your workload can scale across lots of cores.
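As a rough sanity check on those ratios, the sketch below compares them with a doubling-every-two-years curve. Two things here are assumptions rather than anything Tilera said: Moore's Law strictly describes transistor counts and is used only as a loose performance yardstick, and the 3.5-year span (August 2007 to roughly the first quarter of 2011) is an approximation.

```python
def moores_law_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Growth factor for something that doubles every doubling_period_years."""
    return 2.0 ** (years / doubling_period_years)


# Assumed span: Tile64 debut (August 2007) to Tile-Gx100 (about Q1 2011).
YEARS_ELAPSED = 3.5
curve = moores_law_factor(YEARS_ELAPSED)

# Figures quoted in the article.
CLAIMED_VS_TILE64 = (7.0, 8.0)     # Tile-Gx100 versus the original Tile64
CLAIMED_VS_TILEPRO64 = 4.0         # Tile-Gx100 versus the TilePro64
TILEPRO_VS_TILE64 = (1.5, 2.5)     # TilePro gain over the Tile64

# Chaining the two generational gains should bracket the 7x to 8x claim.
implied = tuple(CLAIMED_VS_TILEPRO64 * gain for gain in TILEPRO_VS_TILE64)

print(f"Doubling curve over {YEARS_ELAPSED} years: {curve:.1f}x")
print(f"Implied gain over Tile64: {implied[0]:.0f}x to {implied[1]:.0f}x")
```

Under those assumptions the doubling curve comes to a bit over three times, less than half of the seven to eight times claimed, and the chained generational gains of six to ten times bracket that claim.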

"We believe that we will deliver the highest performance processors in the world," says Doud, matter-of-factly. And he says that the Tile-Gx100 100-core part will best an Intel Nehalem-EX - that is the forthcoming 8-core, 16-thread monster chip - by a factor of four running Linux infrastructure workloads. Tilera also plans to demonstrate next year that its Tile-Gx chips can offer four times the bang of Cavium Networks' Octeon II processor (a 32-core MIPS-based chip which is used for deep packet inspection), and 30 times the oomph of the multicore DSPs from Texas Instruments (as gauged by its performance in doing audio encoding for wireless networks).

With the Tile-Gx chips, Tilera is making the shift to 64 bits, allowing for 64-bit data and memory addressing. And on the 100-core version of the chip, which sports four on-chip DDR3 main memory controllers, a single chip is able to address a whopping 1 TB of main memory at speeds running up as high as 2.13 GHz. That is a lot of memory for a single socket, and even without any floating point math units on the chip, there are plenty of workloads where this chip is going to get traction with these kinds of numbers.

Like at Yahoo and Google, just to name two. And anyone in life sciences, where integer performance and parallelism are more important than floating point. (Having said that, it is interesting to ponder the possibility of pairing a Tile-Gx chip with a graphics co-processor for floating point work).

Each core on the Tile-Gx chips has 32 KB L1 data and instruction caches and 256 KB of L2 cache; those L2 caches are turned into a 26 MB virtual L3 cache. The Tile-Gx chips also have additional SIMD instructions that make use of a four multiplier-accumulator (MAC) per cycle unit that can deliver 600 billion MACs per second, which Tilera says is 12 times the throughput of the fastest digital signal processor on the market today.
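The 600 billion figure falls straight out of the numbers in this article, assuming it refers to the 100-core part running flat out at the top 1.5 GHz clock with every core retiring four MACs per cycle. A minimal sketch of the arithmetic (the variable names are mine, not Tilera's):

```python
# Peak multiply-accumulate throughput, assuming the 100-core part at its
# top clock with all four MAC units on every core busy on every cycle.
CORES = 100
MACS_PER_CYCLE_PER_CORE = 4
CLOCK_HZ = 1.5e9

peak_macs_per_sec = CORES * MACS_PER_CYCLE_PER_CORE * CLOCK_HZ

# Tilera's "12 times" claim implies this rate for the fastest rival DSP.
implied_dsp_macs_per_sec = peak_macs_per_sec / 12

print(f"Tile-Gx100 peak: {peak_macs_per_sec / 1e9:.0f} billion MACs/sec")
print(f"Implied fastest DSP: {implied_dsp_macs_per_sec / 1e9:.0f} billion MACs/sec")
```

This is a theoretical peak; real signal processing code will land somewhere below it.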

The chips also sport two MiCA engines, short for Multistream iMesh Crypto Accelerator and able to deliver 40 Gb/sec of bandwidth on cryptographic work and 20 Gb/sec on compression and decompression. The chip also includes a packet processing accelerator called mPIPE (short for multicore programmable intelligent packet engine) that sits between the cores and the on-chip network interfaces and does load balancing between them.

The Tile-Gx100 chip is the most complex chip, with a 10x10 grid of cores all linked by the iMesh. Memory controllers are along the top and bottom edges (two on top, and two on the bottom). The MiCA crypto units are in the left-hand corners, and three PCI-Express 2.0 I/O processors (two eight-lane and one four-lane) and various I/O controllers are there, too. (With a total of 80 Gb/sec of PCI-Express I/O bandwidth).
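The 80 Gb/sec figure is consistent with the lane count, assuming it means usable bandwidth in one direction. PCI-Express 2.0 signals at 5 GT/sec per lane and uses 8b/10b encoding, so each lane carries 4 Gb/sec of data each way. A quick sketch of the sum:

```python
# PCI-Express 2.0: 5 GT/sec per lane with 8b/10b encoding, so 8 of every
# 10 bits on the wire are payload.
PCIE2_GT_PER_SEC = 5
PAYLOAD_BITS, WIRE_BITS = 8, 10

# Lane widths of the three controllers described in the article.
CONTROLLER_LANES = [8, 8, 4]

gbps_per_lane = PCIE2_GT_PER_SEC * PAYLOAD_BITS / WIRE_BITS
total_lanes = sum(CONTROLLER_LANES)
total_gbps = total_lanes * gbps_per_lane  # per direction (assumed)

print(f"{total_lanes} lanes x {gbps_per_lane:.0f} Gb/sec = {total_gbps:.0f} Gb/sec")
```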

The right-hand side of the chip has eight XAUI ports that can be configured as a total of 32 Gigabit Ethernet ports or eight 10 Gigabit Ethernet ports. Because everything needed in the system is on the Tile chip, there is no northbridge or southbridge chipset. Tilera plans to run the Tile-Gx100 at between 1.25 GHz and 1.5 GHz. But you are going to have to wait until the first quarter of 2011 to get your hands on one and you had better be prepared to shell out around $1,000 a pop for the chip in low volumes.

The Tile-Gx64 is a similar chip, with an 8x8 core array and the same PCI-Express and flexible I/O options, but it has only eight network I/O ports. The cache is also cut back to 20 MB, and iMesh bandwidth is cut in half to 100 Tb/sec. The Tile-Gx64 will follow the Tile-Gx100 to market "a few months later," according to Doud.

The Tile-Gx36 is the first chip of the new generation of Tilera processors that will come to market, in the fourth quarter of 2010. Doud says that customers were keen on the 36-core variant first, so the company rejiggered the roadmap, which originally had the Tile-Gx100 coming out first. The Tile-Gx36 has a 6x6 core grid at the center of the chip. It has only one MiCA unit, only four configurable Ethernet ports, and only two memory controllers. This variant will be available in 1 GHz, 1.25 GHz, and 1.5 GHz clock speeds with 12 MB of L2/L3 cache. Pricing is expected to be under $400 a chip in low unit volumes, with discounts for higher volumes.

The Tile-Gx16, with a 4x4 grid of cores, has a geared down MiCA unit, two memory controllers, and three configurable Ethernet ports on the chip; the L2/L3 cache is cut back to 5.2 MB. It will be available at the end of 2010.

Depending on the configuration, the Tile-Gx family of chips sports a power envelope of between 10 and 55 watts, which is perhaps going to make it a big seller.

The Tile-Gx chips will run the Linux 2.6.26 kernel and add-on components that make it an operating system. Apache, PHP, and MySQL are being ported to the chips, and the programming tools will include the latest GCC compiler set. (Three years ago, Tilera had licensed SGI's MIPS-based C/C++ compilers for the Tile chips, which is why I think Tilera has also licensed some MIPS intellectual property to create its chip design, but the company has not discussed this.) Next year, as the third-generation chips are being ramped to market, the company will be offering a port of Java to its homegrown Linux and chip architecture as well. ®