Tilera gooses 64-core mesh processor
Wonders what it's good for
A year ago, at the annual Hot Chips conference for chip designers in Silicon Valley, a company called Tilera came out of stealth mode and launched its 64-core Tile64 mesh processor. The Tile64 chip takes multi-core to an extreme, and an on-chip iMesh network allows a grid of cores and memory controllers to compete with X64 or DSP processors doing a variety of work.
This week, Tilera is putting its second-generation chips into the field and is getting some traction among various IT suppliers, who want to put the Tile64 processors and their homegrown Linux environment to work.
The Tile64 chip announced last year and the TilePro64 and TilePro36 kickers announced this week are not based on any existing processor cores and their associated instruction sets. The chips embody a new core that was designed from the ground up to take advantage of mesh networking on each core. This creates a large pool of compute resources that can be dedicated to running a single instance of Linux and its applications or carved up on the fly into virtual Linux images, each isolated from other virtualized slices.
Before getting into the changes in the new TilePro chip, let's review the first-generation device. The Tile64 core is a 32-bit design (with a 16-bit mode) that employs RISC and VLIW concepts. It can execute three instructions per clock cycle, and the chip's speed ranges from 600 MHz to 1 GHz. Each core has 64 KB of L2 cache as well as L1 data and instruction caches of 8 KB each.
The switch that is at the heart of the Tile64 processor actually implements five different mesh networks - one each for memory access, streaming packet transfers, user data network, cache misses, and interprocess communications. Wrapped around the cores are four DDR2 main memory controllers, two Gigabit Ethernet ports, two PCI Express controllers, two 10 Gb/sec XAUI interfaces, and two flexible I/O interfaces to support peripherals such as compact flash memory or disk drives.
The whole shebang is implemented in a 90 nanometer process and made by Taiwan Semiconductor Manufacturing.
The Hardwall
The Tile64 design is clever in a number of ways, which means it might see some use in IT devices near you someday soon. First, it does not use a bus architecture to talk to peripherals or to have processors and cache memory talk to each other. The iMesh network allows point-to-point communication between the cores and does away with bus architectures, which require high clock speeds and lots of energy to deliver bandwidth and scale.
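To make the point-to-point idea concrete, here is a minimal sketch of how far a message travels across a grid of cores. It assumes the 64 cores sit in an 8 x 8 grid and that a message moves along its row first and then along its column; the article states neither the layout nor the routing scheme, and the function names are invented for illustration.

```python
# Illustrative only: models the chip as a square grid of numbered cores.
# GRID_SIDE = 8 and row-then-column routing are assumptions, not article facts.
GRID_SIDE = 8


def tile_coords(core_id, side=GRID_SIDE):
    """Map a core number (0..side*side-1) to its (row, col) grid position."""
    return divmod(core_id, side)


def mesh_hops(src, dst, side=GRID_SIDE):
    """Count the hops between two cores when travelling along grid links."""
    (r1, c1), (r2, c2) = tile_coords(src, side), tile_coords(dst, side)
    return abs(r1 - r2) + abs(c1 - c2)


def xy_route(src, dst, side=GRID_SIDE):
    """List the cores a message visits: along the row first, then the column."""
    r, c = tile_coords(src, side)
    r2, c2 = tile_coords(dst, side)
    path = [src]
    while c != c2:
        c += 1 if c2 > c else -1
        path.append(r * side + c)
    while r != r2:
        r += 1 if r2 > r else -1
        path.append(r * side + c)
    return path


if __name__ == "__main__":
    print(mesh_hops(0, 63))   # opposite corners of the assumed 8 x 8 grid
    print(xy_route(0, 9))     # neighbouring row and column
```

Under these assumptions, the worst case is a corner-to-corner trip, and two messages whose routes share no link can be in flight at the same time, which a single shared bus cannot offer.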
The Tile64 chip also uses the mesh network so that the L2 caches on each core can be used like a giant L3 cache in a traditional design. Basically, any core can look into the L2 cache of any other core on the chip and treat the lot like a giant 5 MB L3 cache. While each core on the Tile64 chip can run its own complete instance of Linux, the cache coherency engendered in the mesh network means that a collection of cores can be set up to run an SMP configuration of Linux, too.
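As a quick check on the cache figures, the arithmetic can be sketched as follows. It assumes the 64 KB L2 and the two 8 KB L1 sizes quoted earlier apply to each of the 64 cores; the constant and function names are made up for this example.

```python
# Assumption: the quoted cache sizes are per core, across 64 cores.
CORES = 64
L2_KB = 64
L1_DATA_KB = 8
L1_INSTR_KB = 8


def l2_total_mb(cores=CORES, l2_kb=L2_KB):
    """Combined L2 capacity across all cores, in MB."""
    return cores * l2_kb / 1024


def all_cache_total_mb(cores=CORES):
    """Combined L1 data, L1 instruction, and L2 capacity, in MB."""
    return cores * (L2_KB + L1_DATA_KB + L1_INSTR_KB) / 1024


if __name__ == "__main__":
    print(l2_total_mb())         # L2 only
    print(all_cache_total_mb())  # L1 + L2 together
```

On these assumptions the L2 caches alone add up to 4 MB, and the 5 MB figure matches the total once the L1 caches are counted as well.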
Because the iMesh network controls all communication into and out of a core, a microcode feature called Multicore Hardwall Technology can partition a Tile64 into multiple virtual machines, allowing different instances of Linux and their applications to run on the chip and be isolated from each other. The Tile64 chip supports a variant of the Linux 2.6 kernel and has a tweaked version of the open source GNU C compiler and the open source Eclipse integrated development environment.
COMMENTS
Transforming the TILE64 into a Kick-Ass Parallel Machine
The CPU is soon to be a dinosaur, an ancient technology that somehow escaped from a museum of the last century. Tilera should design its own processor core. The CPU, a sequential core, has no business doing anything in a parallel processor. It does not make sense. You're either parallel or sequential, take your pick. What is needed is a pure MIMD vector processor and dev tools based on a true parallel programming model, as opposed to the old multithreaded sequential crap. Tilera should get in bed with Nvidia and Nvidia should change its SIMD GPU into a pure MIMD vector processor. This way, they can have a homogeneous multicore processor that can handle anything, not just graphics. Tilera's iMesh technology is just what is needed to solve the cache coherency problem and will serve as an ideal vehicle for effective hardware-enabled load balancing. Heck, Nvidia should immediately acquire Tilera by making them an offer they can't refuse. Nvidia has the chance of a lifetime to dominate the computer industry for decades to come.
The writing is on the wall. The CPU is dead. Good riddance. MIMD vector processing is the name of the new multicore game.
Transforming the TILE64 into a Kick-Ass Parallel Machine:
http://rebelscience.blogspot.com/2008/08/transforming-tile64-into-kick-ass.html
FPUs
The lack of them just means that the OS has to emulate one, which is not too difficult since floating point values can easily be expressed as two integer values, the mantissa and the location of the decimal (provided I haven't confused the terminology). Not as efficient, but definitely doable. But honestly, with 64 cores, I doubt it's going to matter too much.
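For what it's worth, the two-integer idea can be sketched roughly as below. This is a toy model rather than how any real emulation library works: a value is stored as a pair (mantissa, exponent) meaning mantissa * 2**exponent, the 24-bit mantissa width is an arbitrary choice, low bits are truncated rather than rounded, and all the names are invented.

```python
# Toy software floating point: value = mantissa * 2**exponent.
# MANTISSA_BITS is an assumed precision; real formats also handle rounding,
# infinities, and NaNs, which this sketch ignores.
MANTISSA_BITS = 24


def normalize(mantissa, exponent):
    """Trim the mantissa to MANTISSA_BITS, raising the exponent to compensate."""
    sign = -1 if mantissa < 0 else 1
    mantissa = abs(mantissa)
    excess = mantissa.bit_length() - MANTISSA_BITS
    if excess > 0:
        mantissa >>= excess  # drops low bits (truncation, no rounding)
        exponent += excess
    return sign * mantissa, exponent


def soft_mul(a, b):
    """Multiply: integer-multiply the mantissas, add the exponents."""
    return normalize(a[0] * b[0], a[1] + b[1])


def soft_add(a, b):
    """Add: shift both mantissas to the smaller exponent, then integer-add."""
    (ma, ea), (mb, eb) = a, b
    e = min(ea, eb)
    return normalize((ma << (ea - e)) + (mb << (eb - e)), e)


def to_float(x):
    """For checking results only; this uses the host's real floating point."""
    return x[0] * 2.0 ** x[1]


if __name__ == "__main__":
    one_and_a_half = (3, -1)         # 3 * 2**-1 = 1.5
    two_and_a_quarter = (9, -2)      # 9 * 2**-2 = 2.25
    print(to_float(soft_mul(one_and_a_half, two_and_a_quarter)))
    print(to_float(soft_add(one_and_a_half, two_and_a_quarter)))
```

Every arithmetic step in soft_mul and soft_add uses only integer multiply, add, shift, and compare, which is the point of the comment above: an integer-only core can do this, just with several instructions per floating point operation.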
A couple of points
If they've got a modified GNU compiler running, then it's safe to say either OpenMP or MPI will at some point be usable on such chips.
Secondly, I note that they do mention DSP chips -- they'll be able to handle floating point. DSPs seem to be the way HPC's going anyway (large matrix operations on CPUs run into processor cache performance limitations), which is why the likes of CUDA and CTM exist.
FPU
I agree that missing an FPU is a dent in their usability.
Why not just replace a couple of their cores with FPUs? With the cache architecture they have, it would be a simple matter to run SIMD on a bunch of data in the cache, offloaded and separate to the processor core, queued and shared between cores.
I actually think this would be great for standard CPUs as well. Consider a Phenom (for example) with an additional FPU, so a core can just ensure the data is available in L3, and then queue a few SIMD or MIMD routines for the FPU to run. It then gets told when the job is complete, and can grab the data and do what it wants. In fact AMD is probably in the best situation to do this, as it could use designs from ATI's gfx chips to do it.
Oh well. The Tile[Pro] chips are still cool, and I want one (or more) to play with. Bet they'll be out of my price range though...
Need development environments
I'd love to see something like this aimed at a reasonable price for developers. Every time a technology like this arrives, it's always in the stratosphere for price.
