TI throws DSPs at supercomputers

Seeing if they will stick, like GPU coprocessors

Nvidia had better watch out. Texas Instruments is not only its rival when it comes to making ARM processors that might end up in servers someday, but it is also repositioning its digital signal processors so they can be used as math coprocessors for standard x86 CPUs – and perhaps ARM processors one day.

Nvidia obviously has the pole position when it comes to offloading HPC work from CPUs to GPU coprocessors, thanks in no small part to the development and adoption of the CUDA programming environment that spans CPUs and GPUs. CUDA gives Nvidia an edge over GPUs from Advanced Micro Devices – at least for the moment – but as the history of the computing market has taught us, any advantage can be undermined, just as GPUs are eating into CPUs in hybrid clusters these days. If performing a floating point operation is cheaper on a DSP than it is on a GPU, then the DSP will win – as long as programming for DSPs is not radically more difficult than coding for CPUs and GPUs.

Trying to use DSPs to build supercomputers is not a new idea. Back at the SC92 supercomputing conference, the Swiss Federal Institute of Technology in Zurich was showing off a supercomputer called MUSIC, short for Multi-Signal-Processor System with Intelligent Communication (and yes, that abbreviation doesn't work in English particularly well).

According to a paper presented at the conference, the Swiss boffins lashed together 60 DSPs and delivered 3.8 gigaflops of number-crunching performance in 800 watts on neural network learning and molecular dynamics code. This cluster ran five times faster than a Cray Y-MP and two times faster than an NEC SX-3, both of which were vector machines. Columbia University has been monkeying around with parallel DSP machines for a long time and also helped IBM develop its BlueGene family of massively parallel supers. BlueGene is, in essence, a parallel DSP machine that had its brains replaced with PowerPC engines.

At the SC11 event this month in Seattle, Texas Instruments launched its TMS320C66x family of multicore DSPs, adding support for the OpenMP API set to the DSP to make it easier to offload calculations from the CPU to the DSP. DSPs are notoriously hard to program, as GPUs used to be before CUDA and OpenCL came along. The TMS320C66x family of DSPs needs a much easier nickname if it is to become cool and talked about; something like Fourier would seem to be most appropriate, given the use of DSPs to do fast Fourier transforms.

Block diagram of TI's C66x digital signal processors

The C66x DSPs are based on an architecture that TI calls KeyStone, which allows for anywhere from one to eight DSP cores to be put on a single chip and to share cache memory, main memory controllers, and I/O controllers – just like multicore x86 and ARM processors do. The most recent DSP out of TI is called the C6678, and it is designed to scale to eight cores on a single chip, although only the four-core version is shipping at the moment. The DSP cores run at 1GHz or 1.25GHz and with all eight of them humming at 1.25GHz, the C6678 delivers 160 gigaflops of single-precision floating point oomph. Like early GPUs, the amount of double-precision math that the DSP chip can do is less than half of this, at 60 gigaflops. The C6678 has 32KB of L1 instruction cache and 32KB of L1 data cache per core and up to 8MB of shared L2 cache per DSP package. The chip has 12.8GB/sec of memory bandwidth into and out of the DSP and, here's the kicker, the chip only consumes 10 watts of juice.
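
For anyone who wants to check the arithmetic, here is a minimal Python sketch that uses only the peak figures quoted above. The flops-per-clock numbers it derives are a back-of-the-envelope inference from those ratings, not something TI is quoted as saying, and the function name is made up for illustration.

```python
# Peak figures quoted in the article for an eight-core C6678 at 1.25GHz.
CORES = 8
CLOCK_GHZ = 1.25
SP_GFLOPS = 160.0
DP_GFLOPS = 60.0
CHIP_WATTS = 10.0


def flops_per_clock_per_core(peak_gflops, cores, clock_ghz):
    """Flops each core must retire per clock tick to reach the quoted peak."""
    return peak_gflops / (cores * clock_ghz)


sp_per_clock = flops_per_clock_per_core(SP_GFLOPS, CORES, CLOCK_GHZ)
dp_per_clock = flops_per_clock_per_core(DP_GFLOPS, CORES, CLOCK_GHZ)
dp_to_sp_ratio = DP_GFLOPS / SP_GFLOPS
sp_gflops_per_watt = SP_GFLOPS / CHIP_WATTS

print(f"SP flops per clock per core: {sp_per_clock:g}")
print(f"DP flops per clock per core: {dp_per_clock:g}")
print(f"DP-to-SP ratio:              {dp_to_sp_ratio:g}")
print(f"SP gigaflops per watt:       {sp_gflops_per_watt:g}")
```

On the quoted numbers, that works out to 16 single-precision and 6 double-precision flops per clock per core, a double-to-single ratio of 0.375, and 16 gigaflops per watt at single precision.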

Texas Instruments' quad-DSP PCI Express coprocessor card

The initial coprocessor board using the TI C66x DSPs is called the DSPC-8681 and it is made by Advantech. It puts four of these eight-core DSP chips (running at only 1GHz for some reason) on a single half-length PCI-Express 2.0 x8 card. The card has 1GB of DDR3 memory running at 1.33GHz and two Gigabit Ethernet ports. The DSPC-8681 delivers 512 gigaflops at single precision and 192 gigaflops at double precision. This card has a list price of around $1,100.
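
The card's ratings line up with the chip's if peak performance is assumed to scale linearly with clock speed, which is an assumption on our part rather than a TI statement. A rough Python sketch of that check, plus the list price per gigaflops, follows; the helper name is hypothetical.

```python
# Chip peaks quoted at 1.25GHz; the DSPC-8681 runs four chips at 1GHz.
CHIP_SP_GFLOPS = 160.0
CHIP_DP_GFLOPS = 60.0
RATED_CLOCK_GHZ = 1.25
CARD_CLOCK_GHZ = 1.0
CHIPS_ON_CARD = 4
LIST_PRICE_USD = 1100.0  # "around $1,100" per the article


def card_peak_gflops(chip_gflops, rated_clock_ghz, card_clock_ghz, chips):
    """Scale a chip's peak linearly with clock, then multiply by chip count."""
    return chip_gflops * card_clock_ghz / rated_clock_ghz * chips


card_sp = card_peak_gflops(CHIP_SP_GFLOPS, RATED_CLOCK_GHZ, CARD_CLOCK_GHZ, CHIPS_ON_CARD)
card_dp = card_peak_gflops(CHIP_DP_GFLOPS, RATED_CLOCK_GHZ, CARD_CLOCK_GHZ, CHIPS_ON_CARD)
usd_per_sp_gflops = LIST_PRICE_USD / card_sp
usd_per_dp_gflops = LIST_PRICE_USD / card_dp

print(f"Card SP peak: {card_sp:g} gigaflops")
print(f"Card DP peak: {card_dp:g} gigaflops")
print(f"List price per SP gigaflops: ${usd_per_sp_gflops:.2f}")
print(f"List price per DP gigaflops: ${usd_per_dp_gflops:.2f}")
```

The scaled figures match the quoted 512 and 192 gigaflops, and the list price comes to roughly $2.15 per single-precision gigaflops and $5.73 per double-precision gigaflops.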

Kenneth Nesteroff, business development manager for multicore processors at TI's DSP Systems unit, tells El Reg that in the first quarter, Advantech will come out with a full-length PCI-Express card that will deliver around 1 teraflops of single precision performance at a cost of around $2,000 and within a 110 watt thermal envelope.

Longer term, TI plans to pack the performance of the DSPC-8681 card into a single chip package called the TMS320TCI6609 – and then plunk four of these onto a single PCI-Express 2.0 card. TI is not saying how it will get that 512 gigaflops of performance out of a single chip, but it stands to reason that there will be a process shrink, a DSP core count boost, and a faster clock speed. (Or TI could just be packaging up four C6678 DSPs into a single package.)

What TI is saying is that the future TCI6609 DSP will deliver that 512 gigaflops of single precision performance at 32 watts, so a four-chip PCI-Express card will deliver 2 teraflops of single-precision oomph in under 200 watts of total power, including an unknown amount of DDR3 main memory for the DSPs.
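
To see how much room that leaves on the card, here is a small Python sketch of the power budget. It treats the "under 200 watts" figure as a hard 200 watt ceiling, which is our simplification, and the function and key names are invented for illustration.

```python
# Figures quoted for the future TCI6609 and the four-chip card built on it.
CHIP_SP_GFLOPS = 512.0
CHIP_WATTS = 32.0
CHIPS_PER_CARD = 4
CARD_CEILING_WATTS = 200.0  # assumption: "under 200 watts" taken as a 200W cap


def card_budget(chip_gflops, chip_watts, chips, ceiling_watts):
    """Total peak, DSP power draw, and what is left for memory and the rest."""
    total_gflops = chip_gflops * chips
    dsp_watts = chip_watts * chips
    return {
        "sp_gflops": total_gflops,
        "dsp_watts": dsp_watts,
        "headroom_watts": ceiling_watts - dsp_watts,
        "chip_gflops_per_watt": chip_gflops / chip_watts,
        "card_gflops_per_watt_floor": total_gflops / ceiling_watts,
    }


budget = card_budget(CHIP_SP_GFLOPS, CHIP_WATTS, CHIPS_PER_CARD, CARD_CEILING_WATTS)
for name, value in budget.items():
    print(f"{name}: {value:g}")
```

The four DSPs account for 128 watts of the budget, leaving up to 72 watts for DDR3 memory and everything else on the board. At chip level, 512 gigaflops in 32 watts is 16 gigaflops per watt, the same ratio as the C6678's quoted 160 gigaflops in 10 watts, so the claimed gain is in density rather than efficiency.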

What would be even more interesting is if TI would put one of its quad-core Cortex-A8 ARM derivatives on a small form factor system board along with four of these C6678 DSPs, or if it doubled that up to a quad-core Cortex-A15 with maybe eight DSPs on the board. Slap a hybrid InfiniBand/Ethernet ConnectX-3 adapter from Mellanox Technologies on there and you could build a low-power supercomputer.

The hardware is the easy part, of course. The software stack would be a little more problematic. If TI is serious about using DSPs and ARMs in HPC, it is going to have to come up with something more than support for OpenMP, something more like Nvidia's CUDA environment. ®
