TI throws DSPs at supercomputers

Seeing if they will stick, like GPU coprocessors


Nvidia had better watch out. Texas Instruments is not only its rival when it comes to making ARM processors that might end up in servers someday, but it is also repositioning its digital signal processors so they can be used as math coprocessors for standard x86 CPUs – and perhaps ARM processors one day.

Nvidia obviously has the pole position when it comes to offloading HPC work from CPUs to GPU coprocessors, thanks in no small part to the development and adoption of the CUDA programming environment that spans CPUs and GPUs. CUDA gives Nvidia an edge over GPUs from Advanced Micro Devices – at least for the moment – but as the history of the computing market has taught us, any advantage can be undermined, just as GPUs are eating into CPUs in hybrid clusters these days. If performing a floating point operation is cheaper on a DSP than it is on a GPU, then the DSP will win – as long as programming for DSPs is not radically more difficult than coding for CPUs and GPUs.

Trying to use DSPs to build supercomputers is not a new idea. Back at the SC92 supercomputing conference, the Swiss Federal Institute of Technology in Zurich was showing off a supercomputer called MUSIC, short for Multi-Signal-Processor System with Intelligent Communication (and yes, that abbreviation doesn't work in English particularly well).

In a paper presented at the conference, Swiss boffins described how they lashed together 60 DSPs and delivered 3.8 gigaflops of number-crunching performance for 800 watts on neural network learning and molecular dynamics code. This cluster ran five times faster than a Cray Y-MP and two times faster than an NEC SX-3, both of which were vector machines. Columbia University has been monkeying around with parallel DSP machines for a long time and also helped IBM develop its BlueGene family of massively parallel supers. BlueGene is, in essence, a parallel DSP machine that had its brains replaced with PowerPC engines.

At the SC11 event this month in Seattle, Texas Instruments launched its TMS320C66x family of multicore DSPs, adding support for the OpenMP API set to the DSP to make it easier to offload calculations from the CPU to the DSP. DSPs are notoriously hard to program, as GPUs used to be before CUDA and OpenCL came along. The TMS320C66x family of DSPs needs a much easier nickname if it is to become cool and talked about; something like Fourier would seem to be most appropriate, given the use of DSPs to do fast Fourier transforms.


Block diagram of TI's C66x digital signal processors

The C66x DSPs are based on an architecture that TI calls KeyStone, which allows for anywhere from one to eight DSP cores to be put on a single chip and to share cache memory, main memory controllers, and I/O controllers – just like multicore x86 and ARM processors do. The most recent DSP out of TI is called the C6678, and it is designed to scale to eight cores on a single chip, although only the four-core version is shipping at the moment. The DSP cores run at 1GHz or 1.25GHz, and with all eight of them humming at 1.25GHz, the C6678 delivers 160 gigaflops of single-precision floating point oomph. As with early GPUs, the chip's double-precision throughput is less than half of this, at 60 gigaflops. The C6678 has 32KB of L1 instruction cache and 32KB of L1 data cache per core and up to 8MB of shared L2 cache per DSP package. The chip has 12.8GB/sec of memory bandwidth into and out of the DSP and, here's the kicker, the chip only consumes 10 watts of juice.
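For those who like to check the sums, here is a rough back-of-the-envelope sketch in Python of what the quoted C6678 figures imply per core and per watt. The function names are made up for illustration, and the per-cycle numbers are derived from the peak ratings above rather than taken from any TI spec sheet.

```python
# Figures quoted above for the eight-core C6678 running at 1.25GHz.
CORES = 8
CLOCK_GHZ = 1.25
PEAK_SP_GFLOPS = 160.0
PEAK_DP_GFLOPS = 60.0
CHIP_WATTS = 10.0


def flops_per_cycle_per_core(peak_gflops, cores, clock_ghz):
    """Floating point operations per clock on each core implied by a peak rating."""
    return peak_gflops / (cores * clock_ghz)


def gflops_per_watt(peak_gflops, watts):
    """Peak gigaflops delivered for each watt consumed."""
    return peak_gflops / watts


if __name__ == "__main__":
    print("SP flops/cycle/core:",
          flops_per_cycle_per_core(PEAK_SP_GFLOPS, CORES, CLOCK_GHZ))
    print("DP flops/cycle/core:",
          flops_per_cycle_per_core(PEAK_DP_GFLOPS, CORES, CLOCK_GHZ))
    print("SP gigaflops/watt:", gflops_per_watt(PEAK_SP_GFLOPS, CHIP_WATTS))
    print("DP gigaflops/watt:", gflops_per_watt(PEAK_DP_GFLOPS, CHIP_WATTS))
```

On those numbers, each core works out to 16 single-precision or 6 double-precision operations per clock, and the chip to 16 gigaflops per watt at single precision. These are peak ratings, so real code will see less.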


Texas Instruments' quad-DSP PCI Express coprocessor card

The initial coprocessor board using the TI C66x DSPs is called the DSPC-8681 and it is made by Advantech. It puts four of these eight-core DSP chips (running at only 1GHz for some reason) on a single half-length PCI-Express 2.0 x8 card. The card has 1GB of DDR3 memory running at 1.33GHz and two Gigabit Ethernet ports. The DSPC-8681 delivers 512 gigaflops at single precision and 192 gigaflops at double precision. This card has a list price of around $1,100.
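The card's ratings follow from the chip's if you assume peak throughput scales linearly with chip count, core count, and clock speed. The Python sketch below makes that assumption explicit; the 16 and 6 operations per clock per core are not TI figures but are back-calculated from the C6678's 160 and 60 gigaflops ratings at 1.25GHz, and the function names are hypothetical.

```python
# Assumed per-core throughput, back-calculated from the C6678's quoted
# 160 (single) and 60 (double) gigaflops at eight cores and 1.25GHz.
SP_FLOPS_PER_CYCLE = 16
DP_FLOPS_PER_CYCLE = 6


def card_peak_gflops(chips, cores_per_chip, clock_ghz, flops_per_cycle):
    """Peak gigaflops for a card, assuming perfectly linear scaling."""
    return chips * cores_per_chip * clock_ghz * flops_per_cycle


def dollars_per_gflops(list_price, peak_gflops):
    """List price divided by peak gigaflops."""
    return list_price / peak_gflops


if __name__ == "__main__":
    # DSPC-8681 as described: four eight-core chips clocked at 1GHz.
    sp = card_peak_gflops(4, 8, 1.0, SP_FLOPS_PER_CYCLE)
    dp = card_peak_gflops(4, 8, 1.0, DP_FLOPS_PER_CYCLE)
    print("single precision gigaflops:", sp)
    print("double precision gigaflops:", dp)
    print("dollars per SP gigaflops:", round(dollars_per_gflops(1100, sp), 2))
```

Under that assumption the sums land on the quoted 512 and 192 gigaflops, and the $1,100 list price works out to a little over $2 per peak single-precision gigaflops. The same chips at 1.25GHz would be good for 640 gigaflops, which is what the lower clock leaves on the table.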

Kenneth Nesteroff, business development manager for multicore processors at TI's DSP Systems unit, tells El Reg that in the first quarter, Advantech will come out with a full-length PCI-Express card that will deliver around 1 teraflops of single precision performance at a cost of around $2,000 and within a 110 watt thermal envelope.

Longer term, TI plans to pack the performance of the DSPC-8681 card into a single chip package called the TMS320TCI6609 – and then plunk four of these onto a single PCI-Express 2.0 card. TI is not saying how it will get that 512 gigaflops of performance out of a single chip, but it stands to reason that there will be a process shrink, a DSP core count boost, and a faster clock speed. (Or TI could just be packaging up four C6678 DSPs into a single package.)

What TI is saying is that the future TCI6609 DSP will deliver that 512 gigaflops of single precision performance at 32 watts, so a four-chip PCI-Express card will deliver 2 teraflops of single-precision oomph in under 200 watts of total power, including an unknown amount of DDR3 main memory for the DSPs.
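A quick Python sketch of the arithmetic behind that claim, using only the figures quoted above. The function name is hypothetical, and the 200 watts is treated as a ceiling, since TI says only that the card comes in under it.

```python
def future_card_estimate(chips, gflops_per_chip, watts_per_chip, card_limit_watts):
    """Totals for a multi-chip card plus the power left over for everything else."""
    total_gflops = chips * gflops_per_chip
    chip_watts = chips * watts_per_chip
    return {
        "total_gflops": total_gflops,
        "chip_watts": chip_watts,
        # Upper bound on what memory, I/O, and losses could draw under the ceiling.
        "headroom_watts": card_limit_watts - chip_watts,
        "chip_gflops_per_watt": gflops_per_chip / watts_per_chip,
    }


if __name__ == "__main__":
    # Four TCI6609 chips at 512 gigaflops and 32 watts each, under a 200 watt ceiling.
    estimate = future_card_estimate(4, 512, 32, 200)
    for name, value in estimate.items():
        print(name, value)
```

The four chips account for 2,048 gigaflops and 128 watts, which leaves at most 72 watts for memory and the rest of the card. At 16 gigaflops per watt, the future chip is no more efficient on paper than the 10 watt C6678; it just packs more into one package.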

What would be even more interesting is if TI would put one of its quad-core Cortex-A8 ARM derivatives on a small form factor system board along with four of these C6678 DSPs, or if it doubled that up to a quad-core Cortex-A15 with maybe eight DSPs on the board. Slap a hybrid InfiniBand/Ethernet ConnectX-3 adapter from Mellanox Technologies on there and you could build a low-power supercomputer.

The hardware is the easy part, of course. The software stack would be a little more problematic. If TI is serious about using DSPs and ARMs in HPC, it is going to have to come up with something more than support for OpenMP and more like Nvidia's CUDA environment. ®
