The Register® — Biting the hand that feeds IT

TI throws DSPs at supercomputers

Seeing if they will stick, like GPU coprocessors

Nvidia had better watch out. Texas Instruments is not only its rival when it comes to making ARM processors that might end up in servers someday, but it is also repositioning its digital signal processors so they can be used as math coprocessors for standard x86 CPUs – and perhaps ARM processors one day.

Nvidia obviously has the pole position when it comes to offloading HPC work from CPUs to GPU coprocessors, thanks in no small part to the development and adoption of the CUDA programming environment that spans CPUs and GPUs. CUDA gives Nvidia an edge over GPUs from Advanced Micro Devices – at least for the moment – but as the history of the computing market has taught us, any advantage can be undermined, just as GPUs are eating into CPUs in hybrid clusters these days. If performing a floating point operation is cheaper on a DSP than it is on a GPU, then the DSP will win – as long as programming for DSPs is not radically more difficult than coding for CPUs and GPUs.

Trying to use DSPs to build supercomputers is not a new idea. Back at the SC92 supercomputing conference, the Swiss Federal Institute of Technology in Zurich was showing off a supercomputer called MUSIC, short for Multi-Signal-Processor System with Intelligent Communication (and yes, that abbreviation doesn't work in English particularly well).

According to a paper presented at the conference, Swiss boffins lashed together 60 DSPs and delivered 3.8 gigaflops of number-crunching performance in 800 watts on neural network learning and molecular dynamics code. This cluster ran five times faster than a Cray Y-MP and two times faster than an NEC SX-3, both of which were vector machines. Columbia University has been monkeying around with parallel DSP machines for a long time and also helped IBM develop its BlueGene family of massively parallel supers. BlueGene is, in essence, a parallel DSP machine that had its brains replaced with PowerPC engines.

At the SC11 event this month in Seattle, Texas Instruments launched its TMS320C66x family of multicore DSPs, adding support for the OpenMP API set to the DSP to make it easier to offload calculations from the CPU to the DSP. DSPs are notoriously hard to program, as GPUs used to be before CUDA and OpenCL came along. The TMS320C66x family of DSPs needs a much easier nickname if it is to become cool and talked about; something like Fourier would seem to be most appropriate, given the use of DSPs to do fast Fourier transforms.

Block diagram of TI's C66x digital signal processors

The C66x DSPs are based on an architecture that TI calls KeyStone, which allows for anywhere from one to eight DSP cores to be put on a single chip and to share cache memory, main memory controllers, and I/O controllers – just like multicore x86 and ARM processors do. The most recent DSP out of TI is called the C6678, and it is designed to scale to eight cores on a single chip, although only the four-core version is shipping at the moment. The DSP cores run at 1GHz or 1.25GHz, and with all eight of them humming at 1.25GHz, the C6678 delivers 160 gigaflops of single-precision floating point oomph. As with early GPUs, the amount of double-precision math that the DSP chip can do is less than half of this, at 60 gigaflops. The C6678 has 32KB of L1 instruction cache and 32KB of L1 data cache per core and up to 8MB of shared L2 cache per DSP package. The chip has 12.8GB/sec of memory bandwidth into and out of the DSP and, here's the kicker, the chip only consumes 10 watts of juice.
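The peak figures quoted above are the usual product of core count, clock speed, and floating point operations per core per cycle. Here is a rough Python sketch of that sum. The per-cycle rates are an inference from the quoted 160 and 60 gigaflops, not numbers TI supplied for this story, and the function name is made up for illustration.

```python
def peak_gigaflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak: cores x clock (GHz) x flops per core per cycle."""
    return cores * clock_ghz * flops_per_cycle

# Per-core, per-cycle rates implied by the quoted figures (an inference,
# not a TI spec): 160 gigaflops SP and 60 gigaflops DP from eight cores
# running at 1.25GHz.
SP_FLOPS_PER_CYCLE = 160 / (8 * 1.25)
DP_FLOPS_PER_CYCLE = 60 / (8 * 1.25)

if __name__ == "__main__":
    # Scale the same eight-core chip to both quoted clock speeds.
    for clock_ghz in (1.0, 1.25):
        sp = peak_gigaflops(8, clock_ghz, SP_FLOPS_PER_CYCLE)
        dp = peak_gigaflops(8, clock_ghz, DP_FLOPS_PER_CYCLE)
        print(f"{clock_ghz}GHz: {sp:.0f} SP, {dp:.0f} DP gigaflops")
```

On those assumptions, an eight-core chip clocked at 1GHz would peak at 128 gigaflops single precision and 48 gigaflops double precision. These are theoretical ceilings, and real code will land somewhere below them.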

Texas Instruments' quad-DSP PCI Express coprocessor card

The initial coprocessor board using the TI C66x DSPs is called the DSPC-8681 and it is made by Advantech. It puts four of these eight-core DSP chips (running at only 1GHz for some reason) on a single half-length PCI-Express 2.0 x8 card. The card has 1GB of DDR3 memory running at 1.33GHz and two Gigabit Ethernet ports. The DSPC-8681 delivers 512 gigaflops at single precision and 192 gigaflops at double precision. This card has a list price of around $1,100.

Kenneth Nesteroff, business development manager for multicore processors at TI's DSP Systems unit, tells El Reg that in the first quarter, Advantech will come out with a full-length PCI-Express card that will deliver around 1 teraflops of single precision performance at a cost of around $2,000 and within a 110 watt thermal envelope.
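For a feel of what the two Advantech cards cost per unit of peak math, here is a back-of-the-envelope sketch in Python. It uses only the approximate list prices and single-precision ratings quoted above, and treats "around 1 teraflops" as 1,000 gigaflops, which is our rounding. The dictionary layout and function name are hypothetical.

```python
# Approximate list prices and single-precision peaks as quoted in the text.
cards = {
    "DSPC-8681 (half-length)": {"price_usd": 1100, "sp_gigaflops": 512},
    "planned full-length card": {"price_usd": 2000, "sp_gigaflops": 1000},
}

def dollars_per_gigaflop(price_usd, sp_gigaflops):
    """List price divided by peak single-precision gigaflops."""
    return price_usd / sp_gigaflops

cost = {name: dollars_per_gigaflop(**spec) for name, spec in cards.items()}

if __name__ == "__main__":
    for name, usd in cost.items():
        print(f"{name}: ${usd:.2f} per SP gigaflop")
```

On these round numbers the half-length card works out to about $2.15 per peak gigaflop and the full-length card to about $2.00, so the bigger card is only slightly cheaper per flop.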

Longer term, TI plans to pack the performance of the DSPC-8681 card into a single chip package called the TMS320TCI6609 – and then plunk four of these onto a single PCI-Express 2.0 card. TI is not saying how it will get that 512 gigaflops of performance out of a single chip, but it stands to reason that there will be a process shrink, a DSP core count boost, and a faster clock speed. (Or TI could just be packaging up four C6678 DSPs into a single package.)

What TI is saying is that the future TCI6609 DSP will deliver that 512 gigaflops of single precision performance at 32 watts, so a four-chip PCI-Express card will deliver 2 teraflops of single-precision oomph in under 200 watts of total power, including an unknown amount of DDR3 main memory for the DSPs.
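The power claims are easier to compare as gigaflops per watt. The Python sketch below uses only figures quoted in this story and is a rough comparison, not a benchmark. The MUSIC number is measured performance on specific codes while the TI numbers are peak ratings. The four-chip card is taken as 4 x 512 = 2,048 gigaflops at the full 200 watts, so its result is a floor. The names and structure are our own.

```python
# (label, single-precision gigaflops, watts), all taken from the article text.
systems = [
    ("MUSIC (SC92, 60 DSPs)", 3.8, 800),     # measured on application code
    ("C6678 at 1.25GHz", 160, 10),           # peak rating
    ("future TCI6609", 512, 32),             # peak rating, TI's projection
    ("future four-chip card", 4 * 512, 200), # "under 200 watts", so a floor
]

def gigaflops_per_watt(gigaflops, watts):
    """Energy efficiency as gigaflops delivered per watt consumed."""
    return gigaflops / watts

efficiency = {name: gigaflops_per_watt(gf, w) for name, gf, w in systems}

if __name__ == "__main__":
    for name, value in efficiency.items():
        print(f"{name}: {value:.3f} gigaflops per watt")
```

Taken at face value, the current C6678 and the promised TCI6609 both come out at 16 gigaflops per watt at the chip level, so the future part's pitch is density rather than better efficiency.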

What would be even more interesting is if TI would put one of its quad-core Cortex-A8 ARM derivatives on a small form factor system board along with four of these C6678 DSPs, or if it doubled that up to a quad-core Cortex-A15 with maybe eight DSPs on the board. Slap a hybrid InfiniBand/Ethernet ConnectX-3 adapter from Mellanox Technologies on there and you could build a low-power supercomputer.

The hardware is the easy part, of course. The software stack would be a little more problematic. If TI is serious about using DSPs and ARMs in HPC, it is going to have to come up with something more than support for OpenMP and more like Nvidia's CUDA environment. ®

Actually the software probably is the big advantage

Unlike GPUs, DSPs have always had an open instruction set, so it was always fairly easy to program them. It's so simple you can even program them in assembler.

Because of those open architectures, institutions buying those computers can develop Fortran compilers for them within weeks. And once you have Fortran, you can run most HPC software.

Anonymous Coward

"the compiler should target the end platform for them and identifiy and parralize(sic) there code for the platform it is running upon. "

That's where I came across 3L (see above).

Parallel C (language, compiler, and runtime) targeting (sometimes reconfigurable) arrays of processing power of various kinds (Transputers back in the day, duly followed by DSP and other clever stuff).

You can find the 3L Parallel C User Guide for Transputers (vintage 1989) at

http://www.transputer.net/prog/72-tds-179-00/book.asp for the frontmatter

and

http://www.transputer.net/prog/72-tds-179-00/3lparcug.pdf for the full 270+ pages.

There is a freely downloadable ancient version of 3L Parallel C for the Transputer as one of the various Transputer-related bits available via

http://www.classiccmp.org/transputer/languages.htm (for best results, start at /transputer/)

I wonder, given the way high end ARMs are going in terms of performance and performance per watt and performance per watt per dollar... no, best not.

Hello Peter.

Enjoy.


DSPs for graphics

Yes, at the time (late 80s to early 90s) using (discrete) DSPs for workstation graphics was pretty common. IBM's "Megapixel" display (a Sony Trinitron monitor at 1024x1024x8bpp, hence the name) for the RT PC was driven by a DSP-based adapter, for example.
