Wanna boost app speed? Think of the server, and tune 'er to NUMA

And maybe chuck in some flash-based DIMMs, too

HPC on Wall Street The HPC on Wall Street conference was hosted in the Big Apple on Monday, and while there was a lot, and we mean a lot, of talk about big data, one presentation stood out as being potentially much more useful in the long run than all of the big data bloviations.

The talk was given by one of the founders of a financial services software maker, who walked the audience through his company's efforts to boost performance through coding apps to be aware of the underlying NUMA architecture of servers.

The techies from 60East Technologies, maker of the AMPS (Advanced Message Processing System) publish/subscribe messaging system built to be the engine behind financial services applications, are also beta-testing flash-based DIMM memory sticks from Diablo Technologies to "crank up the AMPS", as founder and CEO Jeffrey Birnbaum put it.

Depending on which parts of the application are running, AMPS has anywhere from 50 to 100 threads executing at the same time. Some of those threads run the SQL-like messaging database at the heart of the system, which is called State of the World, while others run the publishing and subscription code that pulls in or pushes out data. Subscriptions are expressed as SQL-like queries.

On a current two-socket server using "Sandy Bridge-EP" Xeon E5-2600 processors, you can get 16 cores and 32 threads in a box, and that is about it. The software allows multiple publishing applications to feed in data, which is then routed to multiple subscribers.

Financial apps: Latency, latency, latency

Much of what financial applications do is take data from multiple sources, aggregate it, parse it, and then stream it out to multiple subscribers. And with such streaming applications – and especially the trading and other applications that depend upon those streams – latency is everything. And so even in a two-socket server, the latencies between local memory access on one socket and remote memory access on the other socket can make a big difference to the performance of the overall application.

Birnbaum says that in a typical two-socket Xeon E5 server, local memory access is on the order of 100 nanoseconds. But if you have to jump over to the main memory associated with the second socket in the system, any accesses through the QuickPath Interconnect that links the two sockets using non-uniform memory access (NUMA) clustering can take anywhere from 150 to 300 nanoseconds, with 300 nanoseconds not being unusual for outliers.
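If you want to see that gap on your own kit, a pointer-chasing microbenchmark is enough. What follows is a minimal sketch of our own devising (not 60East code) using libnuma, assuming a two-node box:

    // Local vs. remote memory latency probe (illustrative sketch only).
    // Assumes two NUMA nodes; build with: g++ -O2 numa_probe.cpp -lnuma
    #include <numa.h>
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    // Time dependent loads through a random pointer chain; the random order
    // defeats the hardware prefetcher so each hop pays a full DRAM round trip.
    static double chase_ns(void* mem, size_t slots, size_t hops) {
        void** p = static_cast<void**>(mem);
        std::vector<size_t> order(slots);
        std::iota(order.begin(), order.end(), 0);
        std::shuffle(order.begin(), order.end(), std::mt19937(42));
        for (size_t i = 0; i < slots; i++)
            p[order[i]] = &p[order[(i + 1) % slots]];
        void** cur = &p[order[0]];
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < hops; i++)
            cur = static_cast<void**>(*cur);
        auto t1 = std::chrono::steady_clock::now();
        if (cur == nullptr) std::puts("");  // keep 'cur' live under -O2
        return std::chrono::duration<double, std::nano>(t1 - t0).count() / hops;
    }

    int main() {
        if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }
        const size_t bytes = 256u << 20;  // 256MB, far bigger than the LLC
        numa_run_on_node(0);              // pin this thread to socket 0's CPUs
        for (int node = 0; node <= 1; node++) {
            void* mem = numa_alloc_onnode(bytes, node);  // pages bound to 'node'
            if (!mem) return 1;
            std::printf("thread on node 0, memory on node %d: %.0f ns/load\n",
                        node, chase_ns(mem, bytes / sizeof(void*), 10000000));
            numa_free(mem, bytes);
        }
        return 0;
    }

On the sort of box Birnbaum describes, the node 1 line should come out substantially slower than the node 0 line.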

"You are taking a severe performance penalty," explained Birnbaum, who was preaching that application developers needed to become aware of NUMA tuning and start doing it in their applications as 60East has done.

How much of a performance hit are we talking about? Well, judging from Birnbaum's data on his own AMPS application, a pretty big one, and this awful picture from his presentation shows it, however fuzzily:

The AMPS app can handle a lot more 1KB messages after NUMA tuning

See that? No? Squint a bit...

This presentation will eventually be available on the 60East site, and deepest apologies: it went by so fast we only got this terrible shot of it. But the important thing is that you can see the curves. The lines at the bottom of both charts, which are relatively flat, are the average latencies across all messaging transactions, and if you looked at only this data, you would think everything is hunky dory. But if you look at the average latency of the slowest five per cent of message transmissions in the AMPS application, you will see that as the messaging rate gets pushed up, the outliers start to creep up very, very fast.

"If you have one message that is really bad, that is not good for most environments," said Birnbaum. He maintains that this is why you have to look at more than average latencies if you are analyzing code on NUMA machines.

In the case of AMPS 3.3, which was not tweaked to pin threads and memory together on NUMA machines to stop them socket-hopping for data, latencies start to spike once you push past about 50,000 messages per second. With AMPS 3.5, the latest release of 60East's software, the company's programmers used various tools to analyze the memory accesses as AMPS was running and then learned how to group threads and memory together to cut down on the socket hopping.
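60East has not published its pinning code, but the general shape of the trick on Linux looks something like this sketch (all names ours): bind each worker thread to the CPUs of one socket with libnuma and allocate its working set on that same socket's memory node, so no thread ever reaches across the QPI for its own data. Build with -lnuma -pthread.

    // Keep a worker thread and its state on the same socket (sketch, not 60East's code).
    #include <numa.h>
    #include <pthread.h>
    #include <cstdio>
    #include <cstring>

    struct Worker { int node; size_t bytes; };

    static void* run(void* arg) {
        Worker* w = static_cast<Worker*>(arg);
        numa_run_on_node(w->node);  // this thread now only runs on that socket's CPUs
        void* state = numa_alloc_onnode(w->bytes, w->node);  // memory on the same node
        if (state == nullptr) return nullptr;
        std::memset(state, 0, w->bytes);  // fault the pages in while we are local
        // ... the message-processing loop would live here; every access stays local ...
        numa_free(state, w->bytes);
        return nullptr;
    }

    int main() {
        if (numa_available() < 0) return 1;
        // One worker per socket; low-priority housekeeping can take node 1's latency hit.
        Worker w0{0, 64u << 20}, w1{1, 64u << 20};
        pthread_t t0, t1;
        pthread_create(&t0, nullptr, run, &w0);
        pthread_create(&t1, nullptr, run, &w1);
        pthread_join(t0, nullptr);
        pthread_join(t1, nullptr);
        return 0;
    }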

With the NUMA tuning, AMPS 3.5 was able to push close to 1 million messages per second and still cut the outliers' latencies by more than half. The real bottleneck at that point was the PCI-Express bus and the 10Gb/sec Ethernet network interface card: a million 1KB messages a second works out to roughly 8Gb/sec of traffic, which is knocking on the ceiling of a 10Gb/sec link. With a 40Gb/sec Ethernet card, Birnbaum thinks AMPS 3.5 could probably hit 2 million messages per second.

"This tells you that you want to take your time and program for NUMA," said Birnbaum. And he warns against doing too much reference counting in C and C++ (which is a common way to share data among threads by passing around pointers) can wreak havoc on performance. "You also try to put low-priority stuff in an application on the second socket in the system where you can take the latency hit," he says.

Programmer? I hardly NUMA...

60East employed a number of tools to tune up AMPS 3.5 for two-socket NUMA Xeon E5 servers. The first is libnuma, a library used to set memory allocation policies through the Linux kernel.
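A few of libnuma's policy knobs, to give a flavour (our sketch; the numa(3) man page has the full menu):

    // A sampler of libnuma policy calls (sketch; build with -lnuma).
    #include <numa.h>
    #include <cstdio>

    int main() {
        if (numa_available() < 0) { std::puts("kernel has no NUMA support"); return 1; }
        std::printf("nodes: 0..%d\n", numa_max_node());

        numa_set_localalloc();  // allocate on whatever node the calling thread runs on
        numa_set_preferred(0);  // prefer node 0, but fall back elsewhere if it is full
        numa_set_strict(1);     // fail allocations rather than silently spill remotely

        // For one big structure shared by threads on both sockets, interleaving
        // its pages round-robin across nodes averages out the remote-access pain:
        void* shared = numa_alloc_interleaved(1u << 30);  // 1GB spread over all nodes
        if (shared) numa_free(shared, 1u << 30);
        return 0;
    }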

The company also made use of Pin from Intel, a dynamic binary instrumentation framework that can be used to trace the memory references an application makes as it runs.
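Pin tools are small C++ plugins compiled against Intel's Pin kit. The memory-tracing example that ships with the kit boils down to roughly the following (our abridgement of the documented pinatrace sample); run it with pin -t memtrace.so -- ./your_app and you get a log of every read and write address, which can then be matched against the process's NUMA memory map:

    // Log the effective address of every memory read and write in the target
    // program (abridged from the pinatrace example in Intel's Pin kit).
    #include "pin.H"
    #include <stdio.h>

    static FILE* trace;

    static VOID RecordMemRead(VOID* ip, VOID* addr)  { fprintf(trace, "%p: R %p\n", ip, addr); }
    static VOID RecordMemWrite(VOID* ip, VOID* addr) { fprintf(trace, "%p: W %p\n", ip, addr); }

    // Called once per static instruction; plants a callback before each memory operand.
    static VOID Instruction(INS ins, VOID*) {
        UINT32 memOperands = INS_MemoryOperandCount(ins);
        for (UINT32 memOp = 0; memOp < memOperands; memOp++) {
            if (INS_MemoryOperandIsRead(ins, memOp))
                INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemRead,
                                         IARG_INST_PTR, IARG_MEMORYOP_EA, memOp, IARG_END);
            if (INS_MemoryOperandIsWritten(ins, memOp))
                INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemWrite,
                                         IARG_INST_PTR, IARG_MEMORYOP_EA, memOp, IARG_END);
        }
    }

    static VOID Fini(INT32, VOID*) { fclose(trace); }

    int main(int argc, char* argv[]) {
        if (PIN_Init(argc, argv)) return 1;
        trace = fopen("memtrace.out", "w");
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();  // hands control to the instrumented app; never returns
        return 0;
    }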

And an intrepid programmer at Intel, frustrated by the lack of visibility into NUMA applications, has created an open-source tool called NumaTop to analyze processes and threads and their memory accesses on NUMA systems. There are a bunch of others, too.

But the important thing about NUMA tuning is to do it. "The only way to do this well is that you have to play," said Birnbaum, and that may sound a little bit odd coming from Wall Street. "You have to read, you have to learn, and you have to experiment. But the results will be dramatic."

The other thing that 60East's programmers have been doing is making use of flash-based sticks that plug into main memory slots in the server to help boost the performance of AMPS even further. The messaging platform was designed so that transaction logs can be turned on or off.

You want to turn the logs on because that helps speed up the resynchronization of subscribers if they get knocked offline. But disk drives are too slow and memory is too skinny. As it turns out, Diablo's MCS (Memory Channel Storage) flash sticks come in 200GB and 400GB capacities, and 60East was able to plug eight of the 200GB units into a Xeon E5 server alongside the dozen DDR3 sticks that gave the system 128GB of main memory. The MCS memory has drivers that make it look like another tier in the storage hierarchy of the server.

With the MCS flash DIMMs in the two-socket server, the AMPS software was able to push 4.16 million messages per second, compared to 1.18 million messages per second without them.

That is still not enough to make Birnbaum happy, though. "Most of our performance comes from memory and cores," he says. "Ivy Bridge Xeons are welcome to us, and Haswell would be even better."

But what Birnbaum really wants is an integrated network interface on a Xeon chip, something he says he told Intel was necessary back in 2001. The new "Avoton" Atom C2000 chips have integrated Ethernet network interface controllers, and maybe, just maybe, the future Haswell Xeons due next year will too. As far as El Reg knows, there have been no rumors of integrated Ethernet controllers on the impending Xeon E5 v2 chips based on Ivy Bridge. ®
