Software


Software dev 101: 'The best time to understand how your system works is when it is dying'

Architect for failure, sure, but know that it will never be easy

Tue 8 Mar 2016 // 08:03 UTC

QCon London At the QCon Developer conference underway in London, William Hill's R&D Engineering Lead Gavin Stevenson told attendees that they should celebrate IT failures.

"The best time to understand how your system works is when it is dying," he said.

QCon is a vendor-neutral event focused on large-scale software development and architecture, and is relatively hype-free.

Stevenson's underlying point is that examining how an application fails when under stress is more illuminating than simply observing it working. Failure identifies the limitations of the system. That said, his team works in R&D, rather than on production systems, and the real goal is to avoid failure.

William Hill is a Java shop; but its next-generation bet settlement system under development is written in Erlang, which is designed for concurrency. "The syntax is simple," said Stevenson, "and the supervisor hierarchy makes it really nice to work with."

Supervisors, part of Erlang's OTP (Open Telecom Platform) library, manage child processes and restart them when they crash, adding resilience.
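The restart-on-failure idea behind supervisors can be illustrated outside Erlang. What follows is a minimal Python sketch of the general pattern, not OTP's API and not William Hill's code: the names `supervise`, `make_flaky_worker` and `max_restarts` are hypothetical, and a real OTP supervisor adds restart strategies, restart-intensity windows and process isolation that this toy omits.

```python
from typing import Callable, Tuple


def supervise(worker: Callable[[], str], max_restarts: int = 3) -> Tuple[str, int]:
    """Run worker, restarting it after a crash, up to max_restarts times.

    Returns the worker's result and the number of restarts that were needed.
    """
    restarts = 0
    while True:
        try:
            return worker(), restarts
        except Exception:
            if restarts >= max_restarts:
                # Restart budget exhausted: let the failure propagate upwards,
                # much as a supervisor escalates to its own parent.
                raise
            restarts += 1


def make_flaky_worker(failures_before_success: int) -> Callable[[], str]:
    """Hypothetical worker that crashes a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def worker() -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("simulated crash")
        return "settled"

    return worker


result, restarts = supervise(make_flaky_worker(2))
print(result, restarts)  # settled 2
```

The cap on restarts matters: a worker that crashes every time should eventually surface the error rather than loop forever.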

Stevenson's team decided to use an in-memory database for performance. They tested the system by using a log of all the bets placed for last year's Grand National, over 6.2 million, and replaying them as fast as possible.

"Our app failed, which was brilliant," he said. There was "massive contention" in the database and excessive memory consumption, over 50GB.

A redesign followed: sharding (a technique for partitioning the data), load-balanced supervisors, distributed logging using Apache Kafka, multiple betting engines, and a design that avoids spawning a new Erlang process for every bet. The result was a resilient, scalable system that could process 6 million bets in 20 minutes.
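To make the sharding step concrete, here is a minimal Python sketch of hash-based partitioning, in which each bet is routed to one of a fixed number of shards so that no single store takes all the writes. The talk did not describe the actual partitioning scheme, so the hashing approach, the `shard_for` function, the shard count and the bet IDs are all assumptions for illustration. The last lines work out the rate implied by the quoted figures.

```python
import hashlib
from collections import Counter


def shard_for(bet_id: str, shard_count: int) -> int:
    """Map a bet ID to a shard index using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(bet_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


SHARD_COUNT = 8  # hypothetical; the article does not give a shard count

# Hypothetical bet IDs, used only to show how load spreads across shards.
bet_ids = [f"bet-{n}" for n in range(10_000)]
load = Counter(shard_for(bet_id, SHARD_COUNT) for bet_id in bet_ids)
for shard, count in sorted(load.items()):
    print(f"shard {shard}: {count} bets")

# Rate implied by the quoted figures: 6 million bets in 20 minutes.
bets_per_second = 6_000_000 / (20 * 60)
print(bets_per_second)  # 5000.0
```

A stable hash is used rather than Python's built-in `hash()`, which is randomised per process for strings and so would route the same bet to different shards on different runs.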

Stevenson's team also relies on Docker containers for deployment. "Everything we do in R&D, it's Docker," he said, though the team has struggled with container load-balancing and orchestration. "There isn't a brilliant solution," he added, but they are looking at Docker Swarm, a product for clustering Docker engines.

"It's a reactive microservice-based architecture," said Stevenson. "Probably. Nobody seems to agree what microservices is."

Making your application fail, then, is a handy tool for application development; but only one small piece in the wider task of designing resilient systems.

At an "open space" discussion which followed Stevenson's talk, William Hill's relatively clean-room development story, in the comfort of R&D, seemed remote from the reality facing many businesses.

One attendee, in the financial services industry, lamented the many dependencies in the system he managed, any one of which could stop things working. The core problem was a legacy back-end system including IBM's WebSphere MQ, SOAP web services and JDBC (Java) database calls. "It's 30 years of legacy," he said. "When will we get the budget to fix it? Not in my lifetime."

Nor is today's rush towards microservices architecture a complete solution. Each microservice is a dependency, and what happens when one breaks? "You have to dig into why it doesn't work, how do you react quickly?" asked an attendee.

Even if you think you know how it should be done, implementing today's best practice in the real world is a huge challenge. ®
