How two failed capacitors stranded Sydney rail commuters
Stalled by a LAN switch
The railway signaling failure which crippled Sydney on April 12 (some commuters reported trips of more than three hours) was caused by a failing LAN switch and software that couldn’t cope, an engineering report has found.
The switch, probably a Cisco device given that the Borg is RailCorp’s dominant LAN kit supplier, was part of the network in the Sydenham signaling station. That facility governs signaling for a large chunk of the Sydney rail network.
The guilty switch suffered partial failure of two electrolytic capacitors, the report found. The switch is part of a dual-redundant LAN that is supposed to be resilient to failure; however, the configuration didn’t account for an intermittent breakdown.
With the caps failing, the switch would shut down and try to restart itself. This, the engineer’s report says, meant the Sydenham LAN was “caught in a cycle where it was continually trying to reconfigure itself to address the changing state of the network.”
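To see why an intermittent failure is so much nastier than a dead box: a cleanly failed switch triggers one failover and the redundant path carries on, whereas a flapping one keeps the network perpetually reconfiguring unless something damps the transitions. The sketch below is a purely illustrative Python toy, nothing to do with RailCorp’s actual kit or the ATRICS code; the class name, hold-down timer, and timings are all invented, but it shows the usual remedy, a hold-down (hysteresis) timer, in its simplest form.

```python
# Illustrative toy only: a primary/standby pair that fails over only after the
# primary has been down continuously for HOLD_DOWN seconds. Without such a
# timer, every down/up transition from a flapping switch forces another
# reconfiguration; with it, a bouncing link is simply ignored.

class RedundantLink:
    HOLD_DOWN = 5.0  # seconds the primary must stay down before failing over

    def __init__(self):
        self.active = "primary"
        self.down_since = None   # when the current primary outage began
        self.failovers = 0

    def report_primary_state(self, is_up: bool, now: float) -> None:
        if is_up:
            self.down_since = None        # link bounced back: reset the timer
        elif self.down_since is None:
            self.down_since = now         # outage just started
        elif now - self.down_since >= self.HOLD_DOWN and self.active == "primary":
            self.active = "standby"       # sustained outage: fail over once
            self.failovers += 1

if __name__ == "__main__":
    link = RedundantLink()

    # A switch that reboots every few seconds never stays down long enough
    # to trip the hold-down timer, so the pair never thrashes.
    for t in range(0, 60, 2):
        link.report_primary_state(is_up=(t % 6 != 0), now=float(t))
    print(link.active, link.failovers)   # primary 0

    # A clean, permanent failure produces exactly one failover.
    for t in range(60, 80, 2):
        link.report_primary_state(is_up=False, now=float(t))
    print(link.active, link.failovers)   # standby 1
```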
It took only a little over ten minutes for technical staff to initiate a disaster recovery plan, but the procedure took more than an hour to complete. In that time, the software that governs the rail network, known as ATRICS, was unable to cope with the flaky network. This led to a knock-on effect, taking out a system called Microloc at another station, Revesby.
With ATRICS and Revesby’s Microloc system both failing, the network failed to a “safe state”, the report says, in which trains were halted where they were. Because of the hugely interdependent nature of the Sydney rail network, 847 trains were delayed and 240 were cancelled, and it took the rest of April 12 for the system to recover.
One of the key recommendations of the report is that “The resilience of the ATRICS software to automatically recover from network disturbances without the need for manual intervention should be addressed as a matter of urgency.”
Amen to that. ®
you obviously don't live in NSW then....
The solution will involve a parliamentary oversight committee run by Rev. Fred Nile (with all of his IT experience). This should take about 4 years of junkets, visiting major railway locations such as Paris, London and Rome, to see how things are meant to work.
The committee will recommend about $2.1 billion of spending to fix the problems. A government tender will receive bids ranging from $4.32 billion to $9 billion; a final, FINAL price of $12 billion will be settled on, with a company headed by the Treasurer's brother-in-law.
This will then take another 4 years to implement.
After 6 years, with costs blown out to $18 billion, the government will finally admit that nothing has been done, that it can't recall who signed the contract or where he lives, and that the construction company has turned out to be based in a beachside shack in Belize, with bank accounts held in criminals-are-our-friends Switzerland.
The next day, the signals will fail again, but this time the software will have forgotten the 'safe mode' setting, the government having changed it to the 'make sure every train gets there really fast and early to keep the voters happy' setting.
Chaos ensues, but being NSW this is normal.
The title was my first thought. Although others may bay for blood, it's the kind of oversight that can happen. At least the system failed safe, inasmuch as no trains crashed into each other, nobody died (that I know of!), and hopefully they'll now learn from this.
Beer, because it's an Aussie tradition
So the end result...
I imagine they'll have created a small team of fairly well-paid technical staff to develop a solution. The solution, in around 12 months' time, will be to leave it all alone and hope it never happens that way again. Which it probably won't.