IBM spends holiday season wrangling e-tail FAIL
WebSphere-powered site at Oz department store chain takes the week off
Update: Australia's largest department store's website crashed for the week of post-Christmas sales, leaving IBM and possibly Oracle scrambling to fix the mess.
The department store chain that suffered the outage is Myer, Australia's analog for the UK's John Lewis or North America's Macy's inasmuch as it is positioned beneath more upmarket alternatives.
In 2008, Oracle published a case study (PDF) trumpeting that Oracle Retail had been implemented at Myer over several years, to give it “a unified view of inventory levels, purchasing, and supplier contracts so they could make better buying decisions.” The case study names IBM as the implementation partner. In 2012, IBM let it be known that its WebSphere software would power Myer's new website.
That site was put to the test on December 26th, traditionally the opening day for a period of heavy retail discounting in Australia. The site quickly crashed under a rush of mouse-wielding bargain-seekers, many of whom would have chased the exclusive deals offered online. The site stayed down until January 2nd. December 26th and January 1st are public holidays in Australia, while the 27th, 30th and 31st of December were theoretically normal business days on which Myer and IBM would have been able to drag their tech teams off the beach and into the office. The weekend of December 28th and 29th afforded further opportunity to work on the site.
As the outage continued, Myer executives told various media, including The Australian Financial Review, that its own IT team and IBM folk around the world were doing all they could to get the site back online.
Just what went wrong has not been disclosed but a Myer spokesperson told The Register that “Communications between the software and servers” was the problem. The surge in volume of shoppers, we were told, was not the cause of the problem as Myer prepared for a rush by laying on extra compute capacity.
The latter nugget of information suggests that perhaps an elastic cloud service was used to provide extra seasonal capacity. We've contacted IBM and it is yet to offer an explanation as to its role in the situation, nor has Oracle confirmed that it is still present at Myer.
Myer's website crash notice was in place for six days
But let's guess that Oracle Retail remains in place, if only because it's not the kind of application that turns over in a hurry, and that the WebSphere-powered retail website makes use of the inventory levels it provides.
As described by Myer, the problem sounds like it could be one of two things:
- A network problem, either on the LAN or WAN;
- A middleware mess of some sort that means messages from an app on one server aren't playing nicely with another.
If the former scenario is the cause of this mess, presumably either a telco or a networking hardware vendor is currently looking at the “penalties” section of their contracts. IBM still sells a little own-brand networking kit but mostly resells boxen made by others, so if the network is the problem we imagine fingers will be pointed in many directions. We'd guess that it's a LAN issue, as telcos are generally pretty good at redundancy. But would a LAN issue result in a week-long outage? Unless something's literally gone up in smoke, it's hard to imagine so. Which leads us to a second scenario.
Might integration between discrete software have come undone, perhaps when the extra seasonal compute capacity was added to the mix? Perhaps links between WebSphere and Oracle Retail - or some other software present at Myer we don't know about - came apart with disastrous consequences?
One thing is almost certain: whoever designed and tended the myer.com.au disaster recovery plan is about to revise their curriculum vitae.
At the time of writing, IBM and Oracle had not provided any comment. Myer's promised to reinstate the offers as soon as is possible, plans an investigation into the incident and is now watching its site's performance carefully to prevent future outages. ®
Update, Friday January 3rd: IBM has sent The Reg the following "IBM statement attributed to an IBM spokesperson".
"An IBM team of local and global experts worked around the clock with MYER to resolve the issue with its online store. The technical issue was caused by a communication breakdown between internet servers and a software application. IBM and MYER will work together to conduct a thorough review to ensure this issue does not reoccur. IBM is committed to supporting MYER to continue to provide high quality service to its customers."
"... a communication breakdown between internet servers and a software application" sounds a bit like our hypothesis that scaling to the cloud broke something. We'll keep asking for more detail, but don't expect much now that we're in spokesperson territory.