That sphincter-flexing moment for devs when it's time to go live
No stress test? Then prepare for a potential TSB-like mess
You can't hack the code and manage the project forever: at some point you must go live.
Where once go-live moments might have taken place months or years after the beginning of a project, today techies face go-live moments every day thanks to DevOps and continuous delivery.
Some of them are switching on production versions of greenfield applications for the first time. Others are upgrading existing ones. Those less fortunate are managing massive migrations, moving millions of mission-critical records from legacy systems to newer infrastructure.
Sometimes they hit problems – like TSB.
That go-live moment can be one of the most harrowing of events – even for hardened IT pros. Mistakes ranging from poor testing to inadequate capacity planning and a lack of redundancy or disaster recovery can all wreck a project seconds, days or years after you flip the switch.
In software projects, everything leads to that moment when the team decides to take a project live. Poor planning months beforehand can derail that moment long before it happens, and this is not uncommon. In its 2016 Chaos report, the Standish Group said that 19 per cent of IT projects fail entirely, and that 56 per cent are challenged.
The failure of the Ariane 5 rocket, which exploded after launch in 1996, is one of the most spectacular examples of what can go wrong when someone flicks a switch without everything properly in place. In that case, a mixture of errors compounded to wreck the launch. A simple floating point-to-integer conversion failure in the inertial reference software killed the guidance system. More thorough testing could have caught the problem. The second cause was a lack of redundancy. There was a backup system, but it ran exactly the same software, and so failed in exactly the same way.
Less spectacular was the US Customs and Border Protection (CBP) outage in January 2017 that left thousands of travellers standing in line for hours nationwide. The CBP had upgraded its computer systems from a legacy mainframe to a newer environment, but it didn’t plan properly. A report from the DHS Inspector General found that inadequate system capacity testing was to blame. It also pointed to poor business continuity processes that made it difficult to recover.
So, what are the common problems and how can you minimise your chances of failure or fuck-up and, as a consequence, avoid drawing the ire of your customers, employees and managers?
Jeff Cunliffe, director of Automation Consultants, manages software lifecycles for clients and regularly takes projects live for large customers. His biggest takeaway? Do your homework. Successful teams work months in advance putting the measures in place to avoid disasters on the day.
"The solution has to be the implementation of procedures and checklists to ensure that these transitions are successful," he said.
These measures include a detailed deployment plan and a back-out plan in case things go wrong. Deployment teams should also create a list of change freezes to stop owners of dependent systems from making changes that could wreck the go-live process.
Other important measures include a communication plan to explain exactly what will happen to all stakeholders, both inside and outside the organisation.
When planning for the big move, be sure to include a resource list so that deployment teams aren't caught short without the people and skills to flip the switch. Finally, map out the tasks involved in the system migration in a timeline like this from ServiceNow. That timeline should include go/no-go meetings, as outlined in this particular go-live checklist, so that the implementation team can determine when it is ready to implement the new system.
Another important must-have to ensure a smooth go-live process is to understand what other systems your project will touch. One of the most significant failures early on in a project is the insufficient discovery of scope, said Cunliffe's fellow director at Automation Consultants, Francis Miers.
This is especially true in poorly documented legacy systems that have grown organically over time, and for which many of the existing developers have left.
"There often isn't full information about the dependency of existing systems. So you need a thorough discovery exercise," Miers said. Talking to employees about how they work with the system can help here, but in-house politics or time constraints mean that they may not always be available.
You can also take a technical approach. Runtime analysis tools can monitor dependencies, but it’s important to run them for an extended period to catch things like jobs that may only run at the end of the month.
Testing is also an essential part of the process, but don't focus purely on unit testing, warned Robert Rutherford, CEO at QuoStar. "[Companies] will test against a particular user base or team, but they won't stress test it adequately," he said. "When they do the live switch over they get all sorts of performance-related issues."
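A stress test need not be elaborate to show the difference between one user and many. The following is a minimal sketch rather than a real load-testing tool: `handle_request` is a hypothetical stand-in for a call to the system under test, and the concurrency and request counts are arbitrary assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def handle_request(n: int) -> int:
    """Hypothetical stand-in for one call to the system under test."""
    time.sleep(0.001)  # pretend the system does a little work
    return n * 2


def timed_call(n: int) -> float:
    """Return how long a single request took, in seconds."""
    start = time.perf_counter()
    handle_request(n)
    return time.perf_counter() - start


def stress(concurrency: int, total_requests: int) -> dict:
    """Fire requests from many threads at once and summarise the latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))
    cuts = quantiles(latencies, n=100)  # 99 percentile cut points
    return {
        "requests": len(latencies),
        "p50": cuts[49],
        "p95": cuts[94],
        "max": max(latencies),
    }


report = stress(concurrency=20, total_requests=200)
print(report)
```

Running the same function at rising concurrency levels, and watching where the 95th percentile starts to climb, gives a rough idea of where the performance problems Rutherford describes might appear.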
User acceptance testing is also crucial for the new system to ensure that it is what the users expect.
These are some common requirements in most projects, but outside of broad project management methodologies like PRINCE, there are no industry-standard playbooks for going live, especially when migrating between old and new systems, experts tell us.
"Every single project is completely different," said Don Tomlinson, programme director at ECS. With 20 years under his belt working on large implementation projects, he has seen it all. "Even if it's the same application, your approach, your appetite for risk, your stakeholder management, the maturity of the client, the systems integration capabilities, all of those things need to be looked at."
Tomlinson sees three broad approaches when going live with a system. The first is the most favourable for developers: A brand new platform with no existing users and with no dependencies on other systems. Going live with these projects is easier than when adapting or migrating systems that have existing users, because flipping the switch with these systems doesn't affect other infrastructure or applications.
Those migrating from old systems to new ones must pick from the other two approaches. The first is a warm transition, which others call parallel running. Here, you run the new system live alongside the old one to minimise the risk. Running the two systems alongside each other enables the IT team to transition users from one system to another in groups rather than attempting a "big bang" transition where everyone has to jump from the old system to the new one at once.
The IT team can begin with low-impact groups of users whose work isn't as mission critical as others. As the transition progresses, they can carefully monitor these groups to catch any problems. This enables them to localise any issues so that they only affect a small subset of the user base. In new software development projects or upgrades to existing systems, canary releases serve a similar purpose, rolling out versions of the software to small groups of users at a time. It’s a common technique for DevOps teams working to frequent release schedules.
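Gradual rollouts like these are often driven by a stable hash of the user ID, so that a given user always lands on the same side of the split as the percentage is dialled up or down. The sketch below assumes that approach; the salt string, the function names and the customer IDs are all hypothetical.

```python
import hashlib


def bucket(user_id: str, salt: str = "new-system-rollout") -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_system(user_id: str, rollout_percent: int) -> bool:
    """True if this user falls inside the current rollout percentage."""
    return bucket(user_id) < rollout_percent


users = [f"customer-{n}" for n in range(1000)]
at_10 = {u for u in users if use_new_system(u, 10)}
at_25 = {u for u in users if use_new_system(u, 25)}

# Everyone selected at 10 per cent is still selected at 25 per cent.
print(at_10 <= at_25)
```

Because the bucket depends only on the ID and the salt, raising the percentage adds users without reshuffling those already moved, and setting it to zero sends everyone back to the old system.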
This gradual parallel transition is what Joe Paice and his team at Red Badger did when migrating from one customer-facing, web-based system to another for a client.
"We had a number of customers that we knew personally that we could pilot through the system," he said, explaining that the company then began segmenting users. It had an algorithm that it could use to dial the number of users running through the new system up or down while ensuring that the same users would always be selected. "We encountered bugs, and that's to be expected. We were able to recover from those with a manual fix of the customer's data."
Where you're migrating front-end systems that both rely on the same back-end database, you can often carry on writing to the same back end, said Paice. You'd have to be sure that your new system was writing accurate data, though. Those basing their new application on containers can mirror traffic from Kubernetes pods, he adds, which can give them more flexibility when routing traffic. They might use that to write traffic both to an existing legacy database and a test version of a new one running in parallel, to test that the new system produced records in line with the old one.
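The comparison step in that kind of parallel run can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any particular team's tooling: `legacy_write` and `new_write` are hypothetical stand-ins for the two systems, and the record fields are invented.

```python
def legacy_write(order: dict) -> dict:
    """Hypothetical legacy behaviour: postcodes are trimmed and upper-cased."""
    return {
        "id": order["id"],
        "postcode": order["postcode"].strip().upper(),
        "updated_at": "legacy-timestamp",
    }


def new_write(order: dict) -> dict:
    """Hypothetical new system: trims the postcode but forgets to upper-case it."""
    return {
        "id": order["id"],
        "postcode": order["postcode"].strip(),
        "updated_at": "new-timestamp",
    }


def compare_records(legacy: dict, candidate: dict, ignore=("updated_at",)) -> dict:
    """Return {field: (legacy_value, candidate_value)} for every mismatch."""
    fields = (set(legacy) | set(candidate)) - set(ignore)
    return {
        f: (legacy.get(f), candidate.get(f))
        for f in sorted(fields)
        if legacy.get(f) != candidate.get(f)
    }


def mirror(order: dict):
    """Serve the legacy result, shadow-write to the new system, report differences."""
    primary = legacy_write(order)
    shadow = new_write(order)
    return primary, compare_records(primary, shadow)


result, diffs = mirror({"id": 42, "postcode": " sw1a 1aa "})
print(diffs)
```

Fields that are expected to differ, such as timestamps, go in the ignore list so that the report only shows the discrepancies worth investigating.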
The final go-live option is the one that everybody would rather avoid: the big bang cut over. It migrates one system to another during a planned outage period and then immediately decommissions the old one.
"This is when you cut over the system in one go," said Cunliffe. "Sometimes there are situations that you can't back out of." If you're migrating a manufacturing plant control system and the only possible time to cut over any code is during a major plant turnaround, then you might have to execute everything in one go to fit into the turnaround window.
Big bang go-live scenarios have their benefits because you don't have the cost of running two systems in parallel. The downside is the risk, which makes thorough testing and mitigating that risk a top priority. Cunliffe said that "dress rehearsals" are an essential activity here. In these tests, you do the whole migration, including all of the data, but stop short of turning off the old one.
The challenge during such cut overs is data migration. "The minute the old system goes offline you have to initiate your copy. That can take several hours, and that's something that in a trial run you measured how long it took," said Cunliffe.
When you do come to do the real migration, you may be able to use the data that you collected from the legacy system during the dress rehearsal. You’ll still have to collect the new data that the legacy system has produced since the dress rehearsal happened, though. Database transaction logs will often be a vital asset here.
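The idea of topping up a rehearsal copy can be sketched as replaying only the changes recorded after a checkpoint. Real transaction logs look nothing like this and are read through database-specific tooling; the `Change` record, its sequence numbers and the in-memory "tables" below are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    seq: int                # position in the (hypothetical) change log
    key: str                # primary key of the affected row
    value: Optional[dict]   # new row contents, or None if the row was deleted


def apply_delta(target: dict, log: list, checkpoint: int) -> int:
    """Replay changes newer than the checkpoint onto target; return the new checkpoint."""
    last = checkpoint
    for change in sorted(log, key=lambda c: c.seq):
        if change.seq <= checkpoint:
            continue  # already copied during the dress rehearsal
        if change.value is None:
            target.pop(change.key, None)
        else:
            target[change.key] = change.value
        last = change.seq
    return last


# State copied at the dress rehearsal, when the log had reached entry 2.
target = {"a": {"balance": 10}, "b": {"balance": 5}}
log = [
    Change(1, "a", {"balance": 10}),
    Change(2, "b", {"balance": 5}),
    Change(3, "a", {"balance": 7}),
    Change(4, "b", None),
    Change(5, "c", {"balance": 1}),
]

new_checkpoint = apply_delta(target, log, checkpoint=2)
print(target, new_checkpoint)
```

Only entries 3 to 5 are replayed, which is what shortens the copy window on the day compared with moving everything again.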
By the time the final migration does happen, the team should have automated as many steps as possible using scripts – making sure that there are also scripts for a rollback. "A good rollback plan will take you back to wherever you were when you took the old system offline without any loss of data," said Cunliffe.
Rollback should cover everything from the network level upwards, supporting traffic routing back to the old system and database transformations to keep data in sync. It should be a part of your dress rehearsal, said Cunliffe, adding a caveat: Sometimes, it may not be possible to roll back entirely cleanly and avoid some effect on your data.
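One common way to structure scripted cut-overs is to pair every step with its undo and, if anything fails, run the undos for completed steps in reverse order. The sketch below assumes that pattern; the step names, the `state` dictionary and the failing health check are hypothetical, and real undo steps may not be as clean as these.

```python
from typing import Callable, List, Tuple

# Each step is (name, do, undo).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_cutover(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    done = []
    for name, do, undo in steps:
        try:
            do()
            done.append((name, undo))
        except Exception as exc:
            print(f"{name} failed: {exc}; rolling back")
            for prev_name, prev_undo in reversed(done):
                print(f"undoing {prev_name}")
                prev_undo()
            return False
    return True


state = {"schema": "v1", "traffic": "old"}


def set_state(key: str, value: str) -> Callable[[], None]:
    """Build a step that sets one entry in the hypothetical system state."""
    def apply() -> None:
        state[key] = value
    return apply


def failing_health_check() -> None:
    raise RuntimeError("new system did not respond")


steps = [
    ("migrate schema", set_state("schema", "v2"), set_state("schema", "v1")),
    ("route traffic", set_state("traffic", "new"), set_state("traffic", "old")),
    ("health check", failing_health_check, lambda: None),
]

succeeded = run_cutover(steps)
print(succeeded, state)
```

Exercising the failure path during the dress rehearsal, rather than only the happy path, is what shows whether the undo scripts actually put things back.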
Ultimately, no matter which go-live strategy you employ, it's important not to forget the human factor. Companies must learn how to manage their people after the go-live, when the support calls start coming in. Even if a system works correctly, you can expect a lot of them.
Ivan Ericsson, head of quality management at SQS Group, said: "I would have some kind of early life support. The challenge is that you get a lot of users who find that the system doesn't do something in the same way as the previous system, and think that's a defect that they need to flag up. You get a lot of noise in the early days."
Are you ready for all that? It's not a job we would want. It sounds a little like changing your car engine while in the overtaking lane on the motorway. If someone does saddle you with this task, then careful planning and a lot of testing could be the things that help you avoid becoming the next embarrassing headline. ®