TSB have had a disaster to deal with: lots of press coverage, and all for the wrong reasons. A major IT change resulted in days of customer chaos, with disruption that lasted for over a week.
IT infrastructures are increasingly complicated, particularly in the large banks, but I think it is irresponsible of people to say they are held together with sticking plaster. The complexity does mean, though, that implementing change requires strong discipline and management.
Here are some points to consider.
The devil is in the detail. Therefore, take time to do some serious analysis and planning. I say “some” because it’s important to stop short of analysis paralysis. The important thing is to get all contributing individuals and teams together and work through the logistics to develop a step-by-step plan and schedule. That requires firm facilitation and project management skills.
Do a dry run. Find out what doesn’t work. Fix it. Then do another dry run. Repeat. Repeat.
Work out what verification tests you’re going to run once the change has been made, before you declare that it has been implemented successfully. In the TSB case, it seems they didn’t stress test the infrastructure with sufficient volumes to expose the underlying problem.
Work out where people are going to be. It may not be possible to have everyone in the same physical location during the implementation activity, so communication will be key.
Follow the plan. Hold regular communication checkpoints (these should be in your plan). Use the technology to keep up to date – instant messaging, video and conference phone facilities.
Know when your “drop dead” time is – the point at which you have to say “stop” if it isn’t going to plan. It should leave you enough time to revert / back out your change before the change window closes. During your planning stage you should have worked out how long back-out will take.
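The “drop dead” arithmetic is simple enough to write down in advance, so nobody is working it out at 3am. Here is a minimal sketch in Python. The function names, the safety margin and the example times are all my own assumptions for illustration, not anything from the TSB plan.

```python
from datetime import datetime, timedelta


def drop_dead_time(window_end, backout_duration, safety_margin=timedelta(0)):
    """Latest moment at which back-out can start and still finish by window_end.

    All three inputs are assumptions you agree during planning:
    - window_end: when the service must be back for customers
    - backout_duration: how long the rehearsed back-out took in dry runs
    - safety_margin: extra contingency on top of the rehearsed figure
    """
    return window_end - backout_duration - safety_margin


def must_stop(now, window_end, backout_duration, safety_margin=timedelta(0)):
    """True once the drop dead time has been reached or passed."""
    return now >= drop_dead_time(window_end, backout_duration, safety_margin)


# Hypothetical example: service due back at 06:00, back-out rehearsed at
# 3 hours, with 30 minutes of contingency.
window_end = datetime(2024, 1, 7, 6, 0)
deadline = drop_dead_time(window_end, timedelta(hours=3), timedelta(minutes=30))
```

With those example figures the decision point falls at 02:30. If the implementation isn’t demonstrably on plan by then, you stop and back out.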
When you realise something has gone wrong
This requires just as much firm management and communication as the planning and implementation. While you can’t prepare for the exact nature of the incident, you can prepare for how you will go about dealing with it. The following are vital:
Management responsibility – somebody who is clearly in charge and coordinating the troubleshooting. It may also be worth setting up a war room.
Diagrams and documentation – available to everyone, so that there’s a common understanding of components and names.
Clarity – it is common for different symptoms to arise with complex infrastructures; make sure you’re recording the symptoms with as much detail as possible. Times, locations, observations, specifics.
Communication vehicle – agreed instant messaging and conference facilities so that everyone is clear about what the next step is.
Coverage plan – the longer the incident goes on, the more people will get tired. When people are tired, they don’t think as clearly. Rotate people so that they can get some rest and be fresh.
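On the clarity point, it helps if everyone records symptoms in the same shape. The sketch below is one possible way to do that in Python. The field names and the sample entries are hypothetical, chosen only to mirror the “times, locations, observations, specifics” list above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Symptom:
    """One observed symptom during an incident (field names are illustrative)."""
    observed_at: datetime   # when it was seen
    location: str           # site, system or component where it was seen
    observation: str        # what was actually seen, in specifics
    reported_by: str        # who to go back to with questions


def timeline(symptoms):
    """Return symptoms in the order they were observed, oldest first."""
    return sorted(symptoms, key=lambda s: s.observed_at)


# Hypothetical entries, logged out of order by two different teams.
log = [
    Symptom(datetime(2024, 1, 7, 9, 15), "mobile app", "balances not displayed", "service desk"),
    Symptom(datetime(2024, 1, 7, 8, 40), "online banking", "login timeouts", "operations bridge"),
]
ordered = timeline(log)
```

Sorting into a single timeline makes it easier for whoever is in charge to see which symptom came first and which may simply be knock-on effects.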
After – when it’s given the all clear
Hold a full “drains-up” review where everyone can contribute positively and without fear of the stupid Houses of Parliament cry – “is the minister going to resign?”. Remember to do a review after every change – there will be lots of good lessons to learn from successful change implementations as well as the disastrous ones!
It will be very interesting to see what comes out of the TSB Project Implementation Review / Major Incident Review. I’m looking forward to hearing what they’ve found.