“WHEN WILL YOU GET SERVICE RESTORED?” shouts the Senior Vice President again, cutting across the technical discussions on the bridge call, for the second time in five minutes.
“The service restart will take forty minutes,” replies the Senior Technical Lead.
“YOU HAVE TO GET IT BACK IN TEN!”
When managing a business-damaging incident, it is clear why representatives from ‘the business’ behave the way they do – behaviour which (depending on the prevailing culture of the organisation) can range from reasonable to psychotic. When critical systems go down and affect the minute-to-minute revenue streams – even for a minute – financial losses can be huge.
A strong IT tech support person will rightly say, “calm down and let us get on with our work”, but few will have the strength and experience to handle sustained aggression from the business side.
We know from analysis and experience that the normal response – jumping to cause – costs more money, reputation and time than calmly working through the evidence. But there are two other crucial factors interfering with good sense under pressure – biology and psychology.
Given the external pressures, there are four key drivers of effective performance that consultants have defined and revised:
- Predictable performance – how we use our brain and skills to create the circumstances and environment that people need to handle complex problems under severe time pressure
- Infrastructure – the tools and systems we use to enable exemplary support
- Feedback – the learning and improvement loop
- Right channels – making sure we have effective communications in major incidents
Many people believe that coping with emergency situations is a matter of in-built character, somehow genetically determined. Use of advanced brain scanning techniques has in recent years allowed researchers to find out what’s really happening in human brains when people are put into stressful situations, and what stresses trigger the strongest neural responses. Knowing these triggers and reducing them can have an immediate benefit to personal health and business performance.
Brain Hardware is wired for two kinds of response – Threat and Reward. Threat is an easily triggered response which takes resources from the prefrontal cortex; it inhibits complex problem solving and drives a survival instinct. Reward releases dopamine, improves collaboration and increases the quality of rational thinking.
The SCARF Model
The triggers for the ‘Threat and Reward’ response are based on a surprisingly small number of stress factors; these five (the SCARF Model [1]) exert the strongest influences on individual performance:
- Status is about relative importance to others
- Certainty concerns being able to predict the future
- Autonomy provides a sense of control over events
- Relatedness is a sense of safety with others, of friend rather than foe
- Fairness is a perception of fair exchanges between people
Five things to do that will help:
1) For managing status anxiety, filter out threats. Many companies now operate two simultaneous bridges: Technical and Management. The people fixing stuff should be protected from disturbance and left to get on with what they do best. Management need to handle the legal, political and reputational implications of the outage. The flow between the two channels has to be managed carefully, and that means keeping progress visible for all. Make the current status of the incident readable and widely distributed, so that anyone can check it without interrupting the bridge to ask ‘what’s going on?’
2) Certainty through process: making your recovery and resolution process visible, in small recognisable steps, is important. If senior people – and clients – can go and see that you’re making progress, that will tend to ease the shrill demands.
Practise on the easy: if your current case records look rather scruffy, how likely is it that crisp, clear reporting will suddenly get done in an emergency? Get folks to use good-quality, consistent troubleshooting on everyday cases, and they’ll be better able to handle the tough ones.
The best people are often hidden away. Some years ago a colleague ran a project in a manufacturing plant. Tracing faults was often hard since operators tended to tweak their machines all the time, and figuring out what change related to what fault was nigh on impossible.
Except for Colin. Colin made constant little notes: “10:43: ring former out of alignment. Turned adjuster one quarter turn clockwise. Product all OK”. People with an eye for detail are safe hands in tough times.
Every incident has a natural process, and it isn’t trial and error. Keep the list of issues, impacts and consequences visible to all, as well as the investigation and resolution actions. When we read Major Incident process guides, we too often find them focused on who calls who, and not enough about HOW the incident needs to be progressed to ensure a quality outcome.
Creating this path is one of the key things we can do to use the SCARF model, since having a recipe for success is an important way of increasing certainty. This also allows the visibility that management and regulators expect.
With a clearly described method and visible stage gates, people can apply their creativity when it matters most and their analytical skills where they are needed, and each person has the opportunity to make a visible contribution.
3) Autonomy is cemented by making sure that the boundaries are there, and that everyone knows what to do when they reach them. For example, in most incidents an accurate time for the start of the symptoms is important for diagnosis and decision making. You can leave engineers free to decide how they get this information (log files, user experience, downstream effects, etc.), but get it they must, and at a gallop, if recovery is to be effected quickly. You may mandate other steps, such as a careful risk assessment prior to the recovery action, but leave the details up to the incident managers.
4) Relatedness helps the incident handling process, and not just in major incidents. This was recently illustrated by a change we made to case escalation at a client. Instead of just filling in the severity rating according to predefined criteria (Impact 1-4), Tier 1 engineers were instructed to describe in words what the problem was like for the users (for example, ‘all forty outbound mortgage staff unable to process applications’). We found the higher tiers (2 and 3) much more responsive when the situation was clearly explained to them.
5) Fairness can be lost if the ‘messenger is shot’. People shout at doctors treating their relatives, even though the doctor is not at all responsible for the disease in question and is doing their best. Expecting teams to perform well when the solution is outside their control is one of the biggest mistakes. Blaming folks for factors beyond their control doesn’t enhance performance; it only adds to the sense of unfairness. Recognising teams appropriately for their contribution is important in building an environment that encourages predictable performance.
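For teams that want to try the visible incident record from points 1 and 2, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any particular tool: the names IncidentLog, LogEntry, record and status_page are hypothetical, and the incident title, dates and times in the example are made up. It keeps timestamped notes in the style of Colin's and renders them as a plain-text status page that anyone can read without interrupting the technical bridge.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class LogEntry:
    """One timestamped note (hypothetical structure, for illustration)."""
    when: datetime   # when the observation or action happened
    who: str         # who recorded it
    note: str        # what was seen or done, in plain words


@dataclass
class IncidentLog:
    """A shared incident record (hypothetical, not a real product's API)."""
    title: str
    impact: str                      # what the outage is like for the users
    entries: list = field(default_factory=list)

    def record(self, when, who, note):
        """Append one timestamped note to the log."""
        self.entries.append(LogEntry(when, who, note))

    def status_page(self):
        """Render the log as plain text, oldest entry first."""
        lines = [f"INCIDENT: {self.title}", f"IMPACT: {self.impact}"]
        for e in sorted(self.entries, key=lambda e: e.when):
            lines.append(f"{e.when:%H:%M} {e.who}: {e.note}")
        return "\n".join(lines)


# Example data below is invented purely for illustration.
log = IncidentLog(
    title="Mortgage application system unavailable",
    impact="All forty outbound mortgage staff unable to process applications",
)
# Entries can arrive out of order; the status page sorts them by time.
log.record(datetime(2024, 1, 15, 10, 52), "Tier 2",
           "Service restart begun; estimate forty minutes")
log.record(datetime(2024, 1, 15, 10, 43), "Tier 1",
           "First user reports of failed applications")

page = log.status_page()
print(page)
```

The impact is stored in words rather than as a severity number, in line with point 4, and sorting by time keeps the start of the symptoms at the top of the record, where point 3 says it is needed.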
“I DEMAND THAT SYSTEM BACK IN TEN!”
“I understand your concern, and I hear that you want it back in ten minutes, but that’s not how it works…”