Following three days of service disruption, the Financial Times reported that last week’s BlackBerry outage was caused by “a failure of a core network switch in a data centre in Slough”. So what would have gone on behind the scenes as 30 million BlackBerry users lost access to their mobile email, and could the disruption have been avoided?
There are a number of monitoring tools that RIM could have used to spot that the switch was in trouble before the service was so critically impacted. There would have been error messages in the log files and technical performance indicators that would have shown that things were falling below KPI levels long before the switch failed.
Many commentators in the press and on Twitter expressed surprise that a global cloud service provider could experience such an outage. We all expected that RIM would have more robust systems and better failover technology in place so that, even if a core switch failed, the service could carry on without customers seeing any disruption.
A global cloud-based service such as RIM’s needs to be multi-modal, with application performance management in place, so that CIOs can see whether the system is serving more requests than it can handle. Service providers need to be able to simulate user connections and response times, so that they can spot potential service disruptors early enough to prevent them having an impact on end users.
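To make the idea of simulating user connections concrete, here is a minimal sketch of a synthetic probe in Python. It times one simulated user transaction and flags a breach if the transaction fails or exceeds a response-time threshold. The `check` callable, the function name and the two-second threshold are all illustrative assumptions, not anything RIM is known to have used; in practice `check` would log in and fetch a test message.

```python
import time


def run_probe(check, threshold_seconds=2.0):
    """Time one simulated user transaction.

    `check` is a hypothetical callable that performs the transaction
    and raises an exception if it fails.
    """
    started = time.perf_counter()
    try:
        check()
        succeeded = True
    except Exception:
        # Any failure of the simulated transaction counts as a breach.
        succeeded = False
    elapsed = time.perf_counter() - started
    return {
        "succeeded": succeeded,
        "elapsed_seconds": elapsed,
        "breached": (not succeeded) or elapsed > threshold_seconds,
    }
```

Run on a schedule from several locations, a probe like this reports what end users are experiencing rather than what individual components report about themselves.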
I would advise putting a monitor into the Slough data centre to pull in data from around the world. Third-party providers can do this and deliver an average use pattern, so that increases in use over time can be observed, as well as trends month on month and year on year. Using this data, service providers can spot spikes in usage or other anomalies that would alert them when the infrastructure is subjected to unusually heavy loading.
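As a rough illustration of spotting spikes against an average use pattern, the Python sketch below compares each usage sample with the mean and standard deviation of the samples that preceded it. The window size, the threshold of three standard deviations and the function name are assumptions chosen for the example. A real monitoring product would also account for daily and weekly seasonality.

```python
from statistics import mean, stdev


def find_spikes(samples, window=24, threshold=3.0):
    """Return the indices of samples that sit more than `threshold`
    standard deviations above the mean of the preceding `window` samples."""
    spikes = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any increase as a spike.
            if samples[i] > mu:
                spikes.append(i)
            continue
        if (samples[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Illustrative data: 24 hourly request counts near 100, then a surge.
hourly_requests = [100, 102, 98, 101, 99, 100, 103, 97] * 3 + [250]
print(find_spikes(hourly_requests))  # [24]
```

The data here is invented purely to show the mechanics. The point is that the alert is driven by deviation from the service's own history rather than by a fixed limit.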
If an alert indicates a critical event such as an imminent switch failure, the service provider can take remedial action. That could mean making sure that failover systems work seamlessly and customers don’t notice any degradation of service, or it could mean warning customers of upcoming disruption.
Communicating with customers and reducing the load on the system would allow the service provider to put more focus on maintaining mission-critical services such as email. Several CIOs made the point that many other service providers have suffered outages; what frustrated them most was the lack of communication from RIM. The RIM service disruption bore the hallmarks of a provider that hit the incident response button too late, which suggests that ongoing monitoring was not taking place.
Other commentators have queried why the service went down on Monday, was partially restored on Tuesday and was then disrupted again on Wednesday. The way in which services are brought back online is key after a failure. The priority of the incident manager would have been to get the service up and running again, even if that meant using an inelegant solution on a temporary basis.
When the service was restored on Tuesday 11th October, the load of millions of BlackBerry devices reconnecting at once probably caused it to fall over again. The issue on Tuesday was likely one of sheer volume rather than a secondary material defect or configuration failure.
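One common way to soften a reconnection surge of this kind is for each client to wait a randomised, exponentially growing delay before retrying, often called exponential backoff with jitter. The Python sketch below shows the general technique only. It is not a description of how BlackBerry devices or RIM's infrastructure behave, and the base delay, cap and function names are assumptions.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Seconds to wait before retry number `attempt` (starting at 0).

    The ceiling doubles with each attempt up to `cap`, and the actual
    delay is drawn uniformly between zero and that ceiling.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def schedule_reconnects(device_count, attempt, seed=None):
    """Return sorted reconnect delays for a population of devices."""
    rng = random.Random(seed)
    return sorted(reconnect_delay(attempt, rng=rng) for _ in range(device_count))


# On a sixth retry, 1,000 devices are spread across a window of up to
# 64 seconds instead of all arriving at the same instant.
delays = schedule_reconnects(1000, attempt=6, seed=1)
print(f"first: {delays[0]:.1f}s, last: {delays[-1]:.1f}s")
```

Spreading retries in this way trades a slightly slower recovery for individual users against a much lower peak load on the restored service.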
So what can we learn from the outage?
My advice to cloud service providers that have been alarmed by the RIM outage would be to look at resilience across the whole stack rather than just at the storage layer or network layer. Consider ongoing monitoring of your end-to-end service: from data centre to end user device.
You can’t change your whole infrastructure overnight. However, you can manage operational threats by changing how you look at your service and making the necessary adjustments to maintain an acceptable user experience.
My advice to businesses is that you cannot abdicate responsibility to your cloud service providers. If email is a mission-critical service for your business, ensure that you have your own service level monitoring in place, along with an alternative service to fall back on if your service provider suffers an outage. Failures will happen, so plan for them and put contingency measures in place so that your business is not disrupted.