My last post dealt with monitoring and insight, reacting and optimizing, as the two sides of the automation coin. Monitoring and reacting are not enough when you are dealing with events; you also have to analyze and predict them as far as possible.
This is especially true when the event takes the shape of an error. Thinking about application assurance means thinking about how to handle change, and not necessarily about how to deal with alerts or trouble tickets that pop up in your IT monitoring or business service management solution. By the time a problem occurs, you are already on the reaction side of the automation coin, trying to reduce the time it takes to fix it. The better and more sustainable approach to change is to think about how we can flip this coin and prevent errors before they occur.
Of course, there is no perfect situation, and unforeseeable events happen all the time. You will therefore never get rid of the monitoring and reaction side. But to talk seriously about application assurance, you should at least be able to keep an eye on both: what is currently going on and what is coming up.
Proper alert reaction needs insight
Take, for example, a job that is scheduled to start in five minutes. Then, suddenly, an alert comes from your monitoring tool: the database load is too high at the moment, and the service the job depends on will fail or at least slow down. Starting a manual investigation at this point is a kamikaze mission. But if you have pattern-based rules, you can define options that are run through automatically.
Note that you need a lot of insight into the whole system to answer the question of whether to reschedule the job once the database load drops below 50% or to immediately allocate additional resources on a virtual basis. First, you have to know the latest possible time the job can start without causing subsequent errors. Second, you have to evaluate the job and know all of its related SLAs (Service Level Agreements) to decide whether it is even worth the effort to allocate additional resources.
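To make this concrete, here is a minimal sketch in Python of what such a pattern-based rule could look like. Everything in it (the function names, the 50% threshold, the assumed recovery slack, the cost figures) is a hypothetical illustration of the decision flow, not the API of any real monitoring or scheduling product:

```python
from datetime import datetime, timedelta

# Hypothetical rule parameters, not taken from any real product.
DB_LOAD_OK = 0.50                          # acceptable database load: below 50%
RESCHEDULE_SLACK = timedelta(minutes=10)   # assumed time for the load to recover

def react_to_load_alert(db_load: float,
                        latest_start: datetime,
                        sla_penalty: float,
                        scale_out_cost: float) -> str:
    """Pick a reaction when the 'database load too high' alert fires.

    db_load        -- current database load as a fraction (0.0 to 1.0)
    latest_start   -- latest time the job can start without subsequent errors
    sla_penalty    -- cost of missing the job's SLAs
    scale_out_cost -- cost of allocating additional virtual resources
    """
    now = datetime.now()
    if db_load < DB_LOAD_OK:
        return "start"                      # load is already acceptable
    if now + RESCHEDULE_SLACK <= latest_start:
        return "reschedule"                 # enough slack to wait for lower load
    if sla_penalty > scale_out_cost:
        return "allocate extra resources"   # an SLA breach costs more than scaling
    return "accept delay"                   # scaling is not worth the effort
```

The point of the sketch is that both pieces of insight from above appear as inputs: the latest possible start time bounds how long you can wait, and the SLA penalty decides whether extra resources pay off.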
Don’t forget: this insight must be available, and must automatically lead to a decision, at the moment the alert happens. And even then you may be running out of time. Take the same job scheduled not in five minutes but in two seconds, which in daily operations is often all the time that remains between reaching a threshold (e.g. 80% CPU usage) and the service going down.