Building the perfect monitoring solution is easy: just take a best-of-breed monitoring tool such as IBM Tivoli, install it on all your systems, and that’s job done, right? Wrong!
There’s a lot more to implementing a solution that is relevant and appropriate to your business than downloading code and rolling out agents; in fact that’s the easiest part.
The classic errors
There are a number of issues that we see catching out many monitoring projects. By being mindful of these, and the techniques to avoid them described later, you can keep your project on the straight and narrow.
Blanket Coverage: Many customers start off a monitoring project with two main beliefs:
- First, that to get the ROI from their new tool, it has to be rolled out everywhere at once, and
- Second, that every application and operating system has to be monitored for every eventuality.
They are concerned that something will go wrong and will not be picked up by the new tool, so they deploy to monitor everything, everywhere.
Too Many Events: This invariably leads to too many events, a significant proportion of which result in no useful response. Such events are worse than having no monitoring at all, because an operator has to spend time acknowledging or closing them without ever looking at the cause.
Siloed Rollout Plans: It’s all too easy to fall into the trap of rolling out monitoring based around technology silos. People often think along the lines of “we’ll do UNIX servers first, then Windows”, which seems like a reasonable approach. But when you step back and think about it, how many of your business applications run exclusively on one type of operating system? So you can inadvertently create a scenario where none of the applications are monitored properly until the whole project is finished.
Incomprehensible Messages: Imagine a scenario in which, at 3am, an operator at your company receives the following event:
An operator (without specific training) will have no idea what to do with this event and will invariably resort to calling out team after team trying to establish if the event is important or not. This can be extremely costly as the man hours spent investigating an event like this mount up.
Unclear resolution processes: Even if the event is understandable, what do you do with it? More importantly, how do you ensure that the resolution action for a particular event is consistent, irrespective of who sees it on the console? Unless everyone knows what to do and reacts appropriately, it is highly likely that time and money are going to be wasted.
Inability to prioritise events: Faced with a long list of angry red alerts, it is impossible to know which one matters most to your business. The only reasonable thing an operator can do is start at the top and work their way down. That is not the best approach if the first event is “CPU 100% busy” but the second is “Internet Banking Down”.
How to avoid the pitfalls
So what is the solution? How can you structure a solution (and a project) to avoid these problems? The essence of an effective monitoring implementation is to determine the events to be monitored, their relative priority, and how they will be resolved before you go anywhere near installing your first agent.
Do this in the context of the Business Applications and their KPIs (Key Performance Indicators), rather than focussing on technology silos, and you will be well on the way to Monitoring Nirvana.
Decide what you really need to know about: Specifically, ask yourself which events in your infrastructure will impact your service availability (or performance) and which specific business service (and therefore which customer or group of customers) will suffer. This is where understanding your organisation’s KPIs will really help you.
Filter the noise by being selective in your monitoring: Think about the action that will be taken when the event comes in. A large number of events can be dropped by asking a very simple question: “if there is no defined action to take, then why are we sending the event?” An example would be the ubiquitous CPU 100% busy alert. If you do not perform any action when these events arrive, then don’t send them. It is better to monitor the run queue or load average to see whether a system is genuinely busy before acting on a CPU busy alert.
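As a rough illustration of the “no defined action, no event” rule, here is a minimal Python sketch. It is not tied to any particular monitoring product; the Event class, the event names, the action text, and the per-CPU load threshold of 1.5 are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    """Hypothetical event record; 'action' is the agreed response, if any."""
    name: str
    host: str
    action: Optional[str] = None


def actionable(events: List[Event]) -> List[Event]:
    """Keep only events that have a defined action; drop the rest as noise."""
    return [e for e in events if e.action]


def is_overloaded(load_avg_1min: float, cpu_count: int,
                  per_cpu_threshold: float = 1.5) -> bool:
    """Treat a host as busy when load per CPU exceeds an assumed threshold."""
    return load_avg_1min / cpu_count > per_cpu_threshold


events = [
    Event("cpu_100_busy", "web01"),  # nobody acts on this, so it is dropped
    Event("load_average_high", "web01",
          action="Check for runaway processes, then escalate to the UNIX team"),
]
forwarded = actionable(events)
```

With these sample values only the load average event reaches the console, and is_overloaded(8.0, 4) is true while is_overloaded(2.0, 4) is not. The right threshold depends on the workload and should be agreed with the people who will respond to the alert.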
Decide how important an event really is: Define the impact on the business for each event that could come in. To do this you will need to ensure that each event’s text states which application (or business service) is affected. Give careful consideration to the potential cost of not resolving the underlying issue (especially if there are SLAs with defined penalties in place). Don’t be tempted to rush this step, as it is key to achieving real value from the completed implementation, and will also result in far better severity definitions.
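One way to picture business-driven severity is to order the console by the affected service rather than by arrival time. The sketch below assumes each event carries the name of the business service it affects; the SERVICE_PRIORITY table, its numbers, and the “Internal Reporting” service are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical ranking agreed with the business: lower number = more urgent.
SERVICE_PRIORITY = {"Internet Banking": 1, "Internal Reporting": 3}
DEFAULT_PRIORITY = 5  # events with no known business service sort last


@dataclass
class Event:
    text: str
    service: Optional[str] = None  # affected business service, if known


def prioritise(events: List[Event]) -> List[Event]:
    """Order events by business impact; ties keep their arrival order."""
    return sorted(
        events,
        key=lambda e: SERVICE_PRIORITY.get(e.service, DEFAULT_PRIORITY),
    )


console = [
    Event("CPU 100% busy"),
    Event("Internet Banking Down", service="Internet Banking"),
]
ordered = prioritise(console)
```

Here the Internet Banking event moves to the top of the list even though it arrived second. In a real deployment the ranking would come from the KPI and SLA analysis described above, not from a hard-coded table.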
Make sure support staff know what to do if the event is triggered: Every event should have brief and easily accessible details of how to resolve the problem or at least details of who to pass it to.
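A simple check along these lines can be run before any agent is deployed: list every event you plan to send and confirm that each one has either resolution steps or a named team to pass it to. The runbook structure, event names, and team name below are hypothetical.

```python
from typing import Dict, List

# Hypothetical runbook: each event maps to resolution steps and/or an owner.
RUNBOOK: Dict[str, Dict[str, str]] = {
    "internet_banking_down": {
        "steps": "Restart the front-end service; verify the login page loads",
        "escalate_to": "Internet Banking support team",
    },
    "load_average_high": {"escalate_to": "UNIX team"},
}


def missing_instructions(event_names: List[str],
                         runbook: Dict[str, Dict[str, str]]) -> List[str]:
    """Return events that have neither resolution steps nor an escalation contact."""
    missing = []
    for name in event_names:
        entry = runbook.get(name, {})
        if not (entry.get("steps") or entry.get("escalate_to")):
            missing.append(name)
    return missing


planned = ["internet_banking_down", "load_average_high", "cpu_100_busy"]
gaps = missing_instructions(planned, RUNBOOK)
```

In this example gaps contains only "cpu_100_busy", which tells you that the event either needs instructions written for it or should not be sent at all.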