If you deal with any kind of monitoring or alerting system, you should care about alert quality. Either the people getting the bad alerts are on your case about it, or you’re the person getting them.
Why does alert quality matter?
When that pager goes off in the hour of the wolf, the responsible person should wake smelling a whiff of brimstone and shiver like a Ringwraith has just passed. Because bad things have happened, are happening, or are about to happen. But the pager only keeps that authority if the alerts are good. Feed people bad alerts and they’ll just learn to ignore all of them.
What are quality alerts?
Alerts that are
- Actionable. Something can be done about the condition that generated the alert. This may not always mean that it can be fixed right away.
- Targeted. The person who gets the alert is the person who can do something about it.
- Informative. The alert should contain enough information to be understood on its own, without opening another console. Many SCOM management packs are fairly bad at this, especially the ones that say “See alert context for more details”.
- Non-noisy. The alert should not cry wolf all the time. There is a trade-off to be made between how quickly you get alerted to a bad condition and how noisy the alerts are for “normal” transient conditions.
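To make the speed-versus-noise trade-off concrete, here is a minimal sketch in Python of one common way to quiet transient conditions: only alert after several consecutive samples breach a threshold. The class name, the threshold, and the sample values are all made up for illustration; this is not how SCOM implements it.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveBreachDetector:
    """Alert only after `required` consecutive samples exceed `threshold`."""
    threshold: float
    required: int
    _streak: int = 0  # number of breaching samples seen in a row

    def observe(self, value: float) -> bool:
        """Record one sample; return True on the sample that should alert."""
        if value > self.threshold:
            self._streak += 1
        else:
            self._streak = 0  # a healthy sample resets the count
        # Fire once, at the moment the streak reaches the required length.
        return self._streak == self.required


# Hypothetical CPU readings: one transient spike, then a sustained breach.
detector = ConsecutiveBreachDetector(threshold=90.0, required=3)
samples = [95, 40, 96, 97, 98, 99]
fired = [detector.observe(s) for s in samples]
```

With `required=3`, the lone spike at the start is ignored and the alert fires on the third sample of the sustained breach. A higher `required` means fewer false alarms but a longer wait before anyone is paged about a real problem.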
How can we improve alert quality?
One way would be to implement an alert quality reporting framework. We would want it to be:
- Simple. No extra clients/websites to log in to.
- Quick. The minimum required response is good or bad, with additional detail possible.
- Network agnostic. Alert recipients should be able to respond from wherever they got the alert, whether it’s their iPhone via a text message or Outlook via corporate e-mail.
A simple way to do this in SCOM would be to write a web service and inject a link to it into each alert, with the alert parameters in the URL. Stay tuned for more on this.
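In the meantime, here is a rough Python sketch of what such links might look like. It is only an illustration of the idea: the endpoint `https://alerts.example.com/feedback` and the parameter names `alert`, `to`, and `rating` are hypothetical, and nothing here is a SCOM API.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical feedback endpoint; a real one would be your own web service.
FEEDBACK_BASE = "https://alerts.example.com/feedback"


def feedback_links(alert_id: str, recipient: str) -> dict[str, str]:
    """Build one 'good' and one 'bad' link to embed in an alert message."""
    links = {}
    for rating in ("good", "bad"):
        # urlencode escapes characters such as '@' in the recipient address.
        query = urlencode({"alert": alert_id, "to": recipient, "rating": rating})
        links[rating] = f"{FEEDBACK_BASE}?{query}"
    return links


def parse_feedback(url: str) -> dict[str, str]:
    """What the web service would do on a click: recover the parameters."""
    params = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in params.items()}


links = feedback_links("abc-123", "oncall@example.com")
vote = parse_feedback(links["bad"])
```

Because the response is just a link, it works the same from a text message on a phone or from Outlook: one tap records good or bad, and the landing page can offer a free-text box for anyone who wants to add detail.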