alerting 098 - the basics

May 17, 2024 | 2 minutes read

Someone asked today, roughly:

I need some advice on getting a handle on our alerting situation. Right now, it feels like a complete mess. We’ve got all our alerts just firing into a couple of Slack channels, and it’s an absolute firehose of chaos. Huge blocks of text getting dumped in left and right, no sense of organization or being able to follow what’s going on. Messages with no clear ownership or appropriate team tagging.

It’s become impossible to get any clarity on whether issues are getting resolved or not. I’ll scroll back through and see the same errors repeating over and over with no indication that anyone is even working on them. We need a better way to bring some sanity to this process. What are others using to get alerting under control? I’m open to suggestions here, because our current approach is just not cutting it.

Alerting 098 - The Basics:

  • If it’s not actionable, turn it off (so turn them all off).
  • If it’s actionable:
    • Send it to a human, not to a Slack channel that's a sorry excuse for a log file.
    • Send it to the right human.
    • Include enough context to allow them to take action; don’t make them dig for details (include the contextual information that triggered the alert, not just “hey, it’s red”).
  • If it IS actionable, and you perform the same action in response every time, automate the action and turn off the alert (change the threshold to catch it only when the automation can’t handle it).
  • Stuff that’s not immediately actionable isn’t an alert - but that doesn’t mean it’s not valuable. Those are what dashboards, metrics, and log systems are for - correlation and history. Alerts aren’t for that.
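The rules above can be sketched as a tiny routing function. This is a minimal illustration, not a real alerting API; every name here (`Alert`, `try_auto_remediate`, the playbook entries) is hypothetical.

```python
# A sketch of the routing rules above. All names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Alert:
    name: str
    team: str                # clear ownership -- every alert names its team
    actionable: bool
    context: dict = field(default_factory=dict)  # what triggered it, not just "it's red"

def try_auto_remediate(alert: Alert) -> bool:
    """If the response is always the same action, do it here instead of paging."""
    known_fixes = {"disk-almost-full": "rotate-logs"}  # hypothetical playbook
    return alert.name in known_fixes

def route(alert: Alert) -> str:
    if not alert.actionable:
        return "dropped"              # not actionable -> turn it off
    if try_auto_remediate(alert):
        return "auto-remediated"      # same action every time -> automate, don't page
    if not alert.context:
        raise ValueError(f"{alert.name}: refusing to page without context")
    return f"page:{alert.team}"       # the right human, with details attached
```

The point of the sketch is the order of the checks: drop first, automate second, and only page a human when there is both an owner and enough context for them to act.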

After that? There’s a lot more you can do. But this is your starting point: this is how you turn alerting from a firehose of chaos into a controlled, manageable tool that actually helps your team.