Alert fatigue, part 2: alert reduction with Sensu filters & token substitution

In my previous post, I talked about the real costs of alert fatigue — the toll it can take on your engineers as well as your business — and some suggestions for rethinking alerting. In part 2 of this series, I’ll share some best practices for fine-tuning Sensu to help reduce alert fatigue.

Keep in mind: if you’re not using these features yet, it just means you have room to improve your alerting quality of life. I learned much of this the hard way, and it’s also likely that new features have been added since you first implemented Sensu.

Let’s take a look at the various features Sensu offers to reduce the number of alerts you receive — without reducing what you're monitoring. We'll cover:

  • Setting per-node thresholds with sane defaults
  • Filtering events
  • Consolidating alerts
  • Creating dependencies
  • Aggregating results
  • Tuning alerting for flapping services
  • Silencing alerts
  • Preventing alerts from nodes during the bootstrap process

Token substitution

You might be thinking, “Isn’t that what you use to specify secrets?” While that’s one use case, token substitution has many others, such as setting per-node thresholds.

Take the following example of a Sensu check definition:

{
  "checks": {
    "check_cpu": {
      "command": "check-cpu.rb -w \":::cpu.warn|80:::\" -c \":::cpu.crit|90:::\" --sleep 5",
      "subscribers": ["base"],
      "interval": 30,
      "occurrences": ":::cpu.occurrences|4:::"
    }
  }
}

The ":::token|default:::" syntax tells Sensu to use the matching client-configured value if one exists, falling back to the default otherwise. Here’s an example of an ETL node that overrides the thresholds to accommodate its heavier workload:

{
  "client": {
    "name": "i-424242",
    "address": "10.10.10.10",
    "subscriptions": ["base", "etl"],
    "safe_mode": true,
    "cpu": {
      "crit": 100,
      "warn": 95,
      "occurrences": 10
    }
  }
}
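To make the mechanics concrete, here’s a minimal sketch of how ":::token|default:::" resolution could work — this is an illustration, not Sensu’s actual implementation, and the `substitute_tokens` helper is hypothetical:

```ruby
# Hypothetical sketch: resolve ":::path|default:::" tokens against a
# client definition hash. Dotted paths like "cpu.warn" walk nested keys.
def substitute_tokens(command, client)
  command.gsub(/:::([^|:]+)(?:\|([^:]*))?:::/) do
    path    = Regexp.last_match(1)
    default = Regexp.last_match(2)
    # Walk the dotted path through the client hash; nil if any key is missing
    value = path.split('.').reduce(client) { |h, k| h.is_a?(Hash) ? h[k] : nil }
    (value || default).to_s
  end
end

client = { 'name' => 'i-424242', 'cpu' => { 'warn' => 95, 'crit' => 100 } }
puts substitute_tokens('check-cpu.rb -w :::cpu.warn|80::: -c :::cpu.crit|90:::', client)
# => check-cpu.rb -w 95 -c 100
```

For the ETL client above, `cpu.warn` resolves to 95 and `cpu.crit` to 100; for a client with no `cpu` attribute, the defaults 80 and 90 are used instead.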

Filters

In Sensu 1.x, filters are incredibly powerful and can be used to determine if a mutator or handler is run. Both mutators and handlers are computationally expensive, so adding a filter reduces the load on the Sensu server(s), thereby reducing the alerts from the Sensu server itself. Handlers can be used for automating pretty much any process, although they are most commonly used to notify responders via communication channels such as Slack, email, PagerDuty, etc.

Filters can be inclusive or exclusive:

Inclusive filtering allows mutators and handlers to run if the condition of the filter matches. It’s controlled by setting the filter definition attribute "negate": false. The default is inclusive filtering.

The following is an example of a filter designed to only alert on weekdays from 10am to 10pm Eastern Time:

{
  "filters": {
    "ten_to_ten_eastern": {
      "negate": false,
      "attributes": {
        "timestamp": "eval: ENV['TZ'] = 'America/New_York'; [1,2,3,4,5].include?(Time.at(value).wday) && Time.at(value).hour.between?(10,22)"
      }
    }
  }
}

In this example, we’re leveraging Ruby’s eval, which allows executing arbitrary Ruby code. Because our servers store time in UTC, we set our timezone explicitly (to avoid having to look up UTC offsets) and check whether the event fired during the specified days and hours. Since many places in the United States observe daylight saving time (which is a terrible idea and no one likes it), using a named timezone also means we don’t have to adjust the filter twice a year.
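You can try the filter’s eval logic standalone. In a filter, `value` is the event attribute being tested — here, the event timestamp. The dates below are arbitrary examples chosen for illustration:

```ruby
# Standalone version of the filter's eval expression.
# wday 1-5 is Monday-Friday; hours 10-22 cover 10am through 10pm.
ENV['TZ'] = 'America/New_York'

def in_window?(value)
  t = Time.at(value)
  [1, 2, 3, 4, 5].include?(t.wday) && t.hour.between?(10, 22)
end

puts in_window?(Time.new(2023, 6, 14, 15, 0, 0).to_i)  # a Wednesday at 3pm => true
puts in_window?(Time.new(2023, 6, 18, 15, 0, 0).to_i)  # a Sunday at 3pm    => false
```

Because this is an inclusive filter ("negate": false), handlers only run when the expression returns true — i.e., alerts are only sent during business-ish hours.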

Exclusive filtering prevents mutators and handlers from being executed. It’s controlled by setting the filter definition attribute "negate": true.

The following is an example of a filter designed to avoid proceeding in the event pipeline when the number of occurrences of the current state is greater than the check’s occurrences attribute value. A token substitution fallback of 60 is used if the check does not define an occurrences attribute.

{
  "filters": {
    "occurrences": {
      "negate": true,
      "attributes": {
        "occurrences": "eval: value > :::check.occurrences|60:::"
      }
    }
  }
}
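Putting the pieces together, here’s a minimal sketch of the decision this exclusive filter makes — an illustration of the semantics, not Sensu’s actual code, and the `handle_event?` helper is hypothetical:

```ruby
# Hypothetical sketch of the exclusive occurrences filter: because
# "negate" is true, a matching attribute PREVENTS the handler from running.
# The event is handled only while occurrences <= the check's threshold
# (falling back to 60 when the check defines no occurrences attribute).
def handle_event?(event)
  threshold = event.dig('check', 'occurrences') || 60
  filter_matches = event['occurrences'] > threshold
  !filter_matches # exclusive filter: match => skip mutators/handlers
end

puts handle_event?('occurrences' => 3,  'check' => { 'occurrences' => 4 })  # true: still alerting
puts handle_event?('occurrences' => 61, 'check' => {})                      # false: filtered out
```

In practice this stops a long-firing check from re-alerting forever once it has already exceeded its configured occurrence threshold.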

I hope you found these tips for fine-tuning Sensu helpful. Stay tuned — in my next post, I’ll cover automating triage and remediation with check hooks and handlers.

Looking for a deeper dive? Check out the next installments in this series:

  • Part 3: automating triage & remediation
  • Part 4: alert consolidation
  • Part 5: fine-tuning & silencing
