Alert fatigue, part 5: fine-tuning & silencing

This is part 5 in a series on alert fatigue. Catch up on parts 1, 2, 3, and 4.

By now you’ve learned about reducing the sheer number of alerts you’re getting, as well as automated triage and remediation. In this post, I’ll go into some extra steps you can take to further fine-tune Sensu and cut down on alert fatigue.

You’ll learn about:

  • Flap detection, or detecting hosts and services that are "flapping," AKA changing state too frequently
  • Silencing the checks and clients you know you're addressing
  • Safe Mode, which reduces alerting on non-issues
  • Extending handler configurations, AKA customizing Sensu's default handler configs

Flap detection

Honestly, I should know more about flap detection, but in my experience it’s tuned more by instinct and observation than by pure math. Sensu uses the same flap detection algorithm as Nagios.

There are two levers to tweak until you’re happy:

{
  "checks": {
    "check_cpu": {
      "command": "check-cpu.rb -w 80 -c 90 --sleep 5",
      "subscribers": ["base"],
      "interval": 30,
      "low_flap_threshold": ":::cpu.low_flap_threshold|25:::",
      "high_flap_threshold": ":::cpu.high_flap_threshold|50:::"
    }
  }
}

As with other settings, you can define default thresholds and override them on specific clients with different workloads.
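To build some intuition for those two thresholds, here is a minimal Ruby sketch of the Nagios-style calculation: the last 21 check states are kept, each of the 20 possible state transitions is weighted linearly from 0.8 (oldest) to 1.2 (newest), and the weighted percent state change is compared against the thresholds. The weights and history length mirror Nagios’s documented algorithm; the helper names are mine.

```ruby
# Hedged sketch of the Nagios-style flap calculation (helper names are mine).
# Nagios keeps the last 21 check states; each of the 20 possible transitions
# is weighted linearly from 0.8 (oldest) to 1.2 (most recent).
def percent_state_change(states)
  transitions = states.each_cons(2).to_a
  weighted = transitions.each_with_index.sum do |(prev, curr), i|
    weight = 0.8 + (0.4 * i) / (transitions.size - 1)
    prev == curr ? 0.0 : weight
  end
  (weighted / transitions.size) * 100.0
end

# Flapping starts above high_flap_threshold and only stops again once the
# percentage drops below low_flap_threshold (hysteresis).
def flapping?(states, already_flapping, low: 25, high: 50)
  pct = percent_state_change(states)
  already_flapping ? pct >= low : pct >= high
end

bouncing = [0, 2] * 10 << 0   # OK/CRITICAL on every check: 21 states
steady   = [0] * 20 << 2      # one recent failure, otherwise quiet

puts percent_state_change(bouncing)  # changes on every transition
puts flapping?(steady, false)        # a single state change is not flapping
```

The hysteresis between the two thresholds is why a service has to calm down well below the level that triggered flapping before Sensu considers it stable again.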

Silencing / Maintenance

Maintenance is part of our everyday lives, and while we strive for zero-downtime maintenance, sometimes downtime is unavoidable. Be a good citizen on your team and silence the checks and clients you know you’re updating to avoid paging the on-call engineer. Failure to do so may result in your teammates being unhappy with you and branding you an “asshole” (or worse). Sensu provides an API for silencing subscriptions and checks, and from version 1.2 on, it lets you specify a start time for your scheduled maintenances.

A maintenance window typically starts with a request like this:

$ curl -s -i -X POST \
-H 'Content-Type: application/json' \
-d '{"subscription": "load-balancer", "check": "check_haproxy", "expire": 3600, "begin": "TIME_IN_EPOCH_FORMAT", "reason": "Rolling LB restart" }' \
http://localhost:4567/silenced

HTTP/1.1 201 Created
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept, Authorization
Access-Control-Allow-Methods: GET, POST, PUT, DELETE, OPTIONS
Access-Control-Allow-Origin: *
Connection: close
Content-length: 0

The above curl command illustrates how easy it is to create a silence. Please note the expire key: never submit a silence without a specific deadline, as it will surely come back to bite you later. I’ve seen it happen: we had real impact, but no one was alerted. I recommend silencing for no more than 24 hours at a time. You can also leverage the expire_on_resolve key, which clears the silence as soon as the check resolves.

When you set a maintenance or silence without an expire, you leave yourself open to exactly this.
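To make that concrete, here is a hedged Ruby sketch (Sensu’s own language) of both halves: creating a silence with expire_on_resolve set, and clearing it early via the /silenced/clear endpoint if maintenance wraps up ahead of schedule. The host, port, and names mirror the curl example above; silenced entry ids take the form subscription:check.

```ruby
require 'json'
require 'net/http'

API = URI('http://localhost:4567')

# Create a silence that also clears itself the moment the check resolves.
create = Net::HTTP::Post.new('/silenced', 'Content-Type' => 'application/json')
create.body = JSON.generate(
  'subscription'      => 'load-balancer',
  'check'             => 'check_haproxy',
  'expire'            => 3600,            # hard deadline: always set one
  'expire_on_resolve' => true,            # clear when check_haproxy recovers
  'reason'            => 'Rolling LB restart'
)

# Clear the silence early if the rolling restart finishes ahead of schedule.
clear = Net::HTTP::Post.new('/silenced/clear', 'Content-Type' => 'application/json')
clear.body = JSON.generate('id' => 'load-balancer:check_haproxy')

# Uncomment to actually submit against a running Sensu 1.x API:
# Net::HTTP.start(API.hostname, API.port) do |http|
#   http.request(create)  # 201 Created on success
#   http.request(clear)   # 204 No Content once the entry is cleared
# end
```

Clearing a silence you no longer need is the same good citizenship as setting one in the first place: it puts the on-call engineer’s eyes back on the service as soon as you’re done.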

Of course, you can also use Uchiwa to schedule silences if you are one of those GUI-inclined folks.

Safe Mode

This is a feature I sadly had to omit from the talk to make it fit in the time allotted. It’s honestly more of a security feature, but it has a useful side effect for fighting alert fatigue. Safe Mode tells the client to execute a check requested by the server only if a matching check definition also exists locally on the client. This helps prevent an attacker with a foothold in your environment from using Sensu to execute malicious checks and spread to other nodes.

Wondering how this relates to alert fatigue? Let’s say you have a process where machines start from a base image and are then brought into the desired state by provisioning tools such as Chef, Puppet, or Ansible. That process may not be instantaneous: when the Sensu client comes up and matches a subscription, it starts scheduling checks immediately, perhaps before provisioning has finished laying down the check definitions, monitoring plugins, or other services the checks depend on. Safe Mode prevents checks, mutators, and handlers from firing until the local definition is in place, which shrinks the window of opportunity to alert on a non-issue. It’s a great feature that solves multiple problems at once.

Configuring Safe Mode is quite easy: you just enable the following in your client file (typically located in /etc/sensu/conf.d/client.json):

{
  "client": {
    "name": "i-424242",
    "address": "8.8.8.8",
    "subscriptions": ["dns_lb"],
    "safe_mode": true
  }
}

And then add your check definitions to the server and to the appropriate clients.
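For example, reusing the check_cpu definition from the flap detection section, a copy of that definition would also need to live on the client (say, in /etc/sensu/conf.d/check_cpu.json; the path and check name here just mirror the earlier example) before the client will honor the server’s request under Safe Mode:

```json
{
  "checks": {
    "check_cpu": {
      "command": "check-cpu.rb -w 80 -c 90 --sleep 5",
      "subscribers": ["base"],
      "interval": 30
    }
  }
}
```

If your provisioning tool lays down check definitions and plugins together, Safe Mode guarantees the check can’t fire before both are present.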

Handler config

Sensu ships with sensible defaults for handler configuration. Here’s an example of overriding some of those defaults so that a particular handler acts on additional events:

{
  "handlers": {
    "single_pane": {
      "type": "pipe",
      "command": "single_pane.rb --message 'sensu event' https://domain.tld:port",
      "handle_silenced": true,
      "handle_flapping": true
    }
  }
}

In this scenario, we’re not alerting or remediating anything; we’re feeding a single-pane-of-glass service (such as BigPanda), so we want to receive flapping and silenced events as well.
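For context, a pipe handler like the hypothetical single_pane.rb above simply receives the event JSON on STDIN. A minimal sketch, where the field names follow the Sensu 1.x event format and the summary shape is my own invention (in production you’d call summarize($stdin.read) and forward the result to your aggregation service):

```ruby
require 'json'

# Minimal pipe-handler sketch: Sensu writes the event JSON to the handler's
# STDIN. With handle_silenced/handle_flapping enabled, silenced and flapping
# events reach this point too, so the single pane of glass sees everything.
def summarize(event_json)
  event = JSON.parse(event_json)
  {
    'client' => event.dig('client', 'name'),
    'check'  => event.dig('check', 'name'),
    'status' => event.dig('check', 'status'),
    'output' => event.dig('check', 'output')
  }
end

sample = {
  'client' => { 'name' => 'i-424242' },
  'check'  => { 'name' => 'check_cpu', 'status' => 1, 'output' => 'CPU 85%' }
}
puts JSON.generate(summarize(JSON.generate(sample)))
```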

Closing thoughts

I hope you found these tips useful for reducing your alerts and improving your engineers’ happiness at work. And while this series offers a curated tour of Sensu capabilities aimed at reducing or eradicating alert fatigue, there are plenty of other great Sensu features to explore. I wrote this series in the context of Sensu 1.x, but many of these features have carried over to Sensu Go, in many cases improved upon. I hope to cover them in a future post. The one feature that changes drastically in power and ease of use is filters: you can’t lean on Ruby’s eval or anything similar, since Go is a compiled language. The Sensu community and engineering team are working to make filters both easier and better to use, and you can accomplish most of the same things by writing a gRPC client.

Stay tuned for how to reduce alert fatigue with Sensu Go, and for now — happy monitoring!

Monitoring Community DevOps Monitoringlove