Alert fatigue, part 3: automating triage & remediation with check hooks & handlers

This is part 3 in a series on alert fatigue. Read up on parts 1 and 2.

In many cases, when you’re monitoring a particular state of a system, you already know the steps to triage the problem or even fix it automatically. Let’s take a look at how we can automate this using check hooks and handlers.

Some reasons to automate this include:

  • Remove the need to wake up an operator after hours for known issues
  • Save time on common, repetitive failures by removing the need for manual intervention
  • Reduce the “tribal knowledge” required for operators to support the system; even just telling responders which logs to look at or which commands to run for more information is useful
  • Allow operators to judge an alert's urgency based on automatically retrieved information

Check hooks

I worked with Sean Porter (CTO of Sensu) to design check hooks, an awesome Sensu feature that allows you to run a command client side based on the status code of the check/event. This is a great way to bring contextual awareness into the alert before sending it off to your on-call engineer. One use case includes checking connectivity to your default gateway when outside connectivity has been cut from an instance.

Example:

{
  "checks": {
    "ping_four8s": {
      "command": "check-ping.rb -h 8.8.8.8 -T 5",
      "subscribers": ["base"],
      "interval": 5,
      "hooks": {
        "non-zero": {
          "command": "ping -c 1 `route -n | awk '$1 == \"0.0.0.0\" { print $2 }'`"
        }
      }
    }
  }
}
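
If the awk one-liner in the hook command is hard to read, here is a minimal Python sketch of what it does: it scans the output of route -n and picks the gateway of every row whose destination is 0.0.0.0. The function name and the sample routing table are made up for illustration; the hook itself still runs the shell command shown above.

```python
def default_gateways(route_output):
    """Return the gateway column of every default-route row.

    Mirrors awk '$1 == "0.0.0.0" { print $2 }': split each line on
    whitespace and keep field 2 when field 1 is 0.0.0.0.
    """
    gateways = []
    for line in route_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "0.0.0.0":
            gateways.append(fields[1])
    return gateways


# Illustrative sample only; real `route -n` output depends on the host.
sample = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.10.10.1      0.0.0.0         UG    0      0        0 eth0
10.10.10.0      0.0.0.0         255.255.255.0   U     0      0        0 eth0
"""

print(default_gateways(sample))  # ['10.10.10.1']
```

The hook then pings that address once, so the alert that reaches the on-call engineer already says whether the instance can still reach its own gateway.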

You might be thinking, “Well that's cool, and it looks useful enough to go beyond triage and jump right into auto remediation.” You would not be alone in this assumption, but check hooks have some disadvantages compared with other approaches. The biggest is that check hooks run client side and therefore lack any context beyond “this was the status of the last run command.” For example, a check hook that restarts a service whenever its process is not running can have unintended consequences. Let’s say you have to stop said process to perform some kind of offline maintenance (and you’re also “not an asshole”), so you create a silence to prevent it from alerting the on-call engineer. The problem is that the client has no access to this context and will restart the process as soon as it detects it’s down, which could leave you in a bad state. “OK,” you’re thinking. “That makes sense. So how do I do auto remediation properly then?”

Handlers

Let’s look for an alternate solution using automation.

In Sensu, a handler is a piece of code that runs whenever the Sensu event pipeline deems it should. It can do almost anything (send metrics, send notifications, or perform auto remediation, to name a few); let's look at how we can leverage some Sensu internals with a handler.

There are essentially two parts to the config after you’ve installed the remediator handler from the somewhat goofily named sensu-plugins-sensu gem.

Let’s start with defining our check configuration:

{
  "checks": {
    "check_process_foo": {
      "command": "check-process.rb -p foo",
      "subscribers": ["foo_service"],
      "handlers": ["pagerduty", "remediator"],
      "remediation": {
        "foo_process_remediate": {
          "occurrences": ["1-5"],
          "severities": [2]
        }
      }
    }
  }
}

The majority of this check definition is pretty standard setup, so let’s focus on the remediation object. Aside from adding the remediator handler to the check's handlers, we define foo_process_remediate and tell it to run only when the event's occurrence count is between 1 and 5 and the event is critical (status code 2).
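
To make the matching rule concrete, here is a minimal Python sketch of how a handler could decide whether a remediation applies to an event. This is not the remediator plugin's actual code: the function names are hypothetical, and the sketch assumes each occurrences entry is either an integer or an "N-M" range string like the "1-5" above.

```python
def occurrence_matches(spec, occurrences):
    """Return True if the occurrence count satisfies one spec entry.

    Assumption: a spec entry is an int (exact match) or an "N-M"
    string (inclusive range), as in the check definition above.
    """
    if isinstance(spec, int):
        return occurrences == spec
    low, high = (int(part) for part in spec.split("-", 1))
    return low <= occurrences <= high


def should_remediate(rule, occurrences, status):
    """Apply a remediation rule to an event's occurrence count and status."""
    severity_ok = status in rule["severities"]
    occurrence_ok = any(
        occurrence_matches(spec, occurrences) for spec in rule["occurrences"]
    )
    return severity_ok and occurrence_ok


rule = {"occurrences": ["1-5"], "severities": [2]}

print(should_remediate(rule, occurrences=3, status=2))  # True: critical, 3rd occurrence
print(should_remediate(rule, occurrences=6, status=2))  # False: past the 1-5 window
print(should_remediate(rule, occurrences=3, status=1))  # False: warning, not critical
```

Capping the occurrence range is what keeps the remediation from retrying forever: after the fifth consecutive failure the remediation stops firing and the problem is left to the other handlers on the check.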

Note: if you’re unfamiliar with Sensu check definitions, check out the Sensu documentation.

For the second portion, we need to leverage a little-known feature of Sensu: the “unpublished check.” The Sensu scheduler will not run an unpublished check on an interval, even if one is defined; the check only runs when something requests it, such as an event or an API call. This can be used either for fully automated remediation or to create automated fixes that an engineer can fire off with an API call (after verifying the situation and deciding that a particular remediation will resolve the issue). This is controlled by the publish key, which defaults to true, meaning Sensu will automatically schedule the check.

For example:

{
  "checks": {
    "foo_process_remediate": {
      "publish": false,
      "command": "sudo -u sensu service foo start",
      "subscribers": ["foo_service", "client:CLIENT_NAME"],
      "handlers": ["pagerduty"],
      "interval": 10
    }
  }
}

The command in the example is the command you wish to run to fix the issue — in this case, starting the service as the Sensu user. In a moment, I’ll cover giving limited escalated privileges to Sensu.

You might be wondering what's up with that wacky subscription of client:CLIENT_NAME. The client: prefix denotes the internal subscription that Sensu creates for each individual client, alongside the subscriptions you assign to it. I omitted this in my first implementation and was in for a fun surprise: when a failure was detected on one client, Sensu restarted the service on all clients with the matching subscription name, which meant all my web servers restarted, causing a minor outage. I learned my lesson: once I restricted the remediation to the affected client, it worked perfectly.
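
The same per-client targeting applies to the manual option mentioned earlier, where an engineer fires off the remediation with an API call. Below is a minimal Python sketch of such a request. It assumes the Sensu 1.x API's POST /request endpoint, an unauthenticated API listening on localhost:4567, and the client name i-424242; adjust all three for your environment. The helper names are hypothetical.

```python
import json
import urllib.request


def build_check_request(api_url, check, client_name):
    """Build (but do not send) a request that runs `check` on one client.

    Targeting the "client:<name>" subscription limits execution to the
    affected client instead of every member of a shared subscription.
    """
    payload = {"check": check, "subscribers": [f"client:{client_name}"]}
    return urllib.request.Request(
        url=f"{api_url}/request",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=5):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Assumed values for illustration; nothing is sent until send(req) is called.
req = build_check_request("http://localhost:4567", "foo_process_remediate", "i-424242")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Keeping the build and send steps separate makes it easy to review exactly which client a remediation will hit before anything runs.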

We also want to specify an alert handler in case the remediation fails, like we do with PagerDuty in the example above.

In order for the Sensu process to be able to restart the foo service, we need to configure our system to allow that, as the foo service is owned by the foo user — because running everything as root is a bad idea.

We turn to the good old sudoers configuration to accomplish this. The command will change depending on which process manager you’re running on the system. Here’s an example for both sysv-init and systemd that allows starting or restarting select services.

sensu ALL=(root) NOPASSWD:/usr/sbin/service foo start
sensu ALL=(root) NOPASSWD:/bin/systemctl start collector
sensu ALL=(root) NOPASSWD:/bin/systemctl restart collector
sensu ALL=(root) NOPASSWD:/bin/systemctl start chef-client
sensu ALL=(root) NOPASSWD:/bin/systemctl restart chef-client

While you can write this into the /etc/sudoers file directly (via the visudo command), I suggest writing it to /etc/sudoers.d/sensu with your config management of choice to keep it clean and avoid issues when upgrading system packages. There are many resources that break down the syntax of sudoers config, but the short version is that we let the sensu user execute those commands as root without requiring a password. Be careful with wildcards, as they can lead to argument expansion attacks.
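
If your config management lets you run a small script, one way to enforce the wildcard advice is to generate the entries from an explicit list and refuse anything risky. The following Python sketch is only an illustration under that assumption: render_sudoers is a hypothetical helper, and the rule format simply copies the entries shown above.

```python
def render_sudoers(user, commands):
    """Render one NOPASSWD sudoers entry per fully spelled-out command.

    Rejects relative paths and shell-style wildcards so that every
    permitted command line is listed explicitly.
    """
    lines = []
    for command in commands:
        if not command.startswith("/"):
            raise ValueError(f"use an absolute path: {command!r}")
        if any(ch in command for ch in "*?[]"):
            raise ValueError(f"wildcards are not allowed: {command!r}")
        lines.append(f"{user} ALL=(root) NOPASSWD:{command}")
    return "\n".join(lines) + "\n"


allowed = [
    "/bin/systemctl start collector",
    "/bin/systemctl restart collector",
]
print(render_sudoers("sensu", allowed), end="")
```

Whatever generates the file, it is worth validating it with visudo -cf followed by the file path before it goes live, since a syntax error in a sudoers file can break sudo for everyone on the host.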

To continue setting up remediation, you'll need to define the handler and add a few pieces to the client.json configuration. Assuming the plugin lives in /etc/sensu/plugins/sensu-remediator.rb, the handler config should be:

{
  "handlers": {
    "remediator": {
      "command": "/etc/sensu/plugins/sensu-remediator.rb",
      "type": "pipe",
      "severities": [
        "critical",
        "warning"
      ]
    }
  }
}

The client definition must include the subscription we previously defined:

{
  "client": {
    "address": "10.10.10.10",
    "name": "i-424242",
    "safe_mode": true,
    "subscriptions": [
      "base",
      "foo_service"
    ]
  }
}

In part 4 of this series, I’ll go into alert consolidation, which helps responders focus on what's important.

Editor's note: looking for more alert fatigue relief? Skip ahead to part 5 (fine-tuning & silencing).

 
