On the merits of pubsub & workflows (or, why Sensu over Nagios)

This is a guest post by Sr. WTW Insights Engineer & Sensu Community Member Jason Anderson. Have your own story to share? Send us a note: community@sensu.io.

Not too long ago in the Sensu Community Slack, the question: “Why Sensu instead of Nagios?” arose. Specifically, “How do I convince my boss to choose Sensu over Nagios?” I responded to the thread, but decided it was worthwhile to share my response with the wider community. At Willis Towers Watson, we moved from Nagios to Sensu 1.2 almost a year ago (and now we’re upgrading to Sensu Go). In this post I’ll share what we learned and why we migrated (and why you should too).

senzu beanSensu bean, that is.

Matching modern technologies with the pubsub model

First of all, Sensu uses the pubsub (AKA, “publish-subscribe”) model for subscription checks: because Sensu uses a message bus for communication, you can publish messages to a specific “topic” and consumers can subscribe to one or more specific topics. This is in stark contrast to Nagios, where, in order to manage who gets alerted on what, you have to set up the command blocks, configure each host to use the service templates or write some sort of config management. Said another way: with Nagios it’s very manual, whereas Sensu is naturally set up to easily customize alerts. With Sensu, you set up all your checks (commands) in JSON files and assign them to a subscription name; after that, it’s just a matter of when you bootstrap the sensu-agent on the remote host and assign it to the subscriptions you want to use.

Why is this important? The thing about legacy monitoring tools like Nagios is they were designed for infrastructure that very rarely changed. However, in terms of today’s technology, that’s becoming increasingly untrue. With the advent of Infrastructure as Code (IaC) and container- and cloud-based infrastructure becoming the norm, the challenge of keeping monitoring up to date and cleaning up stale services and devices has become a large burden for many companies.

Now, we can easily spin up new infrastructure and — with Sensu — programmatically assign subscriptions to known checks.

Monitoring workflows

In traditional monitoring, we’d watch for an event and do an action based on that event — all of which is extremely manual and requires constant vigilance, in turn contributing to alert fatigue. Approaching monitoring as a workflow can help: with a set of building blocks in place, you can customize your own workflow and automate it. Sensu has expanded on this concept with a couple tools such as check hooks, mutators, and handlers, each of which plays a specific role:

  • Check hooks in Sensu 1.x are a great way to gain additional context into an event. By assigning a hooks key in your check JSON, you can run any arbitrary command on the remote host to gain additional information. For example, say you're using a disk_check plugin. This plugin takes two arguments warning and critical and it runs a WMI call or Linux subsystem command to give you available space on your drives. You can set up a hook to say for any non-zero or for any status code of 2 (meaning critical) run another command. e.g., du -a / | sort -n -r | head -n , 5 which finds the five largest files in a path sorted by largest. Learn more about check hooks in the Sensu docs. Bonus: Sensu Go makes check hooks even easier by making them a reusable block that can be shared between checks! 
  • Mutators are interesting, as they can actually inspect information in the event data and change the event payload based on parameters you set up. Say, in the case of some keepalive failure or disk space event, you need a different team to handle the event, and you want to change which handler receives the event. Or, say you’re using PagerDuty and you want to automatically move an event to a lower priority based on a custom status code you’ve set up in your check script/plugin. All of this becomes possible with mutators. Of course this also gives you the ability to obfuscate data in your event or change the output data block in any way — for example, you can sanitize data before sending it to an external service, or add additional context by running server-side infrastructure queries that client doesn’t need to run locally. Learn more about mutators in the Sensu docs.
  • Handlers are where workflows live for Sensu. A handler's job is to take an event (the check definition defines which handler to use) and route the payload to its final destination. I've re-written a custom handler for PagerDuty in Python where, based on a key in the check definition, I can:
    • Tell the handler which PagerDuty integration key (PagerDuty service) to route the event to.
    • Choose to not forward an event based on how many times it's occurred within a timeframe to help mitigate alert fatigue.
    • Inspect the status code and choose to send an email instead of an alert.

As you can see, the options are quite numerous. One additional thing to note: there is a concept of handler sets, where a check defines handlers in a list and they work on the event data in concert. The way it is in Sensu 1.x, they don't fire in the order they are in the list. But there is a Community Plugin that solves this: sensu-handlers allows you to define handlers which could act as remediation steps with much more power and flexibility than hooks because it has the ability to work within the bounds of the event data. Basically: if status code 3, then use handler A; if handler A is successful, change status code 0; else change status code 4, next handler. Learn more about handlers in the Sensu docs.

I hope this helped answer at least part of the question around why Sensu versus Nagios. Ultimately, it’s all about finding a monitoring solution that can keep up with modern infrastructures and provide the ability to customize your workflows. Questions or comments? Hit me up in the Sensu Community Slack: @Darth Scrumlord.

Nagios Monitoring Monitoringlove