Filters: valves for the Sensu monitoring event pipeline

Filters, the mechanism for allowing and denying events to be handled, have been given a refresh in Sensu Go. These new and improved JavaScript filters give you a way to express business logic through filtering, giving you more awareness of your environment, reducing alert fatigue, and improving performance.

In this post, I’ll share what’s new with filters, using examples in Sensu Go. (If you haven’t already, you can download it here.)

Service checks and keepalives are represented as events in Sensu installations. Events are produced for successful executions as well as for execution failures. Events contain output information, as well as history about past execution attempts.

You can associate handlers with events. Handlers are executed on the backend and given the serialized (or mutated) event as input.

---
type: Handler
api_version: core/v2
metadata:
  name: keepalive
  namespace: default
spec:
  command: /usr/local/bin/pagerduty.sh
  type: pipe

Figure 1: A keepalive handler that contacts PagerDuty

The handler in figure 1 will fire on keepalive events, presumably notifying PagerDuty that something has gone wrong. But if you’re an operator deploying this handler, you’ll soon realize something is amiss: PagerDuty is being invoked with every successful keepalive event. If you’re a Sensu 1.x user, this might catch you by surprise: events in Sensu Go are always handled by default, whereas events in Sensu 1.x are only handled on failure conditions.

When designing Sensu Go, we decided that all events should be handled. This allows users to more easily handle metrics events, and also allows for more expressive incident resolution workflows.

Clearly, not all keepalive events are worth knowing about. Most operators would probably prefer to only be notified when a keepalive failure occurs, or keepalive failures stop occurring. So what is the operator to do? Events can be filtered with EventFilters, in order to prevent handlers from running.

Why event filters?

You might ask: “Why bother with filters as a separate concept from handlers? Couldn't handlers encode the same logic as filters, electing not to continue with their task given some condition?”

The answer to this question is twofold:

  1. By using filters as a separate building block, generic handlers can be created and re-used without concern for filtering business logic.
  2. Event filters do not require starting a child process, or issuing any system calls, and so are much more performant than handlers.

The design of event filters encourages you to aggressively cull events that don't need to be handled. You can create a handful of filter objects and apply them across the entire event pipeline, for any type of event. This approach ensures that Sensu backends don't get overloaded, even in the face of thousands and thousands of simultaneous events.

For more on using filters to combat alert fatigue, check out this blog post: https://blog.sensu.io/alert-fatigue-part-2-alert-reduction-with-sensu-filters-token-substitution.

Basic event filtering

In figure 2, the first example has been modified to include the is_incident filter, which is built in to Sensu Go. Sensu 1.x users who are accustomed to the old handler behaviour can use this filter. Once applied, only keepalive failures and resolutions will be handled by the PagerDuty handler.

---
type: Handler
api_version: core/v2
metadata:
  name: keepalive
  namespace: default
spec:
  command: /usr/local/bin/pagerduty.sh
  filters:
  - is_incident
  type: pipe

Figure 2: A handler that uses a built-in event filter to only allow incidents to be handled
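To make "failures and resolutions" concrete, here is a rough JavaScript approximation of the idea behind is_incident. It is a sketch for illustration only, not Sensu's actual implementation; the function name isIncidentLike and the previousStatus parameter are hypothetical.

```javascript
// Approximates the behaviour described above: let an event through if it is
// a failure, or if it is a success that follows a failure (a resolution).
// isIncidentLike and previousStatus are hypothetical names for this sketch.
function isIncidentLike(currentStatus, previousStatus) {
  var failing = currentStatus !== 0;
  var resolved = currentStatus === 0 && previousStatus !== 0;
  return failing || resolved;
}

console.log(isIncidentLike(0, 0)); // false: healthy keepalive, nothing to report
console.log(isIncidentLike(2, 0)); // true: keepalive failure
console.log(isIncidentLike(0, 2)); // true: failure has resolved
```

With a filter like this in front of the handler, the steady stream of healthy keepalives never reaches PagerDuty.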

While this might suffice for some operators, others may wish to be more tolerant of failures, expecting that failures could be ephemeral, or remediated elsewhere. To facilitate more nuanced filtering use cases, Sensu Go provides the ability to define custom event filters.

---
type: EventFilter
api_version: core/v2
metadata:
  name: failing-filter
  namespace: default
spec:
  action: allow
  expressions:
  - event.check.status == 0
  - event.check.occurrences >= 3 && event.check.status != 0

Figure 3: an event filter that only fires if the event is a success, or if three consecutive failures have occurred

In figure 3, we’ve defined a custom event filter that will allow events that are successful check executions, or that are failures that have occurred at least three times in a row. The PagerDuty handler is modified accordingly:

---
type: Handler
api_version: core/v2
metadata:
  name: keepalive
  namespace: default
spec:
  command: /usr/local/bin/pagerduty.sh
  filters:
  - failing-filter
  type: pipe

Figure 4: a handler that uses the failing-filter event filter

Note that the filter expressions have an implicit event variable. This event object is the same one that you would see if you issued a command like sensuctl event info router-1 ping --format json.

💡 Pro tip

  • Use an existing event object as reference when forming filter query expressions.
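As a sketch of what the implicit event variable looks like to an expression, the snippet below runs the two expressions from figure 3 against hand-written stub events. The stubs carry only the two fields the expressions read (a real event has many more), and the function names are hypothetical. The expressions are combined with OR here, following this post's description of figure 3.

```javascript
// Stub events: only the fields that the figure 3 expressions read.
var okEvent = { check: { status: 0, occurrences: 1 } };
var firstFailure = { check: { status: 2, occurrences: 1 } };
var thirdFailure = { check: { status: 2, occurrences: 3 } };

// The two expressions from figure 3, wrapped as functions of `event`.
function isSuccess(event) {
  return event.check.status == 0;
}
function isRepeatedFailure(event) {
  return event.check.occurrences >= 3 && event.check.status != 0;
}

// Assumption: either expression matching lets the event through,
// as figure 3 is described in this post.
function failingFilterAllows(event) {
  return isSuccess(event) || isRepeatedFailure(event);
}

console.log(failingFilterAllows(okEvent));      // true
console.log(failingFilterAllows(firstFailure)); // false
console.log(failingFilterAllows(thirdFailure)); // true
```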

Filter actions

Sensu filters can have an action of either allow or deny. These actions specify how a true evaluation of a filter expression should be handled. With allow, if any expression evaluates to true, the handler will be executed. With deny, if any expression evaluates to true, the handler will not be executed.

Since the previous few sentences might be a little too much like Alice in Wonderland for some people, the following table illustrates the difference in another way:

Figure 5: how to think about lists of filter expressions in terms of boolean logic

Sensu filter expression language

Sensu filter expressions are ECMAScript 5 (JavaScript) expressions that return a boolean value. They are executed by the sensu-backend in a VM for every event. These script executions are reasonably performant; you can expect to be able to handle thousands per second on a modest server.

Sensu filter expressions can be any JavaScript expression — as long as the result of executing the code is a boolean value of true or false.
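The boolean requirement is easy to trip over, so here is a small local helper for experimenting with expressions before they go into a filter. It is only a sketch: evaluateExpression is a hypothetical name, it uses the host JavaScript engine rather than the sensu-backend VM, and it expects an expression without a trailing semicolon.

```javascript
// Evaluates an expression string with `event` in scope and insists on a
// boolean result. Hypothetical helper for local experiments only.
function evaluateExpression(source, event) {
  var fn = new Function("event", "return (" + source + ");");
  var result = fn(event);
  if (typeof result !== "boolean") {
    throw new TypeError(
      "expression returned " + typeof result + ", not a boolean: " + source
    );
  }
  return result;
}

var event = { check: { status: 1, occurrences: 4 } };

console.log(evaluateExpression("event.check.occurrences >= 3", event)); // true

try {
  // A bare number is not a boolean, so this one is rejected.
  evaluateExpression("event.check.occurrences", event);
} catch (err) {
  console.log(err.message);
}
```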

Here’s a somewhat contrived (and complicated!) example for the sake of demonstration:

---
type: EventFilter
api_version: core/v2
metadata:
  name: avg-filter
  namespace: default
spec:
  action: allow
  expressions:
  - |
    (function () {
      // countFailure is a callback that counts the number of checks that have
      // non-zero status.
      var countFailure = function (count, check) {
        if (check.status != 0) {
          return count + 1;
        }
        return count;
      };

      var history = event.check.history;

      // avgFailure is the percentage of failures in the event's history
      var avgFailure = history.reduce(countFailure, 0) / history.length;

      return avgFailure >= 0.5;
    }());

Figure 6: A Sensu filter that succeeds when 50% or more of events have resulted in failure

Filter assets

As you can see from the example above, inline filter expressions can quickly become unmaintainable as they grow. It's no good to have pages and pages of JavaScript embedded in a YAML or JSON document. If you find yourself in this position, we recommend you take advantage of filter assets. Assets are shareable, reusable packages that make it easy to deploy Sensu plugins (see the docs for more info).

Filter assets are similar to assets for checks or hooks, but only apply to the event filtering pipeline. The filtering subsystem scans the lib directory of the asset to see if there are any files ending with .js. These files are loaded into the script execution context.

Figure 7: the structure of a JavaScript asset tarball

function averageFailure (event, threshold) {
  // countFailure is a callback that counts the number of checks that have
  // non-zero status.
  var countFailure = function (count, check) {
    if (check.status != 0) {
      return count + 1;
    }
    return count;
  };

  var history = event.check.history;

  // avgFailure is the percentage of failures in the event's history
  var avgFailure = history.reduce(countFailure, 0) / history.length;

  return avgFailure >= threshold;
}

Figure 8: averageFailure.js

Once the asset is all packed up and hosted somewhere it can be retrieved, we can create an asset definition for it, so that it can be used in the filter.

---
type: Asset
api_version: core/v2
metadata:
  name: average-failure
  namespace: default
spec:
  sha512: e0fa3243d
  url: https://example.com/average-failure.tar

Figure 9: an example asset definition for average-failure.tar. Note that the sha512 checksum is truncated, and would normally be longer.
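If you need to produce the full checksum yourself, one option is Node's built-in crypto module. The sketch below hashes an in-memory string so that it runs anywhere; in practice you would pass in the bytes of your tarball. The helper name sha512Hex is hypothetical.

```javascript
var crypto = require("crypto");

// Returns the hex-encoded SHA-512 digest of a Buffer or string.
// sha512Hex is a hypothetical helper name for this sketch.
function sha512Hex(data) {
  return crypto.createHash("sha512").update(data).digest("hex");
}

// Placeholder input. For a real asset, read the tarball instead, e.g.
// sha512Hex(require("fs").readFileSync("average-failure.tar"))
var digest = sha512Hex("example tarball bytes");

console.log(digest.length); // 128 hex characters for a full SHA-512 digest
```

A full digest is 128 hex characters long, which is why the value in figure 9 is described as truncated.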

---
type: EventFilter
api_version: core/v2
metadata:
  name: avg-filter
  namespace: default
spec:
  action: allow
  runtime_assets:
  - average-failure
  expressions:
  - averageFailure(event, 0.5)

Figure 10: an EventFilter that uses the average-failure asset

Our filter is looking a lot less unwieldy, now that the business logic of averageFailure has been packed away in an asset! Additionally, we've made the function reusable by using a threshold parameter for failure.
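To get a feel for how a library file and a one-line expression fit together, here is a simulation using Node's vm module. This is an assumption-laden sketch of the idea of "loading files into the script execution context", not how sensu-backend is implemented; evaluateWithAsset is a hypothetical name and the asset source is a condensed copy of figure 8.

```javascript
var vm = require("vm");

// Stand-in for the contents of lib/averageFailure.js (condensed from figure 8).
var assetSource = [
  "function averageFailure(event, threshold) {",
  "  var history = event.check.history;",
  "  var failures = history.reduce(function (count, check) {",
  "    return check.status != 0 ? count + 1 : count;",
  "  }, 0);",
  "  return failures / history.length >= threshold;",
  "}"
].join("\n");

// Runs the asset source first, so its functions exist in the context,
// then evaluates the filter expression in that same context.
function evaluateWithAsset(source, expression, event) {
  var context = vm.createContext({ event: event });
  vm.runInContext(source, context);
  return vm.runInContext(expression, context);
}

var event = {
  check: { history: [{ status: 1 }, { status: 1 }, { status: 0 }] }
};

console.log(evaluateWithAsset(assetSource, "averageFailure(event, 0.5)", event)); // true
```

The expression stays a single readable line, while the logic it calls lives in a file that can be versioned and tested separately.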

Testing

"Well this is all well and good," you might say. "But how do I test all this? I keep writing filter expressions that have bugs, and it's hard to debug them in the live system."

There isn't any single best answer to this question, but there are some strategies you can adopt to ease testing and avoid bugs. What you don't want to do is write a complex filter, assume it works, and deploy it into your Sensu installation.

To test that a particular expression is working, write some test cases in JavaScript.

function testAverageFailureLessThan50 () {
  // This event is only a stub of a real event
  var event = {
    "check": {
      "history": [{"status": 0}, {"status": 1}, {"status": 0}, {"status": 1}, {"status": 0}]}};

  var result = averageFailure(event, 0.5);

  if (result) {
    console.log("averageFailure < 0.5: FAIL - failure is 40%");
  } else {
    console.log("averageFailure < 0.5: PASS");
  }
}

function testAverageFailureGreaterThan50 () {
  // This event is only a stub of a real event
  var event = {
    "check": {
      "history": [{"status": 1}, {"status": 1}, {"status": 0}, {"status": 1}, {"status": 0}]}};

  var result = averageFailure(event, 0.5);

  if (!result) {
    console.log("averageFailure > 0.5: FAIL - failure is 60%");
  } else {
    console.log("averageFailure > 0.5: PASS");
  }
}

testAverageFailureLessThan50();
testAverageFailureGreaterThan50();

Figure 11: basic tests for JavaScript filter expressions

You can run these tests in a web browser's JavaScript console (https://developers.google.com/web/tools/chrome-devtools/console/), or with the Otto VM itself. The Otto library ships with a basic REPL that is easy to install if you have the Go compiler installed. Either way, the averageFailure function from figure 8 must be defined in the same context as the tests, for example by placing it at the top of tests.js.

$ go get github.com/robertkrimen/otto/otto
$ otto tests.js

Figure 12: installing the Otto REPL and running some tests with it

There are many test frameworks available for JavaScript, but most of them depend on NPM, which is too large a topic to explore in this blog post. Ask a JavaScript developer friend for help if you want to up your testing game!
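Short of adopting a framework, a small hand-rolled helper can already make tests like these less repetitive and make failures harder to miss. The sketch below is one way to do that; check and stubEvent are hypothetical helper names, and averageFailure is copied from figure 8 so that the file runs on its own.

```javascript
// Copied from figure 8 so this file is self-contained.
function averageFailure(event, threshold) {
  var countFailure = function (count, check) {
    return check.status != 0 ? count + 1 : count;
  };
  var history = event.check.history;
  return history.reduce(countFailure, 0) / history.length >= threshold;
}

var failures = 0;

// Prints PASS or FAIL for one case and keeps a running failure count.
function check(name, actual, expected) {
  if (actual === expected) {
    console.log(name + ": PASS");
  } else {
    failures += 1;
    console.log(name + ": FAIL - expected " + expected + ", got " + actual);
  }
}

// Builds a stub event whose history has the given check statuses.
function stubEvent(statuses) {
  return {
    check: {
      history: statuses.map(function (status) {
        return { status: status };
      })
    }
  };
}

check("40% failures vs 0.5", averageFailure(stubEvent([0, 1, 0, 1, 0]), 0.5), false);
check("60% failures vs 0.5", averageFailure(stubEvent([1, 1, 0, 1, 0]), 0.5), true);
// Edge case: an empty history divides 0 by 0, giving NaN, and NaN >= 0.5 is false.
check("empty history", averageFailure(stubEvent([]), 0.5), false);

if (failures > 0) {
  throw new Error(failures + " test(s) failed");
}
```

Throwing at the end means a failing run stops with an error instead of relying on someone reading the output.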

Going forward: share your assets with the community

Sensu event filters can provide very precise control over what events get handled, but some of the more advanced filtering patterns will require you to get your hands dirty. With filter assets, that should be a bit easier! We encourage you to share useful filter assets with the community, whether that’s via our Community Forum or Bonsai, the Sensu asset index.
