Community roundup: making the switch from Nagios to Sensu

Wherever you’re at in your monitoring journey, you’ve probably used Nagios at one time or another. Love it or hate it, a legacy tool like Nagios played a critical role in establishing monitoring as a practice and helped train a generation of operators who required visibility into system dependencies and performance.


But Nagios preceded modern infrastructure. While it may work well for legacy setups, it can’t keep up with the demands of a cloud- and container-based world, where ephemeral infrastructure is the new normal.

Sensu has become the de facto Nagios alternative in large part because of our support for the Nagios service check from the outset. Plus, our monitoring event pipeline is flexible, scalable, and easy to share, and can be easily integrated into existing workflows without sacrificing speed, reliability, or security.

For this blog post, I’ve collected stories from Sensu Community members who’ve made the switch from Nagios to Sensu, with details about the challenges they needed to solve, the benefits they’ve gained, and why they’ll never look back.

Shifts in technology necessitate next-gen monitoring

With better technology comes greater organizational complexity. Monitoring teams are now largely responsible for infrastructure and applications distributed across bare metal, public and private clouds, and virtualized environments. Greater complexity in turn drives exponential growth in system events, signals, and errors.

And with DevOps practices becoming more mainstream and IoT devices on the rise, it’s no wonder that operators are looking for solutions that are flexible and future-proof.

The key lies in finding a monitoring solution that can be automatically configured to support the inherently ephemeral nature of cloud- and container-based environments. Nagios, for example, is not designed to adapt to change in cloud environments. Sensu, on the other hand, is built to automatically register new instances when they’re added and deregister them when they’re removed. As this TechTarget article puts it, “While the cloud’s changing environment overwhelms Nagios, Sensu is simpler and more scalable.”
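To make the register/deregister idea concrete, here is a minimal sketch of one common way to get that behavior: instances send periodic heartbeats, the first heartbeat registers them, and instances that go quiet are dropped. This is a toy illustration, not Sensu's implementation; the `KeepaliveRegistry` class, its method names, and the 120-second timeout are all assumptions made for the example.

```python
class KeepaliveRegistry:
    """Toy heartbeat-based registry (hypothetical, for illustration only)."""

    def __init__(self, timeout_seconds=120.0):
        # Assumed timeout: how long an instance may stay silent.
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}

    def keepalive(self, name, now):
        """Record a heartbeat; return True if it registered a new instance."""
        is_new = name not in self._last_seen
        self._last_seen[name] = now
        return is_new

    def reap(self, now):
        """Deregister instances whose last heartbeat is older than the timeout."""
        expired = sorted(
            name
            for name, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )
        for name in expired:
            del self._last_seen[name]
        return expired

    def instances(self):
        """Return the names of all currently registered instances."""
        return sorted(self._last_seen)
```

Nothing has to be added to a central configuration file when a new instance boots: its first call to `keepalive` is the registration, and a periodic call to `reap` cleans up instances that have been terminated.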

We help you take it a step further by applying workflow automation principles to monitoring — so you can connect monitoring data to the systems and tools you already use, and ultimately spot (and fix) problems before they impact your users.

The challenges of Nagios

When we ask folks what they look for in a Nagios alternative, their answers tend to start with a pain point. As Pete Cheslock writes in his Happy Birthday Sensu post, “I was laying down my normal daily diatribe about my hatred for all things Nagios and posed a simple question to the team, ‘what should we do now?’ It was painfully obvious that Nagios was not right for our infrastructure.”

Andy Sykes, SRE at Google, says, “You can’t scale [Nagios]. Every check you add adds load to your Nagios master… The more checks you have, the [worse] the scheduling gets.” (Editor’s note: the actual quote is a bit NSFW — in fact a testament to how some folks feel about Nagios — so we opted to swap for “worse” here.)

For Jason Anderson, Sr. Insights Engineer at Willis Towers Watson, “It’s all about finding a monitoring solution that can keep up with modern infrastructures and provide the ability to customize your workflows.”

Andy Repton, engineer at Schuberg Philis, says, “As we started looking into more dynamic infrastructure (where our systems may only exist for days or even hours before being replaced), the amount of time required to regenerate and manage the configuration [with Nagios] became challenging.”

Drew Rogers of Chariot Solutions notes that the flexibility Nagios provided was awesome (especially as compared to its more expensive competitors), but it started to give him issues once he migrated his infrastructure to Chef (Trent Baker at Box.com also ran into issues using Nagios with Puppet, their configuration management tool). Drew tried out Sensu as part of his mission to find a Nagios replacement and, “After just two days of testing, I was ready to Old Yeller Nagios.”

In each of these cases, the teams came up against limitations of Nagios. Their infrastructures and systems had evolved, and their monitoring solution needed to evolve too. They started evaluating new monitoring solutions — and ultimately chose Sensu — to achieve full-stack monitoring that could support both ephemeral and multi-generational infrastructure.

In hearing from the Sensu Community and industry veterans, the reasons for choosing Sensu tend to coalesce around the following themes:

  • Scalability
  • Reliability
  • Ease of use
  • Shifts in team(s) and culture

Let’s take a closer look at each.

Why Sensu

Scalability

A common refrain that we hear about Nagios is that it just doesn’t scale. This pain point can feel particularly, well, painful in environments that are highly customized.

At Box.com, Sr. Infrastructure SRE Trent Baker is responsible for an infrastructure that had reached 16,000+ globally distributed compute nodes with 350,000 Nagios objects — and because their Nagios installation was tightly coupled with Puppet, each change made to the environment also had to go through Puppet.

This customized setup worked well for Box.com in its early years, but Trent realized it was reaching its limits. It took hours for changes to propagate through the infrastructure; the Nagios leader was a huge bottleneck and a single point of failure.

Trent chose Sensu because it would be much easier to deploy, scale, and maintain. His team migrated approximately 1,250 Nagios checks running over 16,000 hosts for their production environment, and another 3,000 servers for dev and staging. They deployed Sensu to five different datacenters, with an architecture that lets them grow horizontally with ease (see their architecture diagram for the full picture).

Andy Repton at Schuberg Philis faced similar challenges when his team introduced Kubernetes in production. He explains, “As we expanded into containerised workloads we needed a powerful, scalable monitoring solution that would not only support this new world but our existing virtual and physical infrastructure as well.”

Their monitoring solution needs to process around 65,000 events per day, and over 5,000 individual checks — which proved too much for Nagios to handle.

Andy’s team chose Sensu as the monitoring solution for all, and worked through some interesting challenges (see their installation process in this post). He’s pleased to report that Sensu has scaled to meet the needs of both internal teams and external customers, while “also helping us meet our 100% uptime guarantee.”

Scalability was also a concern for Kale Stedman of Activision. Their platform supports millions of gamers distributed across thousands of nodes. According to Kale, they had over 100,000 services across five different Nagios instances; the pains around maintenance caused team morale to drop, so they sought a replacement. As Kale notes, “We took a look at a ton of different offerings in the space and to be honest, there was only one answer: Sensu. It supports Nagios checks out of the box, it’s incredibly extensible, and it’s designed from the ground up for scalability.”

Reliability

Uptime, of course, goes hand in hand with how reliable your systems are. You want systems that do what they’re supposed to do, every time — and work even when you’re not around.

Reliability is a top concern at GoDaddy, where a central monitoring team used to be responsible for the full stack — installing agents, configuring checks, responding to internal customers who needed monitoring for their services. And those customers were waiting … “forever,” says SRE Michael McLane, formerly of GoDaddy. Plus, the monitoring team was overwhelmed with tasks when something went wrong.

It was a bad experience for everyone, Michael says. They knew they needed a change, and decided to introduce self-service monitoring via Sensu.

Now, GoDaddy uses Sensu to perform organization-wide health checks (“It’s just there; it’s just on. No team needs to request it.”), and team checks, where teams use their own knowledge about what they need to create their own Sensu checks.

They’re able to maintain reliable systems and their internal customers have greater confidence in what’s working — and what needs to be fixed.

Ease of use

Even if you realize it’s time to ditch Nagios in favor of better scalability, reliability, and security, sometimes the sheer effort involved in replacing existing toolsets can feel like a major barrier to entry.

You can’t simply turn off the lights on your existing infrastructure to do a rip and replace.

In fact, many of our customers gravitate to Sensu because you can get up and running quickly with little to no interruption to existing services. As noted earlier, Sensu supports the Nagios plugin specifications, so you can leverage the service checks you’re already using. Drew Rogers also notes the familiarity of Sensu plugins to folks who are used to creating their own Nagios plugins: “In fact, your Nagios scripts and plugins will work with Sensu without any changes!”
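For readers who haven't written a plugin before, the Nagios plugin convention is small: the check prints one line of status text (optionally followed by `|` and performance data) and signals its result through the exit code, where 0 is OK, 1 is WARNING, 2 is CRITICAL, and 3 is UNKNOWN. The sketch below shows a disk check written that way. The function names and the 80/90 percent thresholds are our own assumptions for illustration, not part of any shipped plugin.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_disk_usage(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to an (exit_code, output_line) pair.

    The warn/crit defaults are assumed values chosen for this example.
    """
    if not 0.0 <= percent_used <= 100.0:
        return UNKNOWN, f"DISK UNKNOWN - invalid reading: {percent_used}"
    if percent_used >= crit:
        status = CRITICAL
    elif percent_used >= warn:
        status = WARNING
    else:
        status = OK
    # Text after the pipe is optional performance data: label=value;warn;crit
    line = (
        f"DISK {STATUS_NAMES[status]} - {percent_used:.1f}% used"
        f" | used_pct={percent_used:.1f}%;{warn:.0f};{crit:.0f}"
    )
    return status, line


def check_path(path="/"):
    """Measure real disk usage for `path` and evaluate it."""
    usage = shutil.disk_usage(path)
    return evaluate_disk_usage(100.0 * usage.used / usage.total)


def main():
    """Print the status line and return the exit code for the caller."""
    status, line = check_path()
    print(line)
    return status
```

To run this as a standalone check, end the script with `sys.exit(main())` (after `import sys`) so the status reaches the scheduler as the process exit code. Because the contract is only "one line of output plus an exit code," the same script can be scheduled by either tool.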

David Schroeder, Cloud Engineer at Viasat, took a wise approach when upgrading to Sensu: he set up a fresh Sensu cluster that reached parity with his existing solution and discovered that existing service checks — including some custom 500-line Perl scripts — could be moved over verbatim.

As with any migration to a new tool, moving to Sensu isn’t without its challenges, but a lot of the heavy lifting is taken care of; not only does Sensu support the plugins you’re used to, it also integrates with tools you already use, like InfluxDB, Prometheus, Elasticsearch, Grafana, and more.

Shifts in team(s) and culture

Another considerable benefit of migrating to Sensu? Happier, better-functioning teams.

At GoDaddy, Michael McLane says forging the self-service route with Sensu initially raised some eyebrows. Some teams didn’t love “having to do the work themselves.” But now, “teams have embraced monitoring as a piece of their destiny.” They’ve taken ownership of their services, and it’s created an environment of increased positivity and trust.

Jason Anderson notes that Sensu’s pubsub (“publish-subscribe”) model gives his team the flexibility to customize alerts so the right people can pay attention to the information they need. As Jason says, “With Nagios it [was] very manual, whereas Sensu is naturally set up to easily customize alerts.”
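If the publish-subscribe idea is new to you, the following sketch shows the general pattern: subscribers declare the topics they care about, and anything published to a topic reaches every subscriber of that topic without the publisher naming them individually. This is a toy in-memory model, not Sensu code; `SubscriptionRouter` and the host, topic, and check names are hypothetical.

```python
from collections import defaultdict


class SubscriptionRouter:
    """Toy in-memory publish-subscribe router (illustrative only)."""

    def __init__(self):
        # Maps each topic name to the set of subscriber names.
        self._subscribers = defaultdict(set)

    def subscribe(self, name, topics):
        """Register `name` as a subscriber of every topic in `topics`."""
        for topic in topics:
            self._subscribers[topic].add(name)

    def unsubscribe(self, name):
        """Remove `name` from every topic."""
        for members in self._subscribers.values():
            members.discard(name)

    def publish(self, check, topics):
        """Return sorted (subscriber, check) pairs for the given topics."""
        recipients = set()
        for topic in topics:
            recipients |= self._subscribers.get(topic, set())
        return [(name, check) for name in sorted(recipients)]


router = SubscriptionRouter()
router.subscribe("web-01", ["linux", "webserver"])
router.subscribe("db-01", ["linux", "postgres"])

# Only the web server receives the HTTP check; both hosts receive the disk check.
http_targets = router.publish("check-http", ["webserver"])
disk_targets = router.publish("check-disk", ["linux"])
```

The design choice worth noticing is that a new host only has to declare its own subscriptions; nobody edits a per-host list on the publishing side.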

Switching to Sensu may also help clarify appropriate workflows and metrics. Andy Repton says,

“A large challenge [for us] was identifying the right metrics to check. Some of our teams didn’t care about CPU usage or memory alerts… To make the solution flexible we use a Kubernetes config map to allow each team to determine the checks they would like enabled.”

Improving workflows can impact a team’s ownership and efficiency, as well as their job satisfaction. Reducing alert fatigue means less burnout and better employee retention.

Better for operators, better for business

Of course, we love hearing success stories from customers who have migrated from Nagios to Sensu, but it’s even more rewarding to hear about improved knowledge sharing, reductions in alert fatigue, and improvements in operator happiness.

We’re delighted to partner with our customers on this journey, to see operators get excited about their work, and to hear about team and organizational victories big and small. We invite you to join our Community Slack and Discourse to chat with like-minded folks in the ops and monitoring worlds, ask questions, and share knowledge. And, if you have a story to tell, please drop us a line at community@sensu.io.
