Companies that care about uptime generally use PagerDuty & have an on-call rotation. Instructions for handling common incidents are maintained in internal documents called runbooks. These are written in plain English & typically live on an internal wiki. First, let’s look at some challenges with the current form of runbooks:
- You need to manually execute each step, no automation
- It’s quite an effort to get everybody on board to keep the runbooks up to date
- Unless well written, there can be ambiguity/confusion in following instructions
To tackle some of these problems, I am proposing the use of a novel data science tool, Jupyter Notebooks. Notebooks are a unique combination of markdown text, executable code, and output, all within a single document served in a browser. This combination of features fits the runbook need very well. Let’s see it with an actual example. Here’s a before & after picture of a simple GitLab runbook converted into a Jupyter Notebook (kudos to GitLab for making their runbooks public).
As you can see on the right, the investigative steps to SSH in & check whether the Kibana process is running are codified as bash scripts that run directly from the Notebook. The fix, restarting Kibana & verifying that it’s running, is also codified. Next time, the on-call can investigate & fix the issue directly from the Notebook instead of SSH’ing in manually & copy-pasting commands.
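The picture shows bash; for readers who prefer Python cells, the same investigative step could be sketched roughly as below. This is a minimal illustration, not GitLab's actual runbook code. It assumes Kibana runs as a systemd unit named `kibana`, that the notebook server can SSH to the host with key-based auth, and the helper names (`build_ssh_command`, `run_local`, `is_service_active`) are hypothetical. The command runner is passed in as a parameter so the cell can be tried without touching a real machine.

```python
import shlex
import subprocess


def build_ssh_command(host, remote_command):
    """Return the argv list for running one command on a remote host via ssh."""
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # which is the behaviour we want inside a notebook cell.
    return ["ssh", "-o", "BatchMode=yes", host, remote_command]


def run_local(argv):
    """Run argv on the notebook server and return (exit_code, stdout)."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode, result.stdout


def is_service_active(host, unit, runner=run_local):
    """Return True if the systemd unit `unit` is active on `host`.

    `systemctl is-active --quiet` exits with 0 only when the unit is active.
    """
    remote = f"systemctl is-active --quiet {shlex.quote(unit)}"
    exit_code, _ = runner(build_ssh_command(host, remote))
    return exit_code == 0
```

In a notebook, a cell would then just call something like `is_service_active("kibana.example.internal", "kibana")`, where the host name is a placeholder for your own machine.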
Even though the example only shows bash scripts, Jupyter Notebooks can run code in most major programming languages through language kernels, with first-class support for Python. So imagine that everything you do to investigate & fix an incident is available to execute at the click of a button, all from within a single browser page. No setup needed.
- Pull latency & traffic graphs into the Notebook
- Make API calls to manage any cloud resources/services
- Run SQL to find slow/long-running queries, check vacuum status etc.
- SSH onto a machine, restart processes, run any other commands & see the output
- Check the last deployment time & even roll back the changes (assuming API support)
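To make one of these concrete, here is a rough sketch of the SQL item as a notebook cell. It assumes a PostgreSQL database (its `pg_stat_activity` view) and a psycopg2-style DB-API cursor that uses `%s` placeholders. The function names and the 30-second default threshold are hypothetical choices for illustration.

```python
# Assumes PostgreSQL and a psycopg2-style DB-API cursor ("%s" placeholders).
LONG_RUNNING_SQL = """
    SELECT pid, now() - query_start AS duration, state, query
    FROM pg_stat_activity
    WHERE state <> 'idle'
      AND now() - query_start > %s * interval '1 second'
    ORDER BY duration DESC
"""


def find_long_running_queries(cursor, min_seconds=30):
    """Return (pid, duration, state, query) rows for queries older than min_seconds."""
    cursor.execute(LONG_RUNNING_SQL, (min_seconds,))
    return cursor.fetchall()


def format_report(rows):
    """Turn the rows into short, readable lines for the notebook output."""
    if not rows:
        return "no long-running queries found"
    lines = []
    for pid, duration, state, query in rows:
        # Collapse whitespace and truncate so one query stays on one line.
        short_query = " ".join(query.split())[:80]
        lines.append(f"pid={pid} duration={duration} state={state} query={short_query}")
    return "\n".join(lines)
```

With a real connection, the cell would end with something like `print(format_report(find_long_running_queries(conn.cursor())))`, where `conn` is whatever connection object your team already uses.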
Here are some benefits of maintaining your runbooks in an executable Notebook format.
- Less confusion. Code is much more deterministic than instructions written in English.
- Reduced incident time & impact. The on-call responds faster with the code required to investigate/fix an issue at her fingertips.
- Automate at your own pace. Since Notebooks support markdown, it’s possible to import existing runbooks as is & automate a few steps every sprint.
- Better collaboration. It provides a first-class platform for sharing all the tribal knowledge & local scripts that developers keep to combat an issue.
- There’s real power when we combine individual steps to build more complex logic. The following is possible today.
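As a hypothetical illustration of that kind of combined logic, the sketch below chains three individual steps: check a service, restart it if it is down, then verify that it came back. The `check` and `restart` callables stand in for whatever cells your runbook already has. The function name, retry count and wait time are assumptions, not part of Jupyter or any particular tool.

```python
import time


def ensure_running(check, restart, retries=3, wait_seconds=5, sleep=time.sleep):
    """Check a service; if it is down, restart it and verify it came back.

    check:   callable returning True when the service is healthy
    restart: callable that triggers a restart
    Returns (ok, log) where log is a list of human-readable lines.
    """
    log = []
    if check():
        log.append("service is already running, nothing to do")
        return True, log

    log.append("service is down, restarting")
    restart()

    # Give the service some time to come up, re-checking a few times.
    for attempt in range(1, retries + 1):
        sleep(wait_seconds)
        if check():
            log.append(f"service is back after verification attempt {attempt}")
            return True, log
        log.append(f"verification attempt {attempt} failed")

    log.append("service did not come back, escalate to a human")
    return False, log
```

Returning the log rather than printing it keeps the decision trail visible in the cell output, which is handy when writing the postmortem afterwards.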
The executable Notebook format is promising, but here are some challenges with current Notebook implementations.
- A typical Jupyter installation is a single-user local setup that requires the Jupyter server to be running locally. This isn’t ideal. We don’t want the on-call to deal with the runbook server when she is already stressed by the incident. We need a remote runbook setup that’s accessible from a browser & ready to use anytime.
- Google Colaboratory could have been a good choice, but it’s hosted on Google’s servers. The Notebook server needs to be self-hosted so the code has access to all the infrastructure within our VPC.
- Any infrastructure code will require credentials, SSH keys etc. We need a way to share them safely & not just stick them everywhere in the Notebook code snippets.
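One common way to keep secrets out of Notebook cells, sketched here under assumptions, is to have the notebook server expose them as environment variables and look them up by name at run time. This is only an illustration of the idea, not a description of how any particular platform handles credentials. The helper names and the `DB_PASSWORD` variable are hypothetical.

```python
import os


class MissingCredentialError(RuntimeError):
    """Raised when a credential is not configured on the notebook server."""


def get_credential(name, environ=os.environ):
    """Look up a secret by name instead of pasting its value into a cell."""
    value = environ.get(name)
    if not value:
        raise MissingCredentialError(
            f"credential {name!r} is not set on the notebook server"
        )
    return value


def mask(secret, visible=4):
    """Return a display-safe version of a secret for notebook output."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

A cell would then call `get_credential("DB_PASSWORD")` and pass the result straight to the client library, so the saved Notebook only ever contains the variable name.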
I’m building Nurtch, a platform that tackles these challenges & provides an easy way to maintain executable runbooks within a team. The docs provide a complete overview of Nurtch’s capabilities & how-to’s. Let me know what you think of this approach to managing runbooks.