Companies that care about uptime generally use a paging service such as PagerDuty and have an on-call rotation. Instructions for handling common incidents are kept in internal documents called runbooks. These are written in plain English and typically live on an internal wiki. First, let’s look at some challenges with the current form of runbooks:
- You have to execute every step manually; nothing is automated
- It takes real effort to get everybody on board with keeping the runbooks up to date
- Unless they are well written, the instructions can be ambiguous or confusing to follow
To tackle some of these problems, I am proposing a new use for a data science tool: Jupyter Notebooks. A notebook combines markdown text, executable code, and the code’s output in a single document served in a browser. This combination fits the runbook use case very well. Let’s look at an actual example. Here’s a before-and-after picture of a simple GitLab runbook converted into a Jupyter Notebook (kudos to GitLab for making their runbooks public).
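To make the idea concrete before the GitLab comparison, here is a minimal sketch of what one runbook step might look like as a notebook code cell. It is not taken from the GitLab runbook: the disk-usage check, the function name `check_disk_usage`, and the 90% threshold are all assumptions made for illustration. In a notebook, a markdown cell above it would explain when to run the step, and the printed result would appear directly below the cell.

```python
import shutil

# Hypothetical threshold for this sketch; a real runbook would state its own.
DISK_ALERT_PCT = 90.0


def check_disk_usage(path="/", threshold_pct=DISK_ALERT_PCT):
    """Return disk usage for `path` and whether it crosses the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    # Guard against filesystems that report a total size of zero.
    used_pct = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {
        "path": path,
        "used_pct": round(used_pct, 1),
        "alert": used_pct >= threshold_pct,
    }


# Running the cell executes the step and shows the result inline.
result = check_disk_usage("/")
print(f"{result['path']}: {result['used_pct']}% used, alert={result['alert']}")
```

The point of the sketch is that the instruction ("check whether the disk is nearly full") and the action that carries it out sit in the same document, so the on-call engineer runs the cell instead of retyping a command from a wiki page.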