Uptime Conference, which I attended this fall, bills itself as a conference on “ops and systems programming,” and it was heavily focused on operations and DevOps. People whose role was “Developer” were a definite minority, crowded out by titles like “DevOps Lead” and “Operations Engineer.” This is a crowd that understands the importance of practices like post-incident reviews, logging, and telemetry. In this post I’ll pull out some concrete suggestions from what I learned, both from the talks and from asking operations folks, “What do you wish developers knew?”
First, about post-incident reviews: construct timelines. This suggestion comes from Jason Hand at VictorOps. Timelines provide a different way of understanding what happened. For example, we might look at a timeline and see that a support person who noticed a problem reported it by sending Slack messages to a developer they knew. We can then start to ask interesting questions such as: “Was that the right person to notify about the incident?” or “How should people decide which engineer to contact when they have a problem?”
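To make the idea concrete, here is a minimal sketch of what a reconstructed timeline might look like as data. The people, times, and events are entirely made up for illustration; the point is that each entry records who acted, when, and what they knew or did.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TimelineEntry:
    at: datetime   # when it happened
    who: str       # the person or system that acted
    what: str      # what they observed or did

# A hypothetical incident timeline, reconstructed after the fact.
timeline = [
    TimelineEntry(datetime(2017, 11, 3, 9, 12), "support (Dana)",
                  "Customer reports checkout errors; Dana pings a developer she knows on Slack"),
    TimelineEntry(datetime(2017, 11, 3, 9, 40), "developer (Sam)",
                  "Sees the Slack message, starts looking at application logs"),
    TimelineEntry(datetime(2017, 11, 3, 10, 5), "developer (Sam)",
                  "Identifies a failing payment integration and pages the on-call engineer"),
]

# Walking the entries in order makes gaps and odd handoffs easy to spot.
for entry in sorted(timeline, key=lambda e: e.at):
    print(f"{entry.at:%H:%M}  {entry.who}: {entry.what}")
```

Even a plain spreadsheet works fine for this; the value is in laying the decisions out in order, not in the tooling.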
Second, also on post-incident reviews, J. Paul Reed talked about understanding failure. The big takeaway here was: “Treat near-miss successes as failures.” For example, if you have a manual testing step that you run after all of your automated tests, and that manual step catches something critical, treat it as a near miss: debrief it and figure out how to keep that kind of problem from slipping past your automation again.
Beware of Bias
Third: beware of biases, especially hindsight bias. Hindsight bias is our tendency to believe, after an event has occurred, that it was predictable. It is particularly insidious in post-incident reviews because it leads to questions like “Why didn’t you think of that?”, which assign blame and don’t help. I think making timelines of who did what, and when, can go a long way toward combating these biases, because a timeline lets us see the event unfold as a series of concrete decisions that particular people made with the information they had at the time, rather than as an aggregate failure by the team.
Consider the Long Term
Fourth, a great quote from Bridget Kromhout: “Day 1 is short – day 2 is forever.” In other words, getting an app deployed is not even half the battle. Keeping it running day after day is a lot of work too, and having good logging and monitoring can go a long way with that. There were two concrete pieces of advice here:
- Think about different failure modes. (For example, if you depend on an external integration, do you know what your app does if that service is down? What about if it’s available but unacceptably slow?)
- Strive for observability and debuggability. When the application is in production, will we be able to answer questions we don’t know to ask now?
These two pieces of advice make “day 2” (i.e., “the whole time the app is in production and maintained”) easier and more predictable. Some teams are surprised when a hard drive fails or when the network is slow or when a third party changes their API, but on a long enough timescale, these events are inevitable. Thinking about the failure modes ahead of time helps you know what to do when you need to recover.
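Here is a minimal sketch of what “thinking about failure modes ahead of time” can look like in code, for the external-integration example above. The endpoint name and fallback behavior are assumptions made up for illustration; the technique is simply to give the dependency a strict timeout and decide in advance what the app does when it is down or slow.

```python
import requests

# Hypothetical external dependency used only for this example.
RECOMMENDATIONS_URL = "https://recs.example.com/api/v1/recommendations"

def fetch_recommendations(user_id):
    """Fetch recommendations, deciding up front what 'down' and 'slow' mean for us."""
    try:
        # A strict timeout turns "unacceptably slow" into an explicit failure mode
        # instead of a request that hangs indefinitely.
        response = requests.get(RECOMMENDATIONS_URL,
                                params={"user": user_id},
                                timeout=2)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.Timeout:
        # Slow dependency: fall back to an empty (or cached) result rather than blocking the page.
        return []
    except requests.exceptions.RequestException:
        # Down or erroring dependency: degrade gracefully instead of failing the whole request.
        return []
```

Whether an empty list, a cached copy, or a hard error is the right fallback depends on the feature; the important part is that someone chose the behavior on purpose before the outage, not during it.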
The next piece of advice, to strive for observability and debuggability, is equally important. If you have good logging and metrics, then when some unanticipated error starts happening in production, you will (1) notice before the support tickets pour in and (2) be able to remediate the issue quickly. It’s frustrating to have a bug open for hours or days while you write more logging code just to figure out what the bug even is. Good visibility into your app lets you answer questions about the production system that you didn’t think of during development.
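As a small sketch of that idea, the example below logs the identifiers, amounts, and durations around a risky operation so those questions can be answered from existing logs. The checkout and payment functions are stand-ins invented for this example; only the standard-library `logging` and `time` modules are real.

```python
import logging
import random
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s %(message)s")
logger = logging.getLogger("checkout")

def charge_card(order_id, amount_cents):
    """Stand-in for a real payment call; fails randomly so the logs have something to show."""
    if random.random() < 0.2:
        raise RuntimeError("gateway returned 503")
    return {"order_id": order_id, "status": "charged"}

def checkout(order_id, amount_cents):
    start = time.monotonic()
    try:
        result = charge_card(order_id, amount_cents)
        # Log enough context (IDs, amounts, durations) to answer questions later
        # without having to ship new logging code first.
        logger.info("charge ok order_id=%s amount_cents=%d duration_ms=%.0f",
                    order_id, amount_cents, (time.monotonic() - start) * 1000)
        return result
    except Exception:
        logger.exception("charge failed order_id=%s amount_cents=%d duration_ms=%.0f",
                         order_id, amount_cents, (time.monotonic() - start) * 1000)
        raise
```

With logs shaped like this, questions such as “are charges getting slower?” or “which orders failed last night?” can be answered by searching what is already there.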
I think the advice I received at this conference was quite valuable, and I know I’ll be putting it into practice on the teams I work with.