Book Review: Site Reliability Engineering

A dashboard to monitor the health of a service. Photo by Luke Chesser on Unsplash

I’ve just finished reading the 2016 book Site Reliability Engineering: How Google Runs Production Systems. I’ve spent 20 years as a developer and have long appreciated the DevOps philosophy, but this is the first “Ops-heavy” book I’ve read. DevOps has always suggested introducing Ops into the development cycle rather than only at deployment time. This book showed me what happens when that Ops focus shifts left.

The book starts with the reminder that a typical application spends 20% of its life in development and 80% in maintenance. The call to action is to act like it: focus on maintenance. Google’s answer is to view maintenance through a developer’s lens.

The central insight is that manual intervention scales linearly, and linear does not scale well. Production is full of potential exponentials: the number of services, the volume of traffic, or resource consumption. Demand can easily double; at Google scale it might spike 1000x or more, temporarily or permanently. Applications and processes cannot rely on manual intervention to keep them running, because you can’t hire enough engineers to keep up with an exponential process.

Are you scaling linearly or exponentially? Photo by Chris Barbalis on Unsplash
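
To make the arithmetic concrete, here is a toy sketch in Python (the numbers are invented for illustration and are not from the book): if every unit of traffic carries a fixed amount of manual toil, the headcount needed to absorb that toil grows right along with demand.

```python
# Toy illustration: toil that scales with traffic requires headcount that
# scales with traffic. All constants below are made-up assumptions.
HOURS_PER_ENGINEER_PER_MONTH = 160      # rough full-time capacity
TOIL_MINUTES_PER_1000_REQUESTS = 5      # assumed manual effort per unit of traffic

def engineers_needed(requests_per_month: int) -> float:
    """Engineers consumed purely by manual toil at a given traffic level."""
    toil_hours = requests_per_month / 1000 * TOIL_MINUTES_PER_1000_REQUESTS / 60
    return toil_hours / HOURS_PER_ENGINEER_PER_MONTH

for traffic in (1_000_000, 10_000_000, 1_000_000_000):
    print(f"{traffic:>13,} req/month -> {engineers_needed(traffic):6.1f} engineers of pure toil")
```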

Most of the book describes how to use automation to keep team size stable despite increasing demand. The core method is to make sure SREs spend 50% of their time tactically and 50% strategically. Tactical time is what I traditionally thought of as Ops: working tickets, administering servers, being on-call, restoring service during an outage, and so on. Strategic time is spent finding root causes and preventing outages. In tactical time an SRE may resolve ten nearly identical tickets; in strategic time they find the common root cause, fix it, and review how that root problem was introduced (using Five Whys, a postmortem, etc.).

This 50/50 time split is a valuable insight! I’ve made a point of carving out time for strategic work, and I encourage my mentees to do the same. It’s encouraging that Google explicitly measures and requires this split: they monitor each SRE’s time allocation and have processes to increase or decrease the strategic share. Increasing strategic time reduces burnout; decreasing it keeps SREs from getting rusty at handling incidents on-call. Practice makes perfect, and Google wants its SREs to have enough on-call “practice” to stay sharp without burning out from too much of it.
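
As a thought experiment, a team could audit its own split from a simple time log. The sketch below is my own illustration, not Google’s tooling; the activity categories, the log format, and the 50% check are all assumptions.

```python
# Hypothetical audit of the tactical/strategic split from a (activity, hours) log.
from collections import Counter

TACTICAL = {"tickets", "on-call", "outage response", "server admin"}
STRATEGIC = {"automation", "postmortem follow-up", "capacity planning"}

def strategic_fraction(time_log: list[tuple[str, float]]) -> float:
    """Fraction of logged hours spent on strategic work."""
    totals = Counter()
    for activity, hours in time_log:
        bucket = "strategic" if activity in STRATEGIC else "tactical"
        totals[bucket] += hours
    return totals["strategic"] / sum(totals.values())

week = [("tickets", 12), ("on-call", 8), ("automation", 15), ("postmortem follow-up", 5)]
print(f"strategic share: {strategic_fraction(week):.0%}")  # flag the team if this drifts far from 50%
```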

Further, the Google process is not just focused on SLOs and SLAs; it also tracks the inverse as an error budget. A service with a 99.5% uptime SLA has 3.6 hours of error budget per month (0.5% of a 720-hour month). Error budgets are used in two interesting ways (sketched in code after the list):

  • First, if a service is exceeding its error budget, that service is not allowed to push new features until its reliability is back under control. The error budget is a regulating mechanism for the service’s development.
  • Second, if a service routinely does not consume its error budget, Google will intentionally induce an outage so that service consumers don’t come to expect availability far beyond the SLA.
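
The arithmetic behind the budget is simple enough to sketch in a few lines. This is my own illustration of the idea; the function name and the feature-freeze check are assumptions, not code from the book.

```python
# Minimal sketch of the error-budget arithmetic for an availability SLA.
def monthly_error_budget_hours(sla: float, hours_in_month: float = 720) -> float:
    """Allowed downtime per month for a given availability SLA."""
    return (1 - sla) * hours_in_month

budget = monthly_error_budget_hours(0.995)        # 99.5% SLA
print(f"error budget: {budget:.1f} hours/month")  # -> 3.6

downtime_so_far = 2.5  # hypothetical hours of outage this month
if downtime_so_far > budget:
    print("Budget exhausted: freeze feature launches, focus on reliability.")
else:
    print(f"{budget - downtime_so_far:.1f} hours of budget remaining.")
```
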
Use a budget for errors, just like you budget for money! Photo by Sharon McCutcheon on Unsplash

I also appreciated the clarity of Google’s monitoring philosophy, which allows only three outputs:

  • Alerts – A human must act immediately
  • Tickets – A human must act within a few days
  • Logging – For historical record keeping only

If an alert cannot be acted on, it should not be an alert! For example, systems that regularly email their status are using an alerting mechanism for something that should be a log. A human receiving routine emails from a system will eventually just delete them all rather than keeping a watchful eye on their inbox. Instead, automation should ensure that humans are alerted only when they need to act.
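
The three outputs can be read as a routing rule: every monitoring signal goes to exactly one destination, based on whether a human must act and how soon. The sketch below is my interpretation; the signal names and routing logic are invented for illustration.

```python
# Route each monitoring signal to exactly one of the three outputs.
from enum import Enum

class Output(Enum):
    ALERT = "page a human now"
    TICKET = "a human acts within days"
    LOG = "historical record only"

def route(signal: str, actionable_now: bool, actionable_soon: bool) -> Output:
    if actionable_now:
        return Output.ALERT
    if actionable_soon:
        return Output.TICKET
    return Output.LOG  # e.g. the periodic "status OK" email belongs here

print(route("error rate above SLO", actionable_now=True, actionable_soon=False))
print(route("disk 80% full", actionable_now=False, actionable_soon=True))
print(route("nightly batch finished", actionable_now=False, actionable_soon=False))
```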

The last major insight into scaling exponentially with a linear number of people is standardization. The Google process does not always rely on mandates; it provides easy defaults. For instance, services should expose metrics for white-box monitoring. The convention at Google is to serve them from a /varz endpoint, and a standard library provides methods that expose metrics in a standard format (at /varz, of course). The Google “Pre-Release Checklist” is not just a list of rules but a list of standard, concrete suggestions for implementing those rules.
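
As a rough idea of what an easy default can look like, here is a minimal /varz-style white-box metrics endpoint using only the Python standard library. The JSON payload, metric names, and handler are my assumptions; Google’s internal monitoring library is not public in this form.

```python
# Minimal /varz-style metrics endpoint: the service exposes its own counters.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS = {"requests_total": 0, "errors_total": 0}

class VarzHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/varz":
            body = json.dumps(METRICS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            METRICS["requests_total"] += 1  # stand-in for real application traffic
            self.send_response(200)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), VarzHandler).serve_forever()
```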

Hope is not a strategy. This book lays out very concrete strategies for achieving highly reliable software. I highly recommend it!

Use site reliability engineering to plan for success, and breathe easily! Photo by Jared Rice on Unsplash
