{"id":473,"date":"2021-11-05T16:12:28","date_gmt":"2021-11-05T20:12:28","guid":{"rendered":"https:\/\/freedville.com\/blog\/?p=473"},"modified":"2021-11-05T16:14:14","modified_gmt":"2021-11-05T20:14:14","slug":"book-review-site-reliability-engineering","status":"publish","type":"post","link":"https:\/\/freedville.com\/blog\/2021\/11\/05\/book-review-site-reliability-engineering\/","title":{"rendered":"Book Review: Site Reliability Engineering"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" loading=\"lazy\" width=\"700\" height=\"467\" src=\"https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/photo-1551288049-bebda4e38f71-700x467.jpeg\" alt=\"Dashboard showing application health trends.\" class=\"wp-image-474\" srcset=\"https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/photo-1551288049-bebda4e38f71-700x467.jpeg 700w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/photo-1551288049-bebda4e38f71-300x200.jpeg 300w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/photo-1551288049-bebda4e38f71-768x512.jpeg 768w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/photo-1551288049-bebda4e38f71-1536x1024.jpeg 1536w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/photo-1551288049-bebda4e38f71-2048x1365.jpeg 2048w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/photo-1551288049-bebda4e38f71-175x117.jpeg 175w\" sizes=\"(max-width: 700px) 100vw, 700px\" \/><figcaption>A dashboard to monitor the health of a service. 
Photo by <a href=\"https:\/\/unsplash.com\/@lukechesser?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Luke Chesser<\/a> on <a href=\"https:\/\/unsplash.com\/s\/photos\/dashboard?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Unsplash<\/a><\/figcaption><\/figure>\n\n\n\n<p><strong>Book Review: Site Reliability Engineering<\/strong><\/p>\n\n\n\n<p>I\u2019ve just finished reading the 2016 book <a href=\"https:\/\/learning.oreilly.com\/library\/view\/site-reliability-engineering\/9781491929117\/\">Site Reliability Engineering: How Google Runs Production Systems<\/a>.\u00a0 I\u2019ve spent 20 years as a developer and I\u2019ve appreciated the DevOps philosophy, but this is the first \u201cOps-heavy\u201d book I\u2019ve read.\u00a0 DevOps has always suggested introducing Ops into the development cycle rather than only at deployment time.\u00a0 This book told me what\u2019s going to happen when that Ops focus shifts left.<\/p>\n\n\n\n<p>The book starts with the reminder that a typical application spends 20% of its life in development and 80% of its life in maintenance.\u00a0 The call to action is to <strong>act like it &#8211; focus on maintenance<\/strong>.\u00a0 Google proclaims that they view maintenance through a developer\u2019s lens.<\/p>\n\n\n\n<p>The central insight is that <strong>manual intervention is linear and does not scale well<\/strong>.\u00a0 There are plenty of possible exponentials in production: number of services, volume of traffic, or resource consumption.\u00a0 Demand can easily double; at Google scale demand might spike 1000x or more, temporarily or permanently.\u00a0 Applications and processes cannot rely on manual intervention to keep them running.\u00a0 You can\u2019t hire enough engineers to manually intervene on an exponential process.\u00a0<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" loading=\"lazy\" width=\"700\" height=\"467\" 
src=\"https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/photo-1534351372027-a01e948b350d-700x467.jpeg\" alt=\"A crowd of people crossing a street in multiple directions\" class=\"wp-image-475\" srcset=\"https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/photo-1534351372027-a01e948b350d-700x467.jpeg 700w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/photo-1534351372027-a01e948b350d-300x200.jpeg 300w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/photo-1534351372027-a01e948b350d-768x512.jpeg 768w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/photo-1534351372027-a01e948b350d-1536x1024.jpeg 1536w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/photo-1534351372027-a01e948b350d-2048x1365.jpeg 2048w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/photo-1534351372027-a01e948b350d-175x117.jpeg 175w\" sizes=\"(max-width: 700px) 100vw, 700px\" \/><figcaption>Are you scaling linearly or exponentially? 
Photo by <a href=\"https:\/\/unsplash.com\/@cbarbalis?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Chris Barbalis<\/a> on <a href=\"https:\/\/unsplash.com\/s\/photos\/crowd-of-people?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Unsplash<\/a><\/figcaption><\/figure>\n\n\n\n<p>Most of the book describes how to use automation to keep team size stable despite increased demand.&nbsp; The core method is to make sure SREs <strong>spend 50% of their time tactically and 50% strategically<\/strong>.&nbsp; Tactical time is what I traditionally thought of as Ops: working tickets, administering servers, being on-call, restoring service during an outage, etc.&nbsp; Strategic time is finding root causes and preventing outages.&nbsp; In tactical time an SRE may resolve ten nearly identical tickets; in strategic time they will find the common root cause and resolve it, as well as review how that root problem was introduced (using Five Whys, postmortems, etc.).<\/p>\n\n\n\n<p>This 50\/50 time split is a valuable insight!\u00a0 I\u2019ve made a point to carve out time for strategic work and I encourage my mentees to do the same.\u00a0 It\u2019s encouraging that Google seems to <em>explicitly<\/em> measure and require this split. \u00a0They monitor each developer\u2019s time allocation and have processes to increase or decrease the strategic time.\u00a0 Increasing the strategic time reduces burnout.\u00a0 Decreasing the strategic time keeps SREs from getting rusty at handling incidents while on-call.  Practice makes perfect.  
Google wants their SREs to have enough on-call &#8220;practice&#8221; that they stay sharp, without getting burned out from too much.<\/p>\n\n\n\n<p>Further, the Google process is not just heavily focused on SLO\/SLA but tracks the inverse as an <strong>error budget<\/strong>.&nbsp; A service with a 99.5% uptime SLA has 3.6 hours of error budget per month.&nbsp; (3.6 hours is 0.5% of the 720 hours in a 30-day month.)&nbsp; Error budgets are used in two interesting ways:<\/p>\n\n\n\n<ul><li>First, if a service is exceeding its error budget, that service is not allowed to push new features until it gets its reliability under control.\u00a0 <strong>The error budget is a regulating mechanism for the service&#8217;s development<\/strong>!\u00a0<\/li><li>Second, if a service routinely does not consume its error budget, Google will <strong>intentionally induce an outage<\/strong> so that service consumers don\u2019t come to expect availability far beyond the SLA.<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" loading=\"lazy\" width=\"700\" height=\"467\" src=\"https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/sharon-mccutcheon-8a5eJ1-mmQ-unsplash-700x467.jpeg\" alt=\"Person counting money\" class=\"wp-image-476\" srcset=\"https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/sharon-mccutcheon-8a5eJ1-mmQ-unsplash-700x467.jpeg 700w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/sharon-mccutcheon-8a5eJ1-mmQ-unsplash-300x200.jpeg 300w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/sharon-mccutcheon-8a5eJ1-mmQ-unsplash-768x512.jpeg 768w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/sharon-mccutcheon-8a5eJ1-mmQ-unsplash-1536x1024.jpeg 1536w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/sharon-mccutcheon-8a5eJ1-mmQ-unsplash-2048x1365.jpeg 2048w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/sharon-mccutcheon-8a5eJ1-mmQ-unsplash-175x117.jpeg 175w\" 
sizes=\"(max-width: 700px) 100vw, 700px\" \/><figcaption>Use a budget for errors, just like you budget for money! Photo by <a href=\"https:\/\/unsplash.com\/@sharonmccutcheon?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Sharon McCutcheon<\/a> on <a href=\"https:\/\/unsplash.com\/s\/photos\/budget?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Unsplash<\/a><\/figcaption><\/figure>\n\n\n\n<p>I also appreciated the clarity in Google\u2019s monitoring philosophy.&nbsp; It includes only three outputs:<\/p>\n\n\n\n<ul><li>Alerts &#8211; A human must act immediately<\/li><li>Tickets &#8211; A human must act within a few days<\/li><li>Logging \u2013 For historical record keeping only<\/li><\/ul>\n\n\n\n<p>If an alert cannot be acted on, it should not be an alert!&nbsp; For example, systems which frequently email status are using an alerting mechanism for something that should be a log.&nbsp; A human receiving regular emails from a system will eventually just delete them all, rather than keeping a watchful eye on their email.&nbsp;&nbsp; Instead, automation should be used so that humans are alerted only when they should act.<\/p>\n\n\n\n<p>The last major insight into how to scale exponentially with linear amounts of people is standardization.\u00a0 The Google process does not always use mandates but provides easy defaults.\u00a0 For instance, services should expose metrics for white-box monitoring.\u00a0 The convention at Google is to use <code>\/varz<\/code> as the monitoring endpoint. 
\u00a0And there is a standard library which provides methods to expose metrics in a standard format (using <code>\/varz<\/code>, of course).\u00a0 The Google \u201cPre-Release Checklist\u201d is not just a list of rules but a list of standard, concrete suggestions to implement those rules.<\/p>\n\n\n\n<p><strong>Hope is not a strategy<\/strong>.\u00a0 This book demonstrates some very concrete strategies to achieve highly reliable software.\u00a0 I highly recommend it!<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" loading=\"lazy\" width=\"700\" height=\"814\" src=\"https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/jared-rice-NTyBbu66_SI-unsplash-700x814.jpeg\" alt=\"Woman doing yoga\" class=\"wp-image-477\" srcset=\"https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/jared-rice-NTyBbu66_SI-unsplash-700x814.jpeg 700w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/jared-rice-NTyBbu66_SI-unsplash-258x300.jpeg 258w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/jared-rice-NTyBbu66_SI-unsplash-768x893.jpeg 768w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/jared-rice-NTyBbu66_SI-unsplash-1321x1536.jpeg 1321w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2021\/11\/jared-rice-NTyBbu66_SI-unsplash-1762x2048.jpeg 1762w\" sizes=\"(max-width: 700px) 100vw, 700px\" \/><figcaption>Use site reliability engineering to plan for success, and breathe easily! 
Photo by <a href=\"https:\/\/unsplash.com\/@jareddrice?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Jared Rice<\/a> on <a href=\"https:\/\/unsplash.com\/s\/photos\/yoga?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Unsplash<\/a><\/figcaption><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Book Review: Site Reliability Engineering I\u2019ve just finished reading the 2016 book Site Reliability Engineering: How Google Runs Production Systems.\u00a0 I\u2019ve spent 20 years as a developer and I\u2019ve appreciated the DevOps philosophy, but this&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/473"}],"collection":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/comments?post=473"}],"version-history":[{"count":3,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/473\/revisions"}],"predecessor-version":[{"id":480,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/473\/revisions\/480"}],"wp:attachment":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/media?parent=473"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/categories?post=473"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/tags?post=473"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}