{"id":187,"date":"2017-08-23T01:14:17","date_gmt":"2017-08-23T01:14:17","guid":{"rendered":"http:\/\/freedville.com\/blog\/?p=187"},"modified":"2019-04-12T14:08:28","modified_gmt":"2019-04-12T14:08:28","slug":"book-review-the-bad-data-handbook","status":"publish","type":"post","link":"https:\/\/freedville.com\/blog\/2017\/08\/23\/book-review-the-bad-data-handbook\/","title":{"rendered":"Book Review: The Bad Data Handbook"},"content":{"rendered":"<p>I recently finished reading&nbsp;<a href=\"https:\/\/smile.amazon.com\/Bad-Data-Handbook-Cleaning-Back-ebook\/dp\/B00A3IGAIA\/\">The Bad Data Handbook: Cleaning Up The Data So You Can Get Back to Work<\/a> by O&#8217;Reilly Publishing. &nbsp;This book is a collection of 18 essays about the art and science of Data Science. &nbsp;The essays vary from high-level data science problems to technical deep dives on interesting problems. &nbsp;If you are taking part in a cognitive or data science project, no matter your role, there is something in this book for you.<\/p>\n<p>The first few essays, &#8220;Is It Just Me, or Does This Data Smell Funny?&#8221;, &#8220;Data Intended for Human Consumption, Not Machine Consumption&#8221;, and &#8220;Bad Data Lurking in Plain Text&#8221;, set the stage for business owners and project managers. &nbsp;These essays lay out in clear language some common hurdles that data science projects need to jump over. &nbsp;It&#8217;s often said that more time is spent doing <a href=\"https:\/\/www.nytimes.com\/2014\/08\/18\/technology\/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html\">data janitor<\/a> work than data science, and these essays help explain why.<\/p>\n<p>Another batch of essays discusses how to check your assumptions about data before and after analysis. 
&nbsp;&#8220;When Data and Reality Don&#8217;t Match&#8221; walks the reader through an exercise in modeling the stock market, which seems easy until you consider that stocks split, move to new ticker symbols, or revive &#8220;old&#8221; ticker symbols. &nbsp;&#8220;Will the Bad Data Please Stand Up&#8221; has a great story about a data science project intended to predict defects on a manufacturing line, which was cancelled when back-of-the-envelope calculations showed the measurement equipment was not precise enough to produce input data. &nbsp;&#8220;Blood, Sweat, and Urine&#8221; illustrates the importance of using automation in your data gathering pipeline. &nbsp;&#8220;Subtle Sources of Bias and Error&#8221; is a treatise on how to deal with &#8220;incomplete&#8221; data &#8211; the bane of many data scientists.<\/p>\n<p>My favorite essay was &#8220;Detecting Liars and the Confused in Contradictory Online Reviews&#8221;. &nbsp;The essay covered a typical classification process used to sort plain-text reviews into 1-star and 5-star buckets. &nbsp;Hijinks ensued when the author found sarcastic, nefarious, and just plain confused ratings, where users knowingly or unknowingly wrote scathing 5-star reviews or glowing 1-star reviews. &nbsp;This was a great example of continuously improving a data science model and also a primer on how to check your ground truth quality.<\/p>\n<p>No matter your background or role, the&nbsp;<a href=\"https:\/\/smile.amazon.com\/Bad-Data-Handbook-Cleaning-Back-ebook\/dp\/B00A3IGAIA\/\">Bad Data Handbook<\/a> has something for you if you want to learn more about data science.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I recently finished reading&nbsp;The Bad Data Handbook: Cleaning Up The Data So You Can Get Back to Work by O&#8217;Reilly Publishing. 
&nbsp;This book is a collection of 18 essays about the art and science of&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[2],"_links":{"self":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/187"}],"collection":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/comments?post=187"}],"version-history":[{"count":2,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/187\/revisions"}],"predecessor-version":[{"id":415,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/187\/revisions\/415"}],"wp:attachment":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/media?parent=187"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/categories?post=187"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/tags?post=187"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}