Book Review: The Bad Data Handbook

I recently finished reading The Bad Data Handbook: Cleaning Up The Data So You Can Get Back to Work, published by O’Reilly. This book is a collection of 18 essays about the art and science of data science. The essays range from high-level data science problems to technical deep dives on specific ones. If you are taking part in a cognitive or data science project, no matter your role, there is something in this book for you.

The first few essays, “Is It Just Me, or Does This Data Smell Funny?”, “Data Intended for Human Consumption, Not Machine Consumption”, and “Bad Data Lurking in Plain Text”, set the stage for business owners and project managers. These essays lay out in clear language some common hurdles that data science projects need to clear. It’s often said that more time is spent doing data janitor work than data science, and these essays help explain why.

Another batch of essays discusses how to check your assumptions about data before and after analysis. “When Data and Reality Don’t Match” walks the reader through an exercise in modeling the stock market, which seems easy until you consider that stocks split, move to new ticker symbols, or revive “old” ticker symbols. “Will the Bad Data Please Stand Up” has a great story about a manufacturing line that planned to use a data science project to predict defects; the project was cancelled when back-of-the-envelope calculations showed the measurement equipment was not precise enough to produce usable input data. “Blood, Sweat, and Urine” illustrates the importance of automation in your data gathering pipeline. “Subtle Sources of Bias and Error” is a treatise on how to deal with “incomplete” data, the bane of many data scientists.

My favorite essay was “Detecting Liars and the Confused in Contradictory Online Reviews”. The essay covers a typical classification process used to sort plain-text reviews into 1-star and 5-star buckets. Hijinks ensued when the author found sarcastic, nefarious, and just plain confused ratings, where users knowingly or unknowingly wrote scathing 5-star reviews or glowing 1-star reviews. This was a great example of continuously improving a data science model and also a primer on how to check the quality of your ground truth.

No matter your background or role, The Bad Data Handbook has something for you if you want to learn more about data science.
