Designing Data-Intensive Applications by Martin Kleppmann is a tour de force, a wonderful book for anyone interested in learning about the kinds of applications where data is the key constraint. Kleppmann’s central thesis is that applications are rarely CPU-bound, especially given that CPUs are not getting much faster, and parallelism is on the rise. You’ve probably heard that Internet companies and hyperscalers achieve their amazing scale via horizontal scaling rather than vertical scaling (adding more servers instead of making each server bigger). This book fundamentally lays out how that is possible.
As a university student, I did not fully appreciate how relational databases worked so well until the instructor did a lesson on B-trees. I knew a primary law of computing: “’never’ go to the disk”, but could not fathom how that was possible. On seeing a B-tree I first thought “what an amazingly odd data structure”. Then I had a flash of insight, on how this data structure prevented you from searching a disk for some database record (assuming you had enough RAM to keep the B-tree in memory). That one lesson made four years of computer theory make sense. Designing Data-Intensive Applications had the same impact on me.
The book starts with a definition of scalability: a system’s ability to deal with increased load. Kleppmann reminds the reader to never say “X doesn’t scale” but instead to address “if a system grows in a certain way, how can we deal with the growth?”. (This is a much more thoughtful question than you often see on Internet forums!) That key point is worth repeating – scaling is not a binary question.
Kleppmann starts at the beginning. The book dives into common data structures used, even in single-system applications. I even got to see B-trees again! After each structure is introduced, he starts describing how these structures can work when more machines are added. “Horizontal scaling” seems easy to talk about, but Kleppmann describes how it actually works.
A key concern in data-intensive applications is how to keep the data consistent. Scaling is generally achieved by adding more copies of things, such as application servers, worker nodes, or database replicas, but it’s no small feat to keep those in sync. When you know that anything can fail, and you don’t want single points of failure, you’re inclined to add duplicates. But if there are two independent write operations to the same record, which should win? It’s challenging in a single-server scenario and even harder when multiple servers are involved. Kleppmann walks through all of the scenarios, the common solutions, and the tradeoffs involved.
A delightful bonus is that each chapter is adorned with a Tolkien-like map. Seeing “The Land of the B-Trees” and the “Log Structured Storage” in relation to the “Kingdom of Analytics” was a fun way to conceptualize the different kinds of databases one may encounter.
This book is a wonderful primer to any study of System Design and has been a wonderful complement to the microservices books I’ve been reading (Microservices In Action, Microservices Patterns, and Microservices: A Practical Guide). I’m glad I got to revisit many concepts from those books in a different context. Designing Data-Intensive Applications reflects a lifetime’s worth of knowledge and it’s easy to see how the concepts from this book have been with us since the dawn of computing. This book is well worth the read!