I recall how hard it was to find the right information when I first started out. Some books on distributed systems were too theoretical; others felt like marketing material for a technology stack that would be obsolete in a few years. All I wanted was something pragmatic, a timeless classic. Only whitepapers of real systems satisfied me. But without a solid grasp of the fundamentals, I spent hours trying to connect the dots.
This is why I decided to write a manual that teaches the fundamentals of large-scale distributed systems. The book contains knowledge I have used over the years to solve concrete problems in systems that need to scale to millions of requests per second and billions of devices. But no matter the scale of the systems you work on today, the core principles are universal.
After reading the book, you are not going to look at a network call the same way. And you will start applying that knowledge from day one at your job and on personal projects. Armed with the core fundamentals, you will have the tools to understand technical whitepapers, build systems of your own, and nail interviews.
A solid understanding of the network stack is essential, as you can't build a distributed system without it. Even though each network protocol builds on top of another, sometimes the abstractions leak. If you don’t know how the stack works under the hood, you will have a hard time troubleshooting why your system is down or degraded for no apparent reason. On top of that, there is a lot you can learn from the design of the core protocols that can be applied to any distributed system, like TCP’s backpressure mechanisms.
Imagine some code that assigns a value to a variable. Then the same code reads the variable right after, only to find out the write had no effect! Madness! But with eventual consistency, this is what can happen when one machine writes a value to a store and another machine, or perhaps the same one, reads it.
This is where consistency guarantees come in, which define what can and can’t happen. Strong consistency guarantees make our lives easier. But to provide these guarantees, we need to find a way to make networked machines cooperate in harmony. In this chapter, we will explore how to achieve that by solving consensus.
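To make the stale read described above concrete, here is a minimal Python sketch. It is only a toy model under stated assumptions: a single primary that acknowledges writes immediately, a single replica that serves reads, and replication that happens later. The class and method names (`EventuallyConsistentStore`, `write`, `read`, `replicate`) are hypothetical and invented for illustration, not taken from the book or from any real data store.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy key-value store (hypothetical): one primary, one lagging replica."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = deque()  # writes not yet applied to the replica

    def write(self, key, value):
        # The write is acknowledged as soon as the primary has it.
        self.primary[key] = value
        self.pending.append((key, value))

    def read(self, key):
        # Assumption: reads are served by the replica, which may lag behind.
        return self.replica.get(key)

    def replicate(self):
        # Apply all pending writes to the replica, in the order they arrived.
        while self.pending:
            key, value = self.pending.popleft()
            self.replica[key] = value


store = EventuallyConsistentStore()
store.write("x", 42)
stale = store.read("x")   # the replica has not seen the write yet
store.replicate()
fresh = store.read("x")   # the replica has caught up
print(stale, fresh)       # prints: None 42
```

The first read misses the write because it reaches a replica that has not caught up. A stronger consistency guarantee would rule that outcome out, at the cost of coordination between the nodes.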
Now that we know how to make a set of nodes cooperate, we can dive into the patterns and architectures used to create horizontally scalable systems. We will start with the basics of sharding and replication and slowly transition into more advanced topics such as the implementation of load balancers, content delivery networks, and batch processing frameworks.
At scale, anything that can go wrong will go wrong. Writing distributed code is different from writing code that runs on a single machine. If you thought multi-threading was hard, think again.
The systems you build need to be robust against failures and unexpected events. Think of spikes of incoming requests and failing downstream dependencies. In this chapter, we will look into self-healing mechanisms that guard our systems against these agents of chaos.
You don't want your system to fall down in the middle of the night and find out about it the next morning through a Reddit post. No matter how elegant your design is, if the system lacks monitoring and logging, it’s doomed to fail. Nobody wants to be on call for a black box. In this chapter, you will learn the best practices for instrumenting and operating large-scale systems.
Get a preview of the book, which includes the introduction and the section on TCP.
The book is currently priced at $20 while in pre-release, and I plan to increase its price as I add more chapters. If you buy the book while in pre-release, you will have access to all future updates for free, and you can request a refund within 45 days, no hard feelings.
Or sign up to receive updates about the book.
Hi! My name is Roberto Vitillo. I have over 15 years’ experience in the tech industry as a software engineer, tech lead, and manager.
In 2017 I joined Microsoft to work on an internal data platform as a SaaS product. Since then, I have helped launch two public SaaS products, Product Insights and PlayFab. The data pipeline I am responsible for is one of the largest in the world. It processes millions of events per second from billions of devices worldwide.
Before that, I worked for Mozilla, where I wore different hats, from performance engineer to data platform engineer. What I am most proud of is having set the direction of the data platform from its very early days and built a large part of it, including the team.
The best way to breeze through a system design round is to learn how to design real systems. Who would have thought, right? Unlike algorithmic puzzles, it’s not an “interview-only” skill. It requires hard work and experience. But if you approach it the right way, then you will add a powerful tool to your toolbelt, one that will make you stand out from the crowd.
When I say system design, I mean the distributed kind. It used to be that large-scale system design questions were only asked by the likes of Amazon, Google, and Microsoft. But that’s no longer true. The reality is that nowadays we are all distributed systems engineers. And we need to understand the implications of building complex systems out of a networked mesh of simpler ones.
If you are a junior engineer, then you might be able to wing the design round and still get an offer. If you are a senior engineer, then you need to be able to design complex systems and dive deep into any vertical. You can be a world champion in balancing trees, but if you fail the design round, you are out. If you just meet the bar, then don’t be surprised when your offer is well below what you were expecting.