In our last newsletter we noted how hard it is to build and operate a distributed system in such a way that it handles all the unanticipated failures that may affect it. This week, we’re reflecting on the experience we’ve had meeting those challenges and the related issues of operating cloud systems, which forms the basis for our latest Systems Approach book.
The most rewarding experience of my 30+ year research career was building and running PlanetLab, which I reflected upon when we decommissioned it last year. Of the many lessons it taught me, appreciating the challenge of operationalizing an Internet-scale distributed system that runs continuously—through hundreds of bug fixes, security patches, feature enhancements, software upgrades, and hardware refreshes—is at the top of the list. That’s now the core job description for thousands of DevOps engineers responsible for cloud infrastructure, but at the time we started PlanetLab in 2002, our small team was flying blind… and we often ran into obstacles, sometimes repeatedly, before figuring out how to get past them.
The tooling available today is light-years ahead of where we were 15+ years ago, thanks in large part to open source made available by the cloud providers that were on a similar (but certainly higher-stakes) journey. A great example is Google’s Borg begetting Kubernetes, which then catalyzed a vibrant ecosystem of provisioning, integration, deployment, and management tools. Many times over the years (with both PlanetLab and commercial edge clouds I’ve worked on), we’ve decided to abandon a home-grown mechanism because a better community-supported and widely adopted solution had become available.
This is something I’ve been thinking a lot about recently, as edge clouds gain momentum in the marketplace. Operating a cloud is fundamentally about balancing feature velocity against service stability, and even though the tooling continues to improve, the underlying problem hasn’t gotten any easier. More importantly, the design principles that we came to understand when we were trying to make progress with duct tape and baling wire remain unchanged. Sometimes those principles are easy to overlook when your tools become more powerful and complex, but they are what’s enduring.
As a result of that experience, I’ve come to view the challenge of operationalizing a cloud as a classic systems problem of managing state—configuration state, but a collection of state variables nonetheless. This reduces to controlling who gets to write each variable, ensuring that every component that needs to read a variable eventually does, synchronizing access among multiple readers and writers, recovering from partial failures, and so on. That’s the academic in me reducing the problem to first principles—which I realize is of little comfort to a practitioner in the middle of a git merge that will trigger a re-deployment—but I have found this systems lens is helpful (if not essential) when trying to tease apart the many factors that are so easily conflated.
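To make the "managing state" framing concrete, here is a minimal Python sketch of a store in which each variable has exactly one designated writer and every reader is eventually told the current value. The class, the variable name `dns.servers`, and the writer names are all hypothetical, chosen only for illustration; this is not the mechanism used in PlanetLab or in any particular tool.

```python
class ConfigStore:
    """Toy store: one designated writer per variable, readers notified on change."""

    def __init__(self):
        self._owners = {}    # variable name -> the only writer allowed to set it
        self._values = {}    # variable name -> current value
        self._watchers = {}  # variable name -> callbacks registered by readers

    def declare(self, name, owner):
        """Register a variable and the single party allowed to write it."""
        if name in self._owners:
            raise ValueError(f"{name} already has an owner: {self._owners[name]}")
        self._owners[name] = owner

    def write(self, name, value, writer):
        """Set a variable, rejecting anyone other than its declared owner."""
        if name not in self._owners or self._owners[name] != writer:
            raise PermissionError(f"{writer} may not write {name}")
        self._values[name] = value
        for callback in self._watchers.get(name, []):
            callback(name, value)

    def watch(self, name, callback):
        """Subscribe a reader; a late subscriber immediately gets the current value."""
        self._watchers.setdefault(name, []).append(callback)
        if name in self._values:
            callback(name, self._values[name])


# Hypothetical usage: one team owns the variable, one component reads it.
store = ConfigStore()
store.declare("dns.servers", owner="ops-team")
seen = []
store.watch("dns.servers", lambda name, value: seen.append((name, value)))
store.write("dns.servers", ["10.0.0.2"], writer="ops-team")
```

A real system would also have to cope with concurrent writers and partial failures, which this single-process sketch deliberately ignores.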
One lesson that we kept learning (and relearning because the root problem manifests in different ways) is the importance of establishing a single source of truth for every configuration variable. This forces you to think critically about (a) the distinction between variables that are written by humans and those that are derived from other variables, and (b) who is allowed to write each “source” variable. There are several best practices that follow from this focus on the source of truth, such as: making the variables explicit rather than burying them in the middle of scripts; version controlling the source variables, so it is possible to roll back to previous values; maximizing the state that is derived, which minimizes the state humans are responsible for setting/changing; recognizing that schema and model definitions are a kind of configuration state; making operations on configuration state idempotent; and programming components to assume they are not the source of truth. This is by no means an exhaustive list, but we have started the process of documenting these lessons.
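Several of these practices can be sketched in a few lines of Python: a small set of explicit source variables, everything else derived from them, and an apply step that is idempotent. The variable names (`site`, `num_servers`, `base_port`) and the functions `derive` and `apply` are assumptions made up for this illustration, not part of any real tool.

```python
import copy

# Source variables: the only values humans edit (and would keep under version control).
SOURCE = {"site": "edge-01", "num_servers": 3, "base_port": 8000}


def derive(source):
    """Compute every derived variable from the source variables; nothing is hand-set."""
    hosts = [f"{source['site']}-node{i}" for i in range(source["num_servers"])]
    return {
        "hosts": hosts,
        "endpoints": {host: source["base_port"] + i for i, host in enumerate(hosts)},
    }


def apply(desired, running):
    """Idempotently converge `running` toward `desired`; return the keys that changed."""
    changes = []
    for key, value in desired.items():
        if running.get(key) != value:
            running[key] = copy.deepcopy(value)
            changes.append(key)
    # Remove anything in the running state that is no longer desired.
    for key in list(running.keys() - desired.keys()):
        del running[key]
        changes.append(key)
    return changes


running = {}                             # stands in for the deployed system's state
first = apply(derive(SOURCE), running)   # the first run makes changes
second = apply(derive(SOURCE), running)  # the second run finds nothing to do
```

Because the derived state is a pure function of the source variables, rolling back amounts to re-running `derive` and `apply` on an earlier version of `SOURCE`.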
There are, of course, many pragmatic issues that influence implementation choices, such as how frequently a given variable is likely to change, but these are secondary to correctness (the goal is to deploy the intended configuration every time) and repeatability (it should be possible to update and redeploy the system continuously). This brings me back to the available tooling, which, by its nature, comes with baked-in engineering decisions. Many of those tools encourage best practices (the Configuration-as-Code paradigm championed by Helm and Terraform is a great example), but (1) it’s not easy to select the right tool for the situation when there are so many to choose from, and (2) there are hard problems that no amount of tooling can correct for. These were on my mind while writing the first draft of our new book: Operating Edge Clouds: A Systems Approach.
On the first point, there are (at least) several dozen DevOps-related open source projects available, and navigating the project space is one of the biggest challenges we faced in putting together the cloud management platform we used as a case study in the book. This is in large part because these projects are competing for mindshare, with both significant overlap in the functionality they offer and implicit dependencies on each other. Keeping in mind that there is no single tool that solves all problems (despite what you might read in the project description), the ultimate challenge is to assemble the available parts into a coherent end-to-end system. In many ways, the book turned out to be a retrospective on that decision-making process. For example, one of the tools we started with, Rancher, ended up playing a much smaller role than we originally expected it to. Rancher is not a unique example: we integrated over 20 narrowly targeted components, and again and again we had to decide which features of each tool not to use.
On the second point, the problem remains hard (and the solution elusive) for several reasons. In some cases, there are variable settings that originate in external systems, which is to say, the single source of truth is sometimes a process (or service) that you have to query. These sources need to be incorporated into an end-to-end solution rather than treated as exceptions. This happened in the 5G-capable edge cloud we’ve been building, for example, because it’s necessary to call a remote Spectrum Access System (SAS) to learn how to configure the radio settings for the small cells you’ve deployed. Naively, you might think that’s a variable you could pull out of a YAML file stored in a git repository. In other cases, it’s problematic not to take UX requirements into account, for example, by assuming a component-specific config file is the source of truth, when in fact a class of end users expects to be able to change a subset of the configuration variables at runtime.
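One way to treat an external service as a first-class source of truth, rather than an exception, is to give every variable a source that is resolved the same way whether it lives in a file or behind a remote call. The Python sketch below assumes this design for illustration only. The function `fake_spectrum_query` is a hypothetical stand-in for the remote spectrum service, not its real API, and the radio values it returns are made up.

```python
from typing import Callable, Dict

# Each variable names exactly one source of truth: a callable that returns its value.
Source = Callable[[], object]


def from_file(settings: Dict[str, object], key: str) -> Source:
    """Source backed by a checked-in settings file (modeled here as a dict)."""
    return lambda: settings[key]


def from_service(query: Callable[[str], object], cell_id: str) -> Source:
    """Source backed by an external service that must be queried when resolving."""
    return lambda: query(cell_id)


def resolve(sources: Dict[str, Source]) -> Dict[str, object]:
    """Build the full configuration by asking every variable's source for its value."""
    return {name: source() for name, source in sources.items()}


def fake_spectrum_query(cell_id: str) -> dict:
    """Hypothetical stand-in; a real client would make a network call here."""
    return {"cell": cell_id, "channel_mhz": 3560, "max_eirp_dbm": 30}


repo_settings = {"site_name": "edge-01"}  # what a YAML file in git might hold
sources = {
    "site_name": from_file(repo_settings, "site_name"),
    "radio": from_service(fake_spectrum_query, "cell-7"),
}
config = resolve(sources)
```

The same pattern could accommodate variables that end users change at runtime, by adding a source that reads from whatever service accepts those changes.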
The bottom line is that there are many moving parts involved in operationalizing a cloud, and while dealing with that complexity has largely been left to the hyperscalers, the migration of the cloud into enterprises and other edge locations brings the problem front and center for many. The reality is that embedding the cloud in edge environments exposes it to many local variables, all of which have the potential to be the source of truth. Using an open source edge cloud as a case study, we have undertaken to provide a tour of the available tools and document the design principles that should be brought to bear on that challenge.
Last week we got another chance to talk about the Systems Approach (and a wide range of networking topics) with Greg Ferro over at Packet Pushers. Bruce posted a video on technical leadership on his YouTube channel, and Larry’s tutorial on the 5G-Connected Edge Cloud is also available (Part 1, Part 2), co-presented with Jen Rexford and Nate Foster.