Omission by Design
On Building and Describing Systems
This week we’re shifting gears a little from our normal technical topics to focus on the process of writing about systems, which is obviously a big part of what we do at Systems Approach. We have come to realize that writing about systems has a lot in common with building them, as some of our recent books illustrate.
One of the secrets of writing that I’ve come to appreciate over the years is knowing what to leave out. This starts with deciding what topics are in scope versus out of scope, and then, for the selected scope, knowing what’s important to say and what can go unstated. The best (and most entertaining) discussion of this topic I’ve come across is John McPhee’s Draft No. 4: On the Writing Process, a book that helped me appreciate how much of my job as a CS professor was spent practicing creative nonfiction (as a researcher) and teaching it (as an advisor). McPhee makes many insightful observations about the writing process, but the one that particularly resonates with me (especially with my systems hat on) is his chapter on omission.
Deciding what is in scope versus out of scope is something all authors face. Our TCP Congestion Control book forced us to make many decisions to leave stuff out, for the simple reason that we had to select a small subset of the hundreds of available algorithms to report on, choosing the ones that illustrate the fundamental principles at work in congestion control. Since the first edition of Computer Networks: A Systems Approach, we have tried to provide perspective on what is fundamental, not to produce an encyclopedia. As another example, we made a decision to treat QoS as out of scope, but we did so by first mapping the larger design space for resource allocation, and then declaring what parts of that space are not covered in the rest of the book. To my eye, there is a direct connection between this aspect of writing about a system and the process of distinguishing between requirements and non-requirements when building a system.
Deciding where to draw the line between important concepts and details that can be safely omitted has obvious parallels with the art of defining abstractions, which anyone trained in Computer Science will have an appreciation for. The high-level concepts are codified as modules, the implementation details go unstated, and the relationships and dependencies among modules are made explicit. I’m a strong proponent of clearly describing a system you’re building (beyond simply documenting its API) as an integral part of the design process. Writing helps with clarity of thought, which is important to the simplicity, consistency, and completeness of a system’s design. My experience is that if you can’t clearly describe a system (or component) you’ve built, you’re probably not done designing it yet.
Our experience writing the Edge Cloud Operations book highlights a less obvious (but equally important) aspect of omission: knowing what you don’t know. The book—which describes a system we also built—focuses on three stakeholders: cloud operators who manage one or more edge clouds, enterprise users who take advantage of services running on an edge cloud, and service developers who build those edge services. The relationships among these three stakeholders are well understood by the hyperscalers when applied to the central clouds they’ve built, which support two models: (1) the hyperscalers offer their own services (e.g., AWS DynamoDB), so users see no distinction between the cloud operator and the service developer; or (2) the hyperscalers provide infrastructure (and building-block services) to third-party service developers, who in turn interact with users, hiding the cloud operator as an implementation detail.
The complication is that these relationships are more nuanced when the edge cloud is deployed in enterprises. In our approach to building (and describing) edge cloud operations, we decided to support the possibility of all three stakeholders being distinct: Operator A manages an edge cloud running on-prem at Enterprise B, delivering services built by Service Developers X, Y, and Z. And while it is a viable option for an enterprise to select one of the existing hyperscalers to also operate its edge cloud (and they are more than happy to do that, with Google’s Anthos, Microsoft’s Azure Arc, and Amazon’s ECS-Anywhere being prime examples), a major point of the book is to show that the bar for an enterprise being its own cloud operator is not insurmountable. This commonly means that edge services are deployed on Kubernetes clusters and optionally paired with centrally hosted services. So how would that work with respect to the stakeholders?
One option is for the edge cloud operator to mimic the hyperscaler model: start with an IaaS foundation and give each developer the freedom to acquire vanilla infrastructure and then build whatever operational support they want on top of that infrastructure; for example, they get to provision their own Kubernetes clusters and define their own lifecycle management toolchain. In our view, having to install and operate an IaaS layer before you can begin to deploy Kubernetes-based services is a non-starter for edge deployments because it dramatically increases the complexity, so we looked for other possibilities.
A second option is for the edge cloud operator to prescribe a particular operational platform (e.g., the book describes a lifecycle management toolchain constructed from a combination of RKE, Jenkins, Rancher, Terraform, and Fleet), which forces the service developer to either adapt to those practices or make their service available as standard artifacts (e.g., Docker images and Helm charts) that can be fed into different CI/CD pipelines. This option shifts responsibility (and pain) from the operator to the developer, which could prove equally problematic.
A third option is to accept that a multi-cloud will emerge within enterprises. Today, most people equate multi-cloud with services running across multiple hyperscalers, but with edge clouds becoming more common, it is possible (if not likely) that enterprises will invite multiple edge clouds onto their local premises, some hyperscaler-operated and some not, each hosting a different subset of edge services. This approach again shifts the burden, this time onto the enterprise.
None of these options is ideal; each shifts the burden to one of the other stakeholders. So how did we resolve this in the book? We didn’t. In fact, it was while trying to write the architectural overview chapter that we came to appreciate that we were butting up against a hard problem we could not immediately solve. The book outlines the design space, but we did not let the lack of resolution on this question delay or limit the value provided by the rest of the book (or the system the book describes). This is consistent with one of my favorite system design principles: solve no problem before its time. McPhee’s essay on omission makes a similar point with an anecdote about the author, then a 19-year-old college sophomore, meeting General Eisenhower five years after the end of World War II. I can’t possibly do the story justice, so I urge you to read it for yourself. But the takeaway is this: systems, books, and even paintings are often improved by what we decide to leave out.
Our prior post on quantum computing turned out to be more timely than we expected, as it was followed a few days later by the announcement that NIST had selected its first four algorithms for post-quantum cryptography. The Economist did a nice job reporting the news in context, even noting, as we did, that the powers of quantum computers “will be limited to a smallish class of problems,” contrary to much of the hype. That this small class includes most of public-key cryptography underpins the need for agility in cryptographic algorithms.