Boundary Conditions: A Systems Approach
After a few weeks talking about TCP and QUIC, this week we return to one of our favorite topics: platforms. Building platforms is often an exercise in making tradeoffs, which provides the impetus for this week’s post.
I recently attended the kickoff workshop for the FlexNet Project, where research on building more dynamically programmable networks was framed as a sort of Maslow’s Pyramid. The diagram is one Amin Vahdat originally presented at ONF Connect 2018 to convey the importance of Availability in everything Google does when building cloud infrastructure: it’s the foundation you have to get right before worrying about anything else, and no new feature or improvement higher in the pyramid is introduced unless it preserves or improves availability. Manageability and Feature Velocity follow, in that order of priority, and Performance improvements (shown at the top of the pyramid) are the least important consideration. (The layer labeled Stranding could be labeled “Efficiency” in the sense that it’s about making sure you have the right kind of capacity in the right place.)
Amin was talking about cloud infrastructure, but it strikes me that the idea also applies to computing systems in general, which, like infrastructure, often involve platforms that others (e.g., app developers) can build upon. Even Capacity Delivery is a consideration when we’re talking about scalable systems, although most of the time we’re focused on virtual rather than physical resources, in which case spinning up additional capacity is clearly part of the system’s management machinery. I realized this approach can be generalized to the sorts of systems I’ve been involved with over my career.
First, and most directly, the pyramid is a great visualization of the “systems approach” that Bruce and I talk about all the time—in the sense that every system addresses a set of requirements, but the real challenge is understanding how to trade those requirements off against each other. Individually, no one can argue with the virtue of Manageability or Velocity or Availability (or whatever your favorite set of -ities might be), but each system should have a clear set of priorities. In the case above:
Availability > Manageability > Velocity > Stranding > Performance
And as a corollary, any obsession with just one dimension of the design might succeed in reaching a local maximum, but be rendered irrelevant (or worse, counterproductive) in the larger scheme of things. This is just “System Design 101”, but it’s worth repeating.
Second, while the systems research community has long been focused on Performance and Availability—in part because both lend themselves to quantitative evaluation, but also because those experiments are easy to control by isolating the problem and ignoring the rest of the pyramid—Velocity is now also getting its fair due. Coming up with good metrics is still a challenge, but “Software-Defined Anything” is now widely accepted as a credible research agenda. And new efforts like FlexNet take that a step further by trying to make software-defined systems even more dynamically upgradable. (Now if only we could go back and introduce more dynamicity into our most mature software-based platform—the OS—we’d have real progress. Maybe eBPF is the answer, but that’s probably a discussion for another time.)
Third, what I find most interesting is this: once you’ve prioritized your requirements, how do you manage the boundaries and interactions between them? Here are two examples of the interplay between Velocity and Manageability, which are often in conflict. The first is one I encountered over a decade ago while working at a CDN startup. (I’ve come to think of a CDN as a PaaS for content providers, who both publish content and program how it is delivered.) In any startup, feature velocity is paramount, and in practice (even if not in principle) it is often prioritized over manageability. This leads to the following all-too-common scenario: Customer asks for feature X; Engineers implement feature X; Operators are unable to change the cryptic config-file for feature X without help from Engineers. That’s an approach to manageability that does not scale: either you prioritize “plumbing” every feature through to a well-defined management interface, or you pay for your architectural debt at some point down the road. (More on that after my second example.)
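To make the scenario concrete, here is a minimal sketch of what “plumbing a feature through” might look like. The feature (a per-customer cache TTL override), the class names, and the validation rules are all hypothetical; the point is that the feature lands behind a typed, validated management surface rather than a cryptic config file only Engineering can edit.

```python
from dataclasses import dataclass

# Hypothetical "feature X": a per-customer cache TTL override.
@dataclass
class CacheOverride:
    customer_id: str
    ttl_seconds: int

    def validate(self):
        # Reject nonsense at the boundary, so operators get a clear
        # error instead of a silently broken config.
        if not self.customer_id:
            raise ValueError("customer_id must be non-empty")
        if not (0 < self.ttl_seconds <= 86400):
            raise ValueError("ttl_seconds must be in (0, 86400]")

class ManagementAPI:
    """A well-defined management surface operators can drive directly."""
    def __init__(self):
        self._overrides = {}

    def set_override(self, override):
        override.validate()
        self._overrides[override.customer_id] = override

    def get_override(self, customer_id):
        return self._overrides.get(customer_id)

api = ManagementAPI()
api.set_override(CacheOverride("acme", ttl_seconds=600))
print(api.get_override("acme").ttl_seconds)  # 600
```

The extra work is exactly the “plumbing” the example describes: the validation and the API surface don’t ship the feature any faster, but they are what let Operators change it without paging Engineers.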
The other example is one that’s recently emerged due to SDN, and we’ve had to address it while building private 5G Connectivity-as-a-Service for enterprises. Suppose you have a programmable data plane, and perhaps have even figured out how to reprogram it on-the-fly with no negative impact on current packet flows. If you add a new feature, you likely also need to upgrade the control plane. The joint development of P4 (for programming the forwarding pipeline) and tooling around P4Runtime (to auto-generate the control API) is a huge step forward on this front. But if you update the control plane, you may also need to update the management plane, and this needs to be coordinated across all three planes. This is where the challenge is the same as in the first example: we need a well-defined management API.
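The coordination problem can be sketched in a few lines. This is not how P4Runtime tooling actually works; it is just an illustration, under the assumption that each plane advertises a schema version and an upgrade is safe only if all three planes remain compatible afterward. The plane names and the major.minor compatibility rule are invented for the example.

```python
# Each plane reports the schema version it currently speaks.
PLANES = {"data": "1.4.0", "control": "1.4.0", "management": "1.3.2"}

def compatible(versions):
    """Illustrative rule: all planes must share the same major.minor."""
    majors_minors = {tuple(v.split(".")[:2]) for v in versions.values()}
    return len(majors_minors) == 1

def safe_to_upgrade(planes, plane, new_version):
    """Would upgrading one plane leave all three planes compatible?"""
    proposed = dict(planes)
    proposed[plane] = new_version
    return compatible(proposed)

# The management plane above lags at 1.3.x, so the system as a whole
# is not yet consistent -- exactly the cross-plane coordination problem.
print(compatible(PLANES))  # False
```

The sketch makes the point of the paragraph explicit: a feature added at one layer isn’t “done” until the versions (or more realistically, the APIs) of all three planes have been brought back into agreement.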
Part of the solution comes in the form of modern DevOps tooling. The availability of declarative configuration specs such as Kubernetes CRDs, Helm Charts, and Terraform Templates is an improvement over having to deal with dozens of one-off config files, but they assume it’s an operator who will be responsible for managing the system. If it’s an end-user of a multi-tenant, cloud-managed platform service (for example, a content provider in the case of a CDN or an enterprise admin in the case of a managed connectivity service), then management directives need to be codified in a programmatic API. And most importantly, this API has to be updated in sync with all the layers below it. There is no standard P4/P4RT-equivalent for doing this, but from my experience, a reasonable approach is to define a data model for the abstract service you are trying to provide (e.g., using a language like YANG), around which you build tools to auto-generate the Platform-as-a-Service API. This approach is the focus of one of the chapters in our Edge Cloud Operations book.
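As a rough sketch of the model-driven approach (not a real YANG toolchain, and with an invented “connectivity slice” resource as the abstract service), the idea is that a single declarative data model drives the generated API, so the two cannot drift out of sync:

```python
# Hypothetical data model for an abstract service, loosely inspired by
# the YANG approach: one declarative source of truth for the API.
SERVICE_MODEL = {
    "slice": {  # an enterprise "connectivity slice" (invented resource)
        "fields": {
            "name": str,
            "max_bitrate_mbps": int,
        },
    },
}

def make_api(model):
    """Auto-generate create/get handlers from the data model."""
    store = {kind: {} for kind in model}

    def create(kind, **attrs):
        spec = model[kind]["fields"]
        for field, ftype in spec.items():
            if field not in attrs:
                raise ValueError(f"missing field: {field}")
            if not isinstance(attrs[field], ftype):
                raise TypeError(f"{field} must be {ftype.__name__}")
        store[kind][attrs["name"]] = attrs
        return attrs

    def get(kind, name):
        return store[kind].get(name)

    return create, get

create, get = make_api(SERVICE_MODEL)
create("slice", name="factory-floor", max_bitrate_mbps=200)
```

Adding a field to `SERVICE_MODEL` automatically changes what the generated `create` accepts and validates, which is the property the paragraph argues for: the service API is derived from the model rather than maintained by hand alongside it.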
There’s one final lesson we can take away from these two examples that applies to the general challenge of managing the boundary between Manageability and Velocity. It’s one that we touched on in our post about Observability (an aspect of Manageability). A common refrain you hear is frustration about needing to “plumb the feature through to the API so it can be controlled” and needing to “instrument the feature so it can be observed”. The frustration comes from both sides: from operators who need to control and observe the system, and from developers who have to deal with the burden of making it so (often at the expense of the next feature on their to-do list). Acknowledging the friction at this boundary is the first step, but to me this challenge looks like an opportunity to build a platform. Management is clearly a first-class property of any system you build, but I would argue it can (and should) also be a platform in its own right, one that provides value by making it easier to “control and observe” new features. Service meshes like Istio do exactly that, but whatever your mechanism of choice, the goal should be to support feature velocity by streamlining how those features are managed.
Our podcast with Robbie Mitchell of APNIC has been published – you can hear Larry and Bruce explain the systems approach and how a protocol like QUIC exemplifies it. Lots of people are offering hot takes about the Optus data breach; we’ll wait for more analysis before giving ours, but there has to be an interesting story about how a customer database got plumbed into an open API requiring no authentication. The Nobel Prize in physics went to a trio of researchers who demonstrated the weird consequences of quantum entanglement, which happens to lie at the heart of quantum computing.
Welcome to all the new subscribers who joined us in the last fortnight!