Service Mesh and the Goldilocks Zone

Last month, Larry pointed out that service mesh can be thought of as “SDN for Layer 7”, which got me thinking about how service mesh relates to the traditional networking stack. This week we’ll dive deeper into how service mesh applies some architectural principles about protocol layering in a new context.

Like many people in the networking field, I learned about the 7-layer model as an established fact and I proceeded to approach networking problems by ignoring whatever was going on above or below the layer I was currently working on. For much of my early career I was able to stay focused on layers 2 through 4, occasionally cursing layer 1 when I had to polish the end of an optical fiber to improve the reliability of a SONET link. 

A talk that I still remember well from those days was one given by David Clark of MIT at the 1990 SIGCOMM conference, entitled “Architectural Considerations for a New Generation of Protocols”. For one thing, it motivated me to become a better public speaker when I saw how much impact Dr. Clark’s presentation had on the audience. But it also shook me out of my layerist comfort zone with its twin ideas of Application Layer Framing (ALF) and Integrated Layer Processing (ILP). To summarize them very briefly, ALF says that only the application really knows how its data is to be used, so it’s best positioned to break a stream of data into frames. And ILP argues that having a layered architecture doesn’t require you to implement the layers strictly in isolation; in fact, you are likely to suffer performance problems if you do so.

In other words, this talk gave me my first experience of taking a systems view of networks: you have to look at the overall system behavior, considering interactions among layers all the way up to the application, to properly understand how to design and implement a system. This, along with many other interactions with “systems people”, led to my embracing the Systems Approach. (My willingness to ignore the numerous people who insist that MPLS is a layer violation can probably also be traced back to that talk, but that’s a story for another day.)

With this background, I see service mesh as something of a corollary of the ALF/ILP ideas. The sidecar proxies of a service mesh run as close to the application as possible, because they need fine-grained visibility into application behavior, just as ALF argued that tasks like the construction of data frames and dealing with lost frames belong in the application layer. 

The service mesh is also an example of “The Goldilocks Zone”, in that the service mesh sidecar needs to be close to the application but not too close. The Goldilocks Zone refers to the zone for habitable planets (neither too hot nor too cold), and was used as an analogy by Martin Casado and Tom Corn beginning in 2014 to explain why security features, such as distributed firewalling, should be as close as possible to an application without actually residing in the host operating system. Their idea was that security features needed to balance isolation against context. Too close to the application and your security features could be disabled by an attacker; too far away, and there wasn’t enough context to understand what was being secured, as when a firewall tries to figure out what an application is doing based only on the packets that have left the endpoint. In a network virtualization system, the virtual switch sitting in the hypervisor provides that “just right” spot that is neither in the host OS nor sitting far away in a network appliance. Endpoint detection and response (EDR) systems also took advantage of this idea.

In a service mesh, the tradeoff is similar. The service mesh data plane needs to be close enough to the application to have the necessary context. In particular, the sidecar needs to be able to see unencrypted traffic right at the point that it enters or leaves the microservice, because once the traffic is encrypted there is no way to tell one API request from another. Sitting next to the microservice in a sidecar allows security controls to be applied at the appropriate granularity and gives sufficient visibility of events to provide a high degree of observability.
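To make the granularity point a little more concrete, here is a minimal sketch in Python of the kind of per-request decision a sidecar can make only because it sees plaintext. Everything in it is hypothetical and chosen for illustration (the `Request` and `Rule` types, the `is_allowed` function, and the service names); it is not the policy model of any particular service mesh.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Request:
    """Fields that are only visible before encryption or after decryption."""
    source: str  # caller identity (assumed to come from the peer's certificate)
    method: str  # HTTP method
    path: str    # request path


@dataclass(frozen=True)
class Rule:
    """Hypothetical allow rule: who may call which method on which path prefix."""
    source: str
    method: str
    path_prefix: str


def is_allowed(request: Request, rules: List[Rule]) -> bool:
    """Default deny: a request passes only if at least one rule matches it."""
    return any(
        request.source == rule.source
        and request.method == rule.method
        and request.path.startswith(rule.path_prefix)
        for rule in rules
    )


# Illustrative policy for an imaginary "orders" microservice.
RULES = [
    Rule(source="frontend", method="GET", path_prefix="/orders"),
    Rule(source="billing", method="POST", path_prefix="/orders/refund"),
]
```

A firewall that sees only encrypted packets between two endpoints could not distinguish the frontend reading an order from the frontend attempting a refund; with the request fields in hand, the two cases are a one-line check apart.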

At the same time, we don’t want the service mesh functionality to reside inside the application. Here the argument differs slightly from Casado and Corn’s use of Goldilocks, in that it is not about isolation so much as it is about building a general purpose platform that supports all applications. The scenario that a service mesh avoids is one in which every application developer writes their own encryption code (or, more likely, picks up some library that then needs to be maintained), figures out how to observe events, manages certificates, etc. Instead, a service mesh provides a platform of generally useful services including traffic management, access control, encryption, observability and so on. Not only are these implemented in a consistent manner without burdening every application team, they can also be managed by a team with the appropriate expertise, i.e., a team focused on platform issues such as security rather than the application teams themselves.
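The separation between platform and application can be sketched in a few lines of Python. This is only an in-process stand-in for what a real sidecar does as a separate proxy process, and all of the names (`legacy_service`, `Sidecar`, `allowed_prefixes`) are hypothetical; the point is simply that access control and observability are added around the application without editing it.

```python
import time
from typing import Callable, Dict, List

Handler = Callable[[str], str]


def legacy_service(path: str) -> str:
    """Stands in for existing application code that we do not want to modify."""
    return f"response for {path}"


class Sidecar:
    """Wraps an unmodified handler with platform features owned by another team."""

    def __init__(self, app: Handler, allowed_prefixes: List[str]) -> None:
        self.app = app
        self.allowed_prefixes = allowed_prefixes
        self.metrics: Dict[str, int] = {"allowed": 0, "denied": 0}
        self.latencies: List[float] = []

    def handle(self, path: str) -> str:
        # Access control: applied before the application ever sees the request.
        if not any(path.startswith(p) for p in self.allowed_prefixes):
            self.metrics["denied"] += 1
            return "403 forbidden"
        # Observability: time the call without touching the application code.
        start = time.perf_counter()
        response = self.app(path)
        self.latencies.append(time.perf_counter() - start)
        self.metrics["allowed"] += 1
        return response
```

For example, `Sidecar(legacy_service, ["/api"]).handle("/api/users")` passes the request through to the application and records a latency sample, while a request for `/admin` is rejected and counted without the application being invoked at all.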

There’s another benefit to keeping the service mesh functionality outside of the applications, which has been memorably explained by Louis Ryan, one of the Istio project leaders: it sidesteps the challenge of maintaining old code, which he compares to opening a can of surströmming (fermented fish). The separation of service mesh from the application allows all the features required for the platform to be inserted without opening up old pieces of code, and to be maintained independently.

Finally, it’s worth noting that performance issues arise in service mesh, just as they did with layered protocol processing in the 1990s. ILP was a reaction to the negative performance impact of strict layering as an implementation strategy, and the simplest implementation of sidecar proxies, which forces packets to make multiple user/kernel space traversals, suffers from the same issue. This is what is driving more optimized approaches such as eBPF and the Cilium project, which is a topic we’ll come back to.

At this point, I’m encouraged to see the high level of interest in service mesh, and I don’t think it’s just the technology flavor of the month. There is a set of solid architectural principles underpinning it and I expect its adoption among application platform teams to keep expanding. 

The article that probably did the most to fill my Twitter feed in the last month was “The Cost of Cloud, A Trillion Dollar Paradox”. Interestingly, a lot of the response to the article seemed to miss the statement “The point of this post isn’t to argue for repatriation…”. Worth a read, but you should make sure to read all the way to the end. Also, we had our first article syndicated in The Register, and we’ll have more of those to come.