Regular readers will have noticed that service meshes show up in a lot of our posts (e.g. here and here). We have watched the space with interest for several years, and with last week’s announcement of Cilium Service Mesh, we decided it’s time to dig into the performance of service meshes.
One of the reasons that we pay so much attention to the development of service mesh technologies is that we believe this space is absolutely central to the future of networking. When networks only connected physical machines, we needed physical networks. Then with the rise of compute virtualization, we needed network virtualization to provide networking services to virtual machines. Now, with the rise of microservice architectures and APIs for everything, we need networking tools that are suited to the task of interconnecting services that communicate via APIs. To be clear, service meshes do a lot more than “just” networking–providing observability, for example–but then again, networking equipment has done a lot more than provide connectivity for many years as well.
One aspect of service meshes that we have glossed over in our prior posts is the non-trivial impact they can have on performance. For a number of good reasons, the standard way to implement a service mesh is to use a set of proxies (often called sidecar proxies as they sit alongside applications) as illustrated in the figure. Notably, the use of proxies saves the developer of a service from having to implement service mesh functionality, which remains the responsibility of the platform team. This team can then rely on standard mechanisms across all services (no matter who developed them or what language they used) to deliver platform capabilities such as observability, access control, traffic management and so on. Plus you can avoid opening cans of fermenting fish!
While the use of proxies makes lots of sense architecturally and operationally, it does impose a performance cost, and this has become an increasing concern as service mesh adoption grows and the functionality of the proxies becomes more complex. As Thomas Graf pointed out in an early talk on Cilium, when a microservice communicates with a proxy, traffic passes through the TCP stack on the way into and out of the proxy, consuming processing resources and accumulating latency as it goes. For example, one set of tests from the Isovalent team shows latency through the Envoy proxy to be about four times the native latency without the proxy. And this latency isn’t just an inconvenience: one of the roles of a service mesh is to measure the latency of requests and responses among services, so the last thing you want is to increase latency as soon as you try to measure it.
This is a classic tradeoff between performance and modularity of the sort that we’ve seen since the early days of networking. (See Clark and Tennenhouse on Integrated Layer Processing for an example from 1990 which has influenced my thinking ever since.) What we are seeing now is a quest to improve the performance of service meshes without giving up the modularity benefits. One of the approaches to address this challenge is Cilium, which can offer the functionality of service meshes at a considerably lower performance cost than existing sidecar proxies.
The key technology at the heart of Cilium is eBPF. The roots of eBPF are in the old Berkeley Packet Filter (BPF), which offered a programmable way to examine packets passing through the kernel. (These days eBPF isn’t an acronym for anything.) What eBPF provides is a safe way to insert code into the Linux kernel without modifying the kernel code itself. eBPF programs are compiled to platform-independent bytecode, which the kernel verifies and then JIT-compiles to native code before execution. This verification ensures that the program is guaranteed to complete, will not crash the kernel, and has appropriate privileges. eBPF programs operate in a sandboxed environment that limits the set of operations that they can perform. eBPF has proven to be a flexible way to allow innovations to make their way safely into the kernel without the lengthy cycle of modifying the actual kernel code.
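To make that a bit more concrete, here is a minimal sketch of an eBPF program written in restricted C in the libbpf style. It is purely illustrative and not part of Cilium; the names pkt_count and count_packets are my own. The program attaches at the XDP hook, counts packets in a map shared with user space, and passes every packet up the stack unchanged; the kernel’s verifier checks the compiled bytecode before it is ever allowed to run.

```c
// Minimal illustrative eBPF program (libbpf style). Compiled to bytecode with
// clang (-target bpf), verified by the kernel, then attached at the XDP hook.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* A single-slot array map shared between the kernel program and user space. */
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} pkt_count SEC(".maps");

SEC("xdp")
int count_packets(struct xdp_md *ctx)
{
    __u32 key = 0;
    __u64 *value = bpf_map_lookup_elem(&pkt_count, &key);

    if (value)                       /* the verifier requires this NULL check */
        __sync_fetch_and_add(value, 1);

    return XDP_PASS;                 /* let the packet continue up the stack */
}

char LICENSE[] SEC("license") = "GPL";
```

A program like this would typically be compiled with clang (targeting BPF) and loaded by a libbpf-based loader or attached to an interface with iproute2; the point is simply that the code runs inside the kernel, yet its safety is established before it ever executes.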
Both Cilium and eBPF have been around for a while–the first talk I heard on Cilium was in 2017 and eBPF goes back even further–and using Cilium to improve service mesh performance seems to have been the plan for many years. So it was a bit of a surprise to me to learn that a Cilium service mesh offering was only made available in the last few weeks. One reason for this is that the traditional proxies used to build service meshes (e.g. Envoy) have a lot of features, so just getting the eBPF-based data plane to rough parity with existing proxies is plenty of work. But a service mesh also has a control plane in addition to the data plane. In fact, there are several available control planes, of which Istio is probably the most well-known. The figure above, from the Istio documentation, illustrates the control plane providing multiple functions and interacting with the data plane. So if you want to offer a new data-plane approach to service mesh, which is what Cilium does, it’s necessary to integrate that data plane with one or more control planes, another non-trivial effort.
Further complicating matters is that not every feature that exists in today’s service meshes can be readily implemented using eBPF, so there are cases (such as TLS termination) where a proxy is still required. This leads to a hybrid approach where some traffic is handled by eBPF code while other traffic is still sent to a proxy.
Interestingly, even if proxies can’t be eliminated entirely when using eBPF, the number of proxies required can be dramatically reduced. Whereas a typical service mesh in a Kubernetes environment uses a sidecar proxy per pod, the hybrid approach enabled by eBPF can reduce the number of proxies to one per node, a considerable reduction in resource usage. This post by Liz Rice explains how that is achieved.
All of this leads me to believe that our interest in service mesh is well founded. It is going through a period of growth and innovation, and as we’ve seen many times before, you can’t treat performance as an afterthought when driving the adoption of a new technology. For example, consider how much work went into developing tunnel encapsulation formats that performed adequately for network virtualization. I believe we are now seeing a similar recognition that service meshes are here to stay and we’ll have to make sure their performance is up to the task.
A recent interview with Thomas Graf provides an interesting perspective on Cilium and its role in the cloud native world. Once again, there was a massive Internet outage–Rogers in Canada–which seems to make the case (again) that verification tools are not being adopted fast enough by operators. And it looks like you can now buy our SDN book in Chinese as well as Japanese.