It was a milestone week at Systems Approach as we published TCP Congestion Control: A Systems Approach in both print and ebook formats. You should be able to order it at almost any bookseller thanks to the wonders of modern print-on-demand technology. Or you can just go and read the web version, which we continue to update as technology changes and we find bugs or receive PRs on GitHub. Thanks to all those who contributed to this version.
Since publishing our article and video on APIs, I’ve talked with a few people about the topic, and one aspect that keeps coming up is the importance of API security. In particular, I hear the term “zero trust” increasingly being applied to APIs, which led to the idea for this post. At the same time, I’ve also noticed what might be called a zero trust backlash, as it becomes apparent that you can’t wave a zero trust wand and instantly solve all your security concerns.
Zero trust has been on my radar for almost a decade, as it was part of the environment that enabled network virtualization to take off. We’ve told that story briefly in our SDN book–the rise of microsegmentation as a widespread use case was arguably the critical step that took network virtualization from a niche technology to the mainstream. In fact, the term goes back at least to 2009, when it was coined by Forrester analyst John Kindervag, and it is possible to draw a line from there back to the principle of least privilege as framed by Saltzer and Schroeder in 1975. That principle states:
“Every program and every user of the system should operate using the least set of privileges necessary to complete the job.”
Whereas the Internet was designed following another of Saltzer’s principles–the end-to-end argument, which he formulated with David Clark and David Reed–least privilege didn’t really make it into the Internet architecture. In fact, as David Clark pointed out some 20 years after the end-to-end paper, he and his co-authors assumed that end-systems were willing participants in achieving correct behavior, an assumption that no longer holds true. The goal of the early Internet was to interconnect a handful of computing systems running in research labs (initially in the U.S.); today, by contrast, a substantial subset of the end-systems connected to the Internet are actively trying to harm other systems–inserting malware, launching DoS attacks, extracting sensitive information, and so on. The last 20+ years of networking have seen an ever-expanding set of attempts to deal with the lack of security in the original Internet.
For me, the easiest way to conceptualize zero trust is to consider what it is not. Perimeter-based security (as provided by perimeter firewalls, for example) is a good counterexample. The idea of a firewall is that there is an inside and an outside, with systems on the inside being “trusted” and those outside being “untrusted”. This division of the world into trusted and untrusted regions fails both the principle of least privilege and the definition of zero trust. Traditionally, a device on the inside of a firewall is trusted to access lots of other devices that are also inside, just by virtue of its location. That is a lot more privilege than the device needs to do its job, and it runs contrary to this description of zero trust provided by NIST:
“Zero trust…became the term used to describe various cybersecurity solutions that moved security away from implied trust based on network location and instead focused on evaluating trust on a per-transaction basis.”
(As someone who has been involved in plenty of documents produced by committees, I have to say that the NIST Zero Trust Architecture is remarkably clear and well written.)
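To make the distinction concrete, here is a toy sketch of the decision each model makes. All the names below are invented for illustration–this is not any vendor’s API. The perimeter model grants access based on location alone; the zero trust model evaluates each request against an explicit, narrow policy:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    src_zone: str            # where the request originates
    identity: Optional[str]  # authenticated principal, if any
    resource: str
    action: str

def perimeter_allows(req: Request) -> bool:
    # Zone-based trust: being "inside" is all it takes.
    return req.src_zone == "inside"

# Zero trust: explicit, narrow grants; everything else is denied.
POLICY = {("billing-svc", "invoices", "read")}

def zero_trust_allows(req: Request) -> bool:
    if req.identity is None:  # no authenticated identity, no access
        return False
    return (req.identity, req.resource, req.action) in POLICY

req = Request(src_zone="inside", identity=None, resource="invoices", action="read")
print(perimeter_allows(req))   # True: location alone confers privilege
print(zero_trust_allows(req))  # False: each request must be evaluated on its own
```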
VPNs are another example of an approach to security that fails to meet this definition, because, even though modern VPN technology lets you connect to a corporate network from anywhere, it still creates the sense of an inside that is trusted and an outside that is not. The Colonial Pipeline ransomware attack is an example of a VPN compromise with dire consequences because of the broad range of systems that were reachable once the attacker was “inside” the VPN.
My theory about the occasional backlash that I’ve seen around zero trust has two parts. First, the name is an oversimplification of what’s going on. It’s not that you literally trust nothing; rather, trust is not assumed just because of a device’s (or user’s) location, nor does an entity gain wide access to resources just because it was able to authenticate itself for a single purpose. So “zero trust” might be better termed “narrow and specific trust after authentication”, but that’s not very catchy.
Second, there is a lot of work to be done to actually implement zero trust comprehensively. So while a vendor might say “my product/solution lets you implement zero trust”, the reality is that a comprehensive zero trust implementation has a lot of moving parts, which are unlikely to be covered by one or two products.
When we were developing microsegmentation as part of our network virtualization solution at VMware, we were quick to point out that it helped with zero trust implementation by allowing fine-grained firewalling of east-west traffic. Distributed firewalls enabled us to move beyond zone-based trust (as provided by traditional firewalls) to an approach where an operator could specify precise rules for communication between any pair of VMs, with the default being that no VM could communicate with any other VM. That default, applied even to VMs sitting in the same zone (as a traditional firewall would see it), was what enabled us to claim a “zero trust” approach. While that was quite a breakthrough in 2014, the granularity of control is limited by what is visible to the distributed firewall, so it doesn’t really achieve the “per-transaction” evaluation of trust described above. If communication between applications is encrypted (as it should be in many if not most cases), then the finest granularity at which the firewall can operate is the TCP port, with no deeper visibility into the transactions taking place.
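To see what that step bought us, here is a toy model of default-deny microsegmentation. This is a sketch in the spirit of a distributed firewall policy, not the actual NSX rule language; the VM names and rule format are mine:

```python
# Explicit allow rules between VM pairs; anything not listed is denied.
ALLOW_RULES = [
    # (source VM, destination VM, destination TCP port)
    ("web-vm", "app-vm", 8443),
    ("app-vm", "db-vm",  5432),
]

def connection_allowed(src_vm: str, dst_vm: str, dst_port: int) -> bool:
    # Default deny: even VMs in the same "zone" cannot talk unless an
    # explicit rule permits this exact pair and port.
    return (src_vm, dst_vm, dst_port) in ALLOW_RULES

print(connection_allowed("web-vm", "db-vm", 5432))  # False: no direct path allowed
print(connection_allowed("app-vm", "db-vm", 5432))  # True: explicitly permitted
```

Note that the rules bottom out at the port: the firewall can say which VMs may open a connection, but not which operations may flow across it.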
Which brings us to securing APIs. As we discussed in our earlier post, the unit of infrastructure is no longer the server or the VM but the service (or microservice) and so the API to the service becomes the point of security enforcement. This is why we see things like API Gateways and service meshes becoming increasingly important: we need new classes of tools to manage the security of APIs, providing fine-grained control over exactly which API requests can be executed by whom. This is nicely explained in the video that introduced me to both service meshes and the Cilium project.
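As a sketch of what per-transaction evaluation looks like at this layer, consider the check an API gateway or sidecar proxy might perform on every request. The policy format, token values, and helper names below are my own inventions for illustration, not the API of any particular gateway or mesh:

```python
from typing import Optional

# Which principals may invoke which API operations.
API_POLICY = {
    ("orders-svc",  "GET",  "/inventory/items"),
    ("billing-svc", "POST", "/payments"),
}

def verify_identity(token: str) -> Optional[str]:
    # Stand-in for real verification, e.g. validating a JWT or the
    # mTLS client certificate presented by the calling service.
    return {"token-orders": "orders-svc", "token-billing": "billing-svc"}.get(token)

def authorize_request(token: str, method: str, path: str) -> bool:
    principal = verify_identity(token)
    if principal is None:
        return False  # unauthenticated requests never pass
    return (principal, method, path) in API_POLICY

print(authorize_request("token-orders", "GET",  "/inventory/items"))  # True
print(authorize_request("token-orders", "POST", "/payments"))         # False
```

The point is that the unit of authorization is the individual API operation, not the host, the VM, or the port.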
A final observation is that we are now reaping the rewards of the SDN architectural approach that combines central control with distributed data planes. Like many networking people of a certain era, I grew up learning that the end-to-end argument was the basis for all good architecture, and I was unimpressed with the rise of firewalls and other “middle-boxes” because they clearly didn’t adhere to the end-to-end principle. But over time I came to realize that firewalls were appealing because they offered a central point of control, and that was important for those operators who needed to secure the network after the fact. What we saw with the rise of distributed firewalling, and SDN more broadly, was that we could have centralized control (with the benefits that provides for operators) and a distributed implementation that pushed the necessary security functions closer to the end points, where they were more effective. Service meshes are the next step in that journey: effectively SDN for a world where APIs are the primary form of communication.
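In skeleton form, the pattern looks something like the following–intent is expressed once at a central controller, and enforcement happens at many local points. All the class and method names here are hypothetical:

```python
class Enforcer:
    """Runs next to the workload and makes the local allow/deny decision."""
    def __init__(self):
        self.rules = set()

    def install(self, policy):
        self.rules = set(policy)

    def permits(self, src, dst, port):
        return (src, dst, port) in self.rules

class Controller:
    """Central point of control: holds operator intent, pushes it outward."""
    def __init__(self):
        self.policy = set()
        self.enforcers = []

    def register(self, enforcer):
        self.enforcers.append(enforcer)
        enforcer.install(self.policy)

    def allow(self, src, dst, port):
        # Operators express intent once, centrally...
        self.policy.add((src, dst, port))
        for e in self.enforcers:  # ...and it is pushed to every data plane.
            e.install(self.policy)

ctrl = Controller()
edge = Enforcer()
ctrl.register(edge)
ctrl.allow("app-vm", "db-vm", 5432)
print(edge.permits("app-vm", "db-vm", 5432))  # True: set centrally, decided locally
```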
It’s been hard to miss the news that Broadcom is likely to acquire VMware. There is plenty of analysis floating around, and I really like this take from my former colleague Jared Rosoff on the distinction between being a “hardware company” (that happens to replicate hardware in software) versus being a platform company. I do worry a bit about the folks at the Wall Street Journal who just noticed the importance of virtualization. From the Systems Approach perspective, both companies have had a substantial impact on networking in areas that include bare-metal switches and SDN, so we will be watching this space with interest.
Great read, Bruce. This has me considering Service Mesh further from a non-container perspective. While Service Meshes have come to light as a result of Kubernetes, the concept predates the mass adoption of containers. I love the idea that the VM is no longer the focal point of application development. The challenge is that the VM is still the focal point for application security for legacy workloads. Solutions such as HashiCorp Consul and Gloo are great indications of where we need to go to abstract services from the underlying VM infrastructure. However, I sense there’s a lot of cultural and process change that has to happen before a transition can occur.
Great read.
Hey Bruce, Systems Approach is a fantastic substack. Thanks to you and Larry for creating and sharing the content. Zero trust is huge, in particular, as we embark on this hybrid remote/office journey. Your refined definition of zero trust as "..narrow and specific trust after authentication.." is spot on. When I looked into this a few years back, I came across the Google BeyondCorp solution (https://cloud.google.com/beyondcorp). It appears to align with several points raised in your article, namely no VPN, centralized control and distributed deployment of proxy/rules/PEP entities, and user/device authentication instead of network attachment. Google has the footprint and has enlisted the security vendor heavyweights in the BeyondCorp Alliance (https://cloud.google.com/blog/products/identity-security/google-cloud-announces-new-partners-in-its-beyondcorp-alliance). Curious about your thoughts regarding BeyondCorp as both a technical and commercial solution. Fascinating area!