Since our recent posts have been retracing the history (and recounting the lessons) of SDN, it seems like a good time to do the same with the other sea-change in networking over the last decade: Network Functions Virtualization (NFV). This turns out to be yet another example of the importance of tackling operations as a central aspect of cloud systems (which happens to be the topic of our latest book).
Most people trace NFV back to the call-to-action several network operators presented at the Layer123 SDN & OpenFlow World Congress in October 2012. My involvement began several months before that presentation, working with colleagues from BT, Intel, HP, Wind River, and Tail-f on the Proof-of-Concept that gave the idea enough credibility for those operators to put the NFV stake in the ground. At the time, I was at Verivue (a startup that sold CDN software to telcos), which contributed what we believed to be the canonical Virtualized Network Function (VNF)—a scalable web cache. Don Clarke, who led the entire effort on behalf of BT, has an excellent 4-part history of the journey, of which my own direct experience was limited to Part 1.
Two things stand out about the PoC. The first is the obvious technical goodness of co-locating an access network technology (such as a virtualized Broadband Gateway) and a cloud service (such as a scalable cache) on a rack of commodity servers. The PoC itself, as is often the case, was hamstrung by our having to cobble together existing software components that were developed for different environments. But it worked and that was an impressive technical accomplishment. The second is that the business minds in the room were hard at work building an ROI (return on investment) case to pair with the technical story. Lip service was given to the value of agility and improved feature velocity, but the cost-savings were quantifiable, so they received most of the attention.
Despite being widely heralded at the start, NFV has not lived up to its promise. There have been plenty of analyses as to why (for example, see here and here), but my take is pretty simple. The operators were looking at NFV as a way to run purpose-built appliances in VMs on commodity hardware, but that’s the easy part of the problem. Simply inserting a hypervisor between the appliance software and the underlying hardware might result in modest cost savings by enabling server consolidation, but it falls far short of the zero-touch management win that modern datacenter operators enjoy when deploying cloud services. In practice, telco operators still had to deal with N one-off VM configurations to operationalize N virtualized functions. The expectation that NFV would shift their operational challenge “from caring for pets to herding cattle” did not materialize; the operators were left caring for virtualized pets.
Streamlining operational processes is hard enough under the best circumstances, but the operators approached the problem with the burden of preserving their legacy O&M practices (i.e., they were actively avoiding changes that would enable streamlining). In essence, the operators set out to build a “Telco Cloud” through piecemeal adoption of cloud technologies (starting with hypervisors). As it turned out, however, NFV set in motion a second track of activity that is now resulting in a “Cloud-based Telco”. Let me explain the difference.
Looking at the NFV PoC with the benefit of hindsight, it’s clear that standing up a small cluster of servers to demo a couple of VNFs side-stepped the real challenge, which is to repeat that process over time, for arbitrarily many VNFs. This is the problem of continuously integrating, continuously deploying, and continuously orchestrating cloud workloads, which has spurred the development of a rich set of cloud native tools like Kubernetes, Helm, and Terraform. Such tools weren’t generally available in 2012 (although they were emerging inside hyperscalers), and so the operators behind the NFV initiative started down a path of (a) setting up an ETSI-hosted standardization effort to catalyze the development of VNFs, and (b) retrofitting their existing O&M mechanisms to support this new collection of VNFs. Without evaluating the NFV reference architecture point-by-point, it seems fair to say that wrapping a VNF in an Element Management System (EMS), as though it were another device-based appliance, is a perfect example of how such an approach does not scale operations.
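To make the contrast concrete, here is a minimal sketch of what declarative, zero-touch deployment looks like with the tooling that emerged after 2012. The operator declares the desired state once (say, three replicas of a web cache) and Kubernetes continuously reconciles it, handling placement, restarts, and scaling—as opposed to maintaining N one-off VM configurations wrapped in an EMS. The names and image below are illustrative, not drawn from any actual NFV deployment:

```yaml
# Hypothetical example: a cloud native VNF declared as a Kubernetes Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-cache        # illustrative name
spec:
  replicas: 3            # desired state; the platform reconciles toward it
  selector:
    matchLabels:
      app: web-cache
  template:
    metadata:
      labels:
        app: web-cache
    spec:
      containers:
      - name: cache
        image: example.com/web-cache:1.0   # hypothetical image
        ports:
        - containerPort: 8080
```

The point is not the specific manifest, but that adding an (N+1)-st function to such a platform is a matter of another declaration, not another bespoke operational procedure.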
Meanwhile, the laudable goal of running virtualized functions on commodity hardware inspired a parallel effort that existed entirely outside the ETSI standardization process: to build cloud native implementations of access network technologies, which could then run side-by-side with other cloud native workloads. This parallel track, which came to be known as CORD (Central Office Re-architected as a Datacenter), ultimately led to Kubernetes-based implementations of multiple access technologies (e.g., PON/GPON and RAN). These access networks run as microservices that can be deployed by a Helm chart on your favorite Kubernetes platform, typically running at the edge (e.g., Aether).
Again, with the benefit of hindsight, it’s interesting to go back to the two main arguments for NFV—lower costs and improved agility—and see how they have been overtaken by events. On the cost front, it’s clear that solving the operational challenge was a prerequisite for realizing any CAPEX savings. What the cloud native experience teaches us is that a well-defined CI/CD toolchain and the means to easily extend the management plane to incorporate new services over time is the price of admission to take advantage of cloud economics.
On the agility front, NFV’s approach was to support service chaining, a mechanism that allows customers to customize their connectivity by “chaining together” a sequence of VNFs. Since VNFs run in VMs, it seemed plausible, in theory, that one could programmatically interconnect a sequence of them. In practice, providing a general-purpose service chaining mechanism proved elusive. This is because customizing functionality is a hard problem in general, but starting with the wrong abstractions (a “bump-in-the-wire” model based on an antiquated device-centric worldview) makes it intractable. It simply doesn’t align with the realities of building cloud services. The canonical CDN VNF is a great example. HTTP requests are not tunneled through a cache because it was (virtually or physically) wired into the end-to-end chain; instead, a completely separate Request Redirection service sitting outside the data path dynamically directs HTTP GET messages to the nearest cache. (Ironically, this was true during the PoC, since the Verivue CDN was actually container-based and built according to cloud native principles, even though it pre-dated Kubernetes.) A firewall is another example: in a device-centric world, a firewall is a “middlebox” that might be inserted in a service chain, but in the cloud, equivalent access-control functionality is distributed across the virtual and physical switches.
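The out-of-path redirection pattern can be sketched in a few lines. This is a hypothetical illustration (not the Verivue implementation, and the cache names and coordinates are made up): the redirection service never touches the data path; it simply answers each request with the URL of the nearest cache, the way an HTTP 302 response would, with distance in abstract network coordinates standing in for measured latency:

```python
# Sketch of a request-redirection service that sits outside the data path.
# Instead of tunneling traffic through a chained middlebox, it picks the
# nearest cache and returns a redirect target (as in an HTTP 302 Location
# header). All names and coordinates are hypothetical.
from dataclasses import dataclass
from math import hypot

@dataclass
class Cache:
    url: str
    x: float  # abstract network coordinates (a stand-in for latency)
    y: float

CACHES = [
    Cache("http://cache-east.example.com", 1.0, 1.0),
    Cache("http://cache-west.example.com", 9.0, 2.0),
]

def redirect(client_x: float, client_y: float, path: str) -> str:
    """Return the redirect target: the requested path on the closest cache."""
    nearest = min(CACHES, key=lambda c: hypot(c.x - client_x, c.y - client_y))
    return f"{nearest.url}{path}"
```

A client near (0, 0) would be sent to the east cache, one near (10, 2) to the west cache; the caches themselves are ordinary cloud workloads, with no wiring into anyone's end-to-end chain.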
When we look at the service agility problem through the lens of current technology, a service mesh provides a better conceptual model for rapidly offering customers new functionality, with Connectivity-as-a-Service proving to be yet another cloud service. But the bigger systems lesson of NFV is that operations need to be treated as a first-class property of a cloud. The limited impact of NFV can be directly traced to the reluctance of its proponents to refactor their operational model from the outset.
Thanks to all the folks who submitted PRs or opened issues on our TCP Congestion Control book in recent weeks. We’re asymptoting towards a first print and eBook version, but you can still suggest improvements on GitHub. We also managed to get out a blog post about the opportunity to decentralize cellular networks using Magma and the Helium blockchain. And Bruce saw some pademelons while bushwalking in Tasmania.