Applying Cloud Native Principles to Wireless Networks
As part of my work on the Magma project, I hear from the team about the challenges they face in getting network operators to understand how Magma differs from existing commercial and open source approaches to building mobile wireless networks. One persistent area of confusion is what it means for a mobile solution to be “cloud native”, a term that is quite loaded in any context. In this post, co-authored with Magma project leader Amar Padmanabhan, we look at how the Magma team applied cloud native principles to the mobile core architecture.
Discussions about mobile and wireless networking seem to attract buzzwords, especially with the transition to 5G. And hence we see a flurry of “cloudification” of mobile networking equipment—think containers, microservices, and control and user plane separation (CUPS). But making an architecture cloud native is more than just an application of buzzwords: it entails a number of principles related to scale, tolerance of failure, and operational models. And indeed it doesn’t really matter what you call the architecture; what matters is how well it works in production. In this post we have tried to articulate some of the defining traits that have guided the development of Magma, in pursuit of a cost-effective and easy-to-operate solution for mobile networks in less well-developed areas.
Commodity hardware
Most traditional networking equipment is proprietary, bundling the software with precisely specified and configured hardware. Magma, like most cloud native systems, instead leverages low-cost, commodity hardware. Performance comes from scale-out approaches, and reliability from software techniques that tolerate the failure of unreliable hardware. Scale-out and planning for failure are themselves key tenets of cloud native architecture, as discussed below.
From its inception, Magma has been designed to be easy to operate in diverse environments on commodity hardware. Any component can be replaced with minimal cost and disruption to the network.
Scale-out rather than scale-up
Cloud native systems typically achieve scale by horizontally adding more commodity devices, rather than scaling up the capacity of individual, monolithic systems. Magma is based on a distributed architecture that scales out. Capacity is increased by adding small devices throughout the network. For example, networks can deploy hundreds of access gateways around radio towers instead of adding a few large boxes to the core, as is common in traditional EPC designs. This distributed architecture is also important when it comes to our next point, designing for failure.
Small fault domains
In any cloud system, it is expected that individual components will fail, and failure is treated as a common part of the operational flow of the system. Many of Magma’s design decisions flow from that premise. In traditional telco architectures, by contrast, failures are assumed to be rare and are handled through specific exception paths, like hot standbys and fully redundant services.
A failure should impact as few users as possible (i.e., fault domains must be small) and should not affect other components. For example, a failure in a small access gateway might affect only a few hundred customers. Conversely, if a network built upon two large cores has a failure in one of its cores, half of its customers may lose service.
It’s not sufficient to just divide a large monolith into smaller components; you also need to localize state within components to limit the impact of failures. Magma does this by localizing the state associated with any given user equipment (UE) to a single access gateway. Thus, the impact of a component failure is contained: only the UEs served by that access gateway are affected. The access gateway is the location for per-UE “runtime state,” which depends on events such as a UE powering on or moving into the coverage area of a new base station. By contrast, UE runtime state tends to be spread among components in traditional 3GPP implementations.
While runtime state is localized to the relevant access gateway, configuration state is stored centrally in the Magma orchestrator, because configuration is a network-wide property provided through the central API. If a component of the orchestrator fails, the failure only prevents configuration updates; it does not affect runtime state, so UEs continue to operate even while the orchestrator is restarting.
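To make the split concrete, here is a minimal Go sketch of the idea: per-UE runtime state lives only on the access gateway serving that UE, while network-wide configuration lives centrally. The type and field names are illustrative placeholders, not Magma’s actual data structures.

```go
package main

import "fmt"

// UEState is the per-UE runtime state an access gateway tracks locally
// (illustrative fields only, not Magma's actual data structures).
type UEState struct {
	IMSI       string
	Attached   bool
	ServingENB string // base station currently serving this UE
}

// AccessGateway holds runtime state only for the UEs it serves, so a
// gateway failure affects only those UEs.
type AccessGateway struct {
	ID  string
	UEs map[string]*UEState
}

// NetworkConfig is the network-wide configuration kept centrally by the
// orchestrator; losing the orchestrator delays configuration updates but
// does not disturb per-UE runtime state held on the gateways.
type NetworkConfig struct {
	NetworkID string
	APN       string
	DNSServer string
}

func main() {
	gw := &AccessGateway{ID: "agw-17", UEs: map[string]*UEState{}}

	// A UE attaching creates runtime state on this gateway only.
	gw.UEs["001010000000001"] = &UEState{
		IMSI:       "001010000000001",
		Attached:   true,
		ServingENB: "enb-3",
	}

	cfg := NetworkConfig{NetworkID: "rural-net", APN: "internet", DNSServer: "8.8.8.8"}
	fmt.Printf("gateway %s serves %d UEs in network %q\n", gw.ID, len(gw.UEs), cfg.NetworkID)
}
```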
Simplified operations
The scalability of cloud native systems applies as much to operations as to performance. Centralized control planes, such as those found in software-defined networks (SDN), have emerged as a way to simplify the operations of networks. Indeed, Magma was influenced by the experience of building Nicira’s SDN system. Whereas centralized control was once viewed as an unacceptable single point of failure or a scaling bottleneck, it is now well understood that reliable, logically centralized controllers can be built from a collection of commodity servers. It’s much simpler to operate a whole network when it is viewed from a central point of control, rather than determining how to configure each network device individually.
The logically centralized point of control in Magma is the orchestrator, and it loosely corresponds to a controller in an SDN system. It is implemented on a set of machines (typically three), any one of which may fail without bringing down the orchestrator. This set of machines exposes a single API from which the network as a whole may be configured and monitored. As the size of the network increases and more access gateways are added, the operator maintains a single, centralized view of the network.
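As a rough illustration of what a single network-wide API means in practice, here is a hedged sketch: a client registers a new access gateway by posting to one orchestrator endpoint, no matter how many gateways already exist. The URL, path, and JSON fields are hypothetical placeholders, not the actual Magma API.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical endpoint: the orchestrator exposes one API for the whole
	// network, so registering the 500th gateway looks the same as the 5th.
	url := "https://orc8r.example.net/networks/rural-net/gateways"
	body := []byte(`{"id": "agw-201", "description": "new tower site"}`)

	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		// Expected when run outside a real deployment; this is only a sketch.
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("registered gateway, status:", resp.Status)
}
```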
Access gateways represent a distributed data plane implementation and also contain local control-plane components. The federation gateway implements a set of standard protocol interfaces to allow a Magma-enabled system to interoperate with standard cellular networks.
Like SDN systems, Magma separates the control plane and data plane. This separation is essential to the simplified operational model, but it also affects reliability. Failure of a control-plane element does not cause the data plane to fail, although it may prevent new data-plane state from being created (e.g., bringing a new UE online). This is a more complete separation than that provided by the CUPS specification of 3GPP, as we’ve discussed previously.
Magma’s data plane, running in the access gateways, is software-controlled and programmed through well-defined and stable interfaces (similar in principle to OpenFlow) that are independent of hardware. Again, this is the same approach as taken in SDN and enables the data plane to leverage commodity hardware and easily evolve over time.
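A minimal sketch of what a match-action, OpenFlow-like programming interface might look like, assuming a simple rule-table abstraction; the types and action strings below are illustrative, not Magma’s actual data-plane interface.

```go
package main

import "fmt"

// FlowRule is an OpenFlow-style match-action entry. The fields and action
// strings are illustrative, not Magma's actual data-plane interface.
type FlowRule struct {
	Priority int
	MatchUE  string // e.g., match traffic belonging to a particular UE
	Action   string // e.g., "forward", "drop", "meter:1Mbps"
}

// DataPlane is a hardware-independent rule table programmed by the local
// control plane; changing the underlying hardware or switch implementation
// does not change this interface.
type DataPlane struct {
	rules []FlowRule
}

// Install adds a rule to the table, as a controller would via a
// well-defined southbound interface.
func (dp *DataPlane) Install(r FlowRule) {
	dp.rules = append(dp.rules, r)
	fmt.Printf("installed rule: prio=%d match=%s action=%s\n", r.Priority, r.MatchUE, r.Action)
}

func main() {
	dp := &DataPlane{}
	// The control plane decides policy; the data plane just applies rules.
	dp.Install(FlowRule{Priority: 10, MatchUE: "ue-001", Action: "forward"})
	dp.Install(FlowRule{Priority: 5, MatchUE: "ue-002", Action: "meter:1Mbps"})
}
```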
Desired state model
Similar to many cloud native systems (like Kubernetes), Magma uses a desired state configuration model. APIs allow users or other software systems to configure the intended state, while the control plane is responsible for ensuring that this state is realized. The control plane takes on the responsibility of mapping from the operator’s intended state to the actual implementation in the access gateways, reducing the operational complexity of managing a large network.
The desired state model has proven effective in other cloud native contexts because it enables components to simply compare the current state with the desired state, and then make adjustments as needed. For example, if the desired state is that two sessions are active, the control plane monitors the system to ensure that this is the case and takes action to activate a session if one is missing. (Here is Joe Beda with my favourite description of how desired state works in Kubernetes.) By contrast, traditional 3GPP implementations have used a “CRUD” (create, read, update, delete) model, in which a sequence of actions (create new session or update session, for example) determines the state. In the event of message loss or a component failure, it becomes hard to determine whether the current state of a component is correct.
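A minimal reconciliation loop captures the essence of this model: compute the difference between desired and observed state and act on it, so correctness does not depend on having seen every individual create/update/delete event. The sketch below uses a toy “session” abstraction; none of the names correspond to Magma’s actual code.

```go
package main

import "fmt"

// State is a toy representation of "which sessions are active"; the names
// here are illustrative, not Magma's actual APIs.
type State struct {
	ActiveSessions map[string]bool
}

// reconcile compares desired and current state and acts on the difference:
// it activates sessions that should exist but don't, and tears down
// sessions that exist but shouldn't.
func reconcile(desired, current State) {
	for id := range desired.ActiveSessions {
		if !current.ActiveSessions[id] {
			fmt.Println("activating missing session:", id)
			current.ActiveSessions[id] = true
		}
	}
	for id := range current.ActiveSessions {
		if !desired.ActiveSessions[id] {
			fmt.Println("tearing down stale session:", id)
			delete(current.ActiveSessions, id)
		}
	}
}

func main() {
	desired := State{ActiveSessions: map[string]bool{"s1": true, "s2": true}}
	current := State{ActiveSessions: map[string]bool{"s1": true}}

	// After a message loss or restart, the loop converges on the desired
	// state rather than depending on having observed every CRUD event.
	reconcile(desired, current)
}
```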
While many of these design decisions might seem obvious, they are fairly different from the principles of the standard 3GPP architecture. For example, even though 3GPP has the concept of CUPS, control plane elements often hold some user plane (data plane) state. A proper separation of control and data planes leads to a more robust architecture with better upgradability.
Ultimately, it doesn’t matter what we call the architecture. What matters is how well it scales, how it handles failures, and how easy it is to operate. By adopting technologies and principles from cloud services, Magma delivers a mobile network solution that is robust, leverages commodity hardware, scales from small to large deployments gracefully, and offers a central point of control for operational simplicity. We’re gathering operational data as I write this; our goal is to quantify these claims in a later post.
With our new printing (and ebook) of “Software-Defined Networking: A Systems Approach” out of the way, last week we moved our attention back to the draft of “TCP Congestion Control: A Systems Approach”. The week’s big effort was on BBR (Bottleneck Bandwidth and RTT), which builds on much of the TCP (and Bufferbloat) research of the last 25 years and continues to undergo active research and development. We’d love to hear from our readers about areas that need further work or topics we might have missed. Coming up, Bruce will be presenting at the 4th P4 Workshop in Europe, and it’s likely to involve IPUs and SmartNICs.