If GitOps is the answer, what’s the question?
This week we are wading into the discussion around GitOps. One of the topics on our queue of potential books to write is cloud operations, and GitOps is clearly one of the solutions du jour in that area. As with so many technologies that attract strong opinions pro and con, the reality probably lies somewhere in the middle.
It’s not hard to form the impression that building and deploying cloud native systems is rapidly becoming a solved problem, with GitOps providing the roadmap. The approach revolves around the idea of configuration-as-code: making all configuration state declarative (e.g., specified in Helm charts and Terraform templates); storing these files in a code repo (e.g., GitHub); and then treating this repo as the single source of truth for building and deploying a cloud native system. Whether you patch a Python file or update a config file, the change to the repo triggers a fully automated CI/CD pipeline.
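To make the configuration-as-code idea concrete, here is a hypothetical fragment of the sort of declarative state that might be checked into such a repo (the file name, image name, and values are illustrative, not taken from any particular system). Merging a change to a file like this is what kicks off the automated pipeline:

```yaml
# values.yaml: hypothetical Helm values checked into the config repo.
# Editing this file and merging the change is what triggers the CI/CD
# pipeline to build and roll out the new configuration.
image:
  repository: example/edge-service   # illustrative image name
  tag: "1.4.2"                       # bumping this rolls out a new version
resources:
  limits:
    cpu: 500m
    memory: 256Mi
```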
Having built cloud native systems using this model (for example, see Aether, a 5G-Enabled Edge Cloud), we can attest to the power of the GitOps model: it provides a straightforward approach to the thorny problem of managing configuration state. But like any seemingly straightforward solution, there’s more to the story. As others have already pointed out, GitOps isn’t a silver bullet. In our experience, there are at least three considerations that lead us to this conclusion. All hinge on the question of whether all the state needed to operate a cloud native system can be managed entirely with a repository-based mechanism.
The first consideration is that we need to acknowledge the difference between people who develop software and people who build and operate systems using that software. DevOps (in its simplest formulation) implies there should be no distinction. In practice, developers are often far removed from operators, or more to the point, they are far removed from design decisions about exactly how others will end up using their software. For example, software is usually implemented with a particular set of use cases in mind, but it is later integrated with other software to build entirely new cloud apps that have their own set of abstractions and features, and correspondingly, their own collection of configuration state. (This is true for Aether, where the Software Defined Mobile Core was originally implemented for use in global cellular networks, but is being repurposed to support private 4G/5G in enterprises.)
While it is true that such state could be managed in its own GitHub repo, the idea of configuration management by pull request is overly simplistic. There are both low-level (implementation-centric) and high-level (application-centric) variables; in other words, it is common to have one or more layers of abstraction running on top of the base software. In the limit, it may even be an end user (e.g., an enterprise user in Aether) who wants to change this state, which implies that fine-grained access control is likely a requirement. None of this disqualifies GitOps as a way to manage such state, but it does raise the possibility that not all state is created equal: there is a range of configuration state variables, accessed at different times by different people with different skill sets, and, most importantly, needing different levels of privilege.
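To illustrate the layering (with purely hypothetical names and values), the same piece of intent might surface at two levels of abstraction: an application-centric setting that an enterprise admin is allowed to change, and the implementation-centric parameters it expands into, which only the platform team should touch:

```yaml
# Hypothetical application-centric (high-level) state, editable by an
# enterprise admin with limited privilege.
device-group:
  name: cameras
  max-uplink-bandwidth: 10Mbps

# Hypothetical implementation-centric (low-level) state it translates into,
# owned by the platform/operations team.
qos-profile:
  name: cameras-uplink
  mbr-uplink-bps: 10000000
  priority: 2
```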
The second consideration has to do with where configuration state originates. For example, the addresses assigned to the servers assembled in a cluster might originate in an organization’s inventory system. Or, in the case of a 5G service like Aether, the unique identifiers assigned to mobile devices are managed in a global subscriber database. In general, systems often have to deal with multiple (sometimes external) sources of configuration state, and knowing which copy is authoritative and which is derivative is inherently problematic. There is no single right answer, but situations like this raise the possibility that the authoritative copy of configuration state needs to be maintained apart from any single use of that state. A good place to start is to build every component under the assumption that it is not the authoritative source for any configuration parameter it uses. The idea of a “single source of truth”, while attractive, misses some of the complexity we find in real deployments.
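As a sketch of what this means in practice (again with illustrative names and addresses), a component can record where a parameter came from and treat its local copy as derivative rather than authoritative:

```yaml
# Hypothetical cluster description checked into a repo. The addresses are a
# derived copy; the authoritative record lives in the organization's
# inventory system, and a sync job (not a pull request) keeps them current.
cluster:
  name: edge-site-1
  nodes:
    - hostname: node-1
      address: 10.0.0.11   # source of truth: inventory system
    - hostname: node-2
      address: 10.0.0.12   # source of truth: inventory system
```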
The third consideration is how frequently this state changes, and hence how often a change potentially triggers a restart, or even a redeployment, of a set of containers. Paying that cost certainly makes sense for “set once” configuration parameters, but what about “runtime settable” control variables? What is the most cost-effective way to update system parameters that have the potential to change frequently? Again, this raises the possibility that not all state is created equal, and that there is a continuum of configuration state variables.
These three considerations point to the distinction between build-time configuration state and runtime control state. We emphasize, however, that the question of how to manage such state does not have a single correct answer; drawing a crisp line between “configuration” and “control” is notoriously difficult. Both the repo-based mechanism championed by GitOps and runtime control alternatives provide value, and it is a question of which is the better match for any given piece of information that needs to be maintained for a cloud to operate properly.
Runtime State vs Configuration State
Configuration state is reasonably well defined, but what do we mean by Runtime State? In general, runtime state is more dynamic. If we take the example of Kubernetes, the configuration is declared in a YAML file, but runtime state is handled by controllers that must respond quickly to events such as the failure of a pod. No one imagines that spinning up a new pod after a failure would be handled by a GitOps pull request; but the YAML file that declares how many pods should be running is an example of configuration state that sits well in the GitOps framework. In this example the distinction between configuration and runtime state is fairly obvious, but in practice it can be more of a continuum.
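A minimal Kubernetes Deployment (with illustrative names) makes the split concrete: the replicas field is configuration state that belongs in the repo, while keeping that many pods actually running in the face of failures is runtime state handled by the Kubernetes controllers:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service        # illustrative name
spec:
  replicas: 3                  # configuration state: declared in Git
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: example-service
          image: example/service:1.4.2
```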
Maintenance of runtime state requires an appropriate control mechanism (as in the preceding example). We are building such a control mechanism for Aether (described here), and without getting into the details, the central idea is to leverage a network device configuration microservice, retargeted at virtual devices (a.k.a. software services). Such mechanisms have several nice properties: (1) they use YANG as the declarative specification language, and so come with a rich toolset for defining and manipulating data models; (2) they support versioning, so state changes can be rolled forward and backward; (3) they are agnostic as to how data is made persistent, but are typically paired with a cloud native key/value store; and (4) they support role-based access control (RBAC), so different principals can be given different visibility into and control over the control/configuration parameters.
Apart from being designed to manage more dynamic runtime control state, the Control API that can be auto-generated from the YANG data model has delivered two advantages in our environment: (1) RBAC helps support the principle of least privilege, and (2) it provides an opportunity to implement early parameter validation and security checks (thereby catching errors closer to the user and generating more meaningful error messages). Effective data models have proven invaluable, a topic that we’ll return to in a later post.
So is there a single best mechanism? Almost certainly not; there is a need for both, decided on a case-by-case basis: Runtime Control maintains authoritative state for some parameters, and the code repos maintain authoritative state for others. We just need to be clear about which is which, so each backend component knows which “configuration path” it needs to be responsive to. Is there a big takeaway from all this? Only that no one said state management was going to be easy, and you should beware of anyone who claims otherwise!
While we were putting the finishing touches on this post, a couple of timely and relevant articles showed up. First, “GitOps is a Placebo”, which makes the valid points that (a) GitOps codified a lot of ideas that predated it, and (b) it’s not the one-size-fits-all solution some have claimed. There’s a lot of good background in there too. This was followed by a balanced response, “GitOps Demystified”. Both are worth a read if you want to go deeper on this topic.
In unrelated news, our first online course is out. If you want to get started understanding the Magma project that we discussed in our last newsletter, EdX has you covered.