Observability Joins the List of Essential System Properties
An exciting development in the Systems Approach world this week is the announcement that the Japanese version of our SDN book is now available to pre-order. This has been a huge effort from our colleagues Motonori Shindo, Kentaro Ebisawa, and Masayuki Kobayashi. This is not just a translation, but includes content specific to the Japanese market, where SDN has been particularly successful. Meanwhile, we’ve been hard at work putting finishing touches on our Edge Cloud Operations book, which provided the inspiration for this week’s post.
While working to complete the Monitoring and Telemetry chapter of our Edge Cloud Operations book, I couldn’t help but notice all the hype on the Internet about observability, and especially how many sites were eager to explain to me why observability is so much more than monitoring. The message is clear: anyone who is satisfied with just monitoring their cloud services will miss the boat. But what isn’t clear is exactly why.
None of those sites described observability as another of the set of "-ities" (qualities) that all good systems aspire to—alongside scalability, reliability, availability, security, usability, and so on—but that’s where I would start. Observability is simply the property of a system that makes visible the facts about its internal operation needed to make informed management and control decisions. That leads to the obvious question: How? What are the best practices and techniques for achieving observability?
At a mechanistic level, the answer is straightforward: instrument your code to produce useful facts that are then collected, aggregated, stored, and made available (via query) for display and analysis. Where the problem starts to become interesting is in answering the question: What data is useful? Again, the answer seems to be well-understood, at least at a high level, with three types of telemetry data typically identified: metrics, logs, and traces. (I would add a fourth—being able to query the flow-level state in the network data plane using Inband Network Telemetry (INT)—which I’ll return to at the end.)
Metrics are quantitative data about a system. These include common performance metrics such as link bandwidth, CPU utilization, and memory usage, but also binary results corresponding to "up" and "down". These values are produced and collected periodically (e.g., every few seconds), either by reading a counter or by executing a runtime test that returns a value. These metrics can be associated with physical resources such as servers and switches, virtual resources such as VMs and containers, or even end-to-end cloud services.
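To make this concrete, here is a minimal sketch of metric instrumentation using the Prometheus Python client (one common choice; the metric names, port, and simulated measurement are invented for illustration). The service counts requests and exposes a gauge, and a collector scrapes the values every few seconds.

```python
# Minimal sketch of metric instrumentation using the Prometheus Python client.
# Metric names, the port, and the simulated measurement are illustrative only.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUESTS = Counter("myservice_requests_total", "Total requests handled")
QUEUE_DEPTH = Gauge("myservice_queue_depth", "Current depth of the work queue")

def handle_request():
    REQUESTS.inc()                           # quantitative fact: one more request
    QUEUE_DEPTH.set(random.randint(0, 10))   # stand-in for a real measurement

if __name__ == "__main__":
    start_http_server(8000)   # a collector scrapes http://localhost:8000/metrics periodically
    while True:
        handle_request()
        time.sleep(1)
```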
Logs are the qualitative data generated whenever a noteworthy event occurs. This information can be used to identify problematic operating conditions (i.e., it may trigger an alert), but more commonly, it is used to troubleshoot problems after they have been detected. Various system components—all the way from the low-level OS kernel to high-level cloud services—write messages to the log that adhere to a well-defined format. These messages include a timestamp, which makes it possible for the logging stack to parse and correlate messages from different components.
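At the component level, this can be as simple as the following sketch, which uses Python's standard logging module (the component name and messages are invented; the format string is just one reasonable choice that puts a parseable timestamp in front of every message).

```python
# Minimal sketch of event logging in a format a logging stack can parse and correlate.
import logging

logging.basicConfig(
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    level=logging.INFO,
)
log = logging.getLogger("connectivity-checker")   # hypothetical component name

def check_upstream(ok: bool) -> None:
    if ok:
        log.info("upstream reachable")
    else:
        # A noteworthy event: it may trigger an alert, and it is kept around
        # to help troubleshoot the problem after it has been detected.
        log.error("upstream unreachable; last probe timed out")

check_upstream(False)
# Prints something like:
# 2024-05-01 12:00:00,123 connectivity-checker ERROR upstream unreachable; last probe timed out
```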
Traces are a record of causal relationships (e.g., Service A calls Service B) resulting from user-initiated transactions or jobs. They are a form of event logs, but provide more specialized information about the context in which different events happen. Tracing is well understood in a single program, but in a cloud setting, a trace is inherently distributed across a graph of network-connected microservices. This makes the problem challenging, but also critically important because it is often the case that the only way to understand time-dependent phenomena—such as why a particular resource is overloaded—is to understand how multiple independent workflows interact with each other.
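Here is a minimal sketch of what that looks like using the OpenTelemetry Python SDK (the service and span names are invented). In a single process the parent/child relationship is captured automatically; in a real deployment the same trace context is propagated across service boundaries, typically in request headers.

```python
# Minimal tracing sketch using the OpenTelemetry Python SDK.
# In a real deployment the two "services" would be separate processes and the
# trace context would travel across the network (e.g., in HTTP headers).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-example")

def service_b():
    with tracer.start_as_current_span("service-b.charge-card"):
        pass   # do the work; the span records timing and its parent span

def service_a():
    with tracer.start_as_current_span("service-a.checkout"):
        service_b()   # the causal relationship: Service A calls Service B

service_a()   # prints both spans, linked by a shared trace ID and parent ID
```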
For each of these types of telemetry data, the central challenge is to define a meaningful data model, so there is agreement across the many components that go into an end-to-end solution. Now we’ve dug down to the essential technical problem, and while there may one day be universal agreement about those data models—active open source efforts like OpenTelemetry are working toward such definitions—as a practical matter, cloud services are built from components pulled from many sources, each of which has adopted its own instrumentation practices. So today the state of the art is to write filters that translate one format into another (and to hope that components outside your control are well instrumented).
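As a small, entirely invented illustration of what such a filter looks like, the function below maps one component's ad hoc log record onto a common data model that the rest of the pipeline agrees on; the field names on both sides are hypothetical.

```python
# Illustrative-only translation filter: map a vendor-specific record onto a
# shared data model. Both schemas are hypothetical; real pipelines end up with
# one such adapter per component whose format you don't control.
from datetime import datetime, timezone

def to_common_model(vendor_record: dict) -> dict:
    """Translate one component's ad hoc log record into our common format."""
    return {
        "timestamp": datetime.fromtimestamp(
            vendor_record["ts_epoch_ms"] / 1000, tz=timezone.utc
        ).isoformat(),
        "severity": vendor_record.get("lvl", "info").upper(),
        "component": vendor_record.get("src", "unknown"),
        "message": vendor_record["msg"],
    }

print(to_common_model(
    {"ts_epoch_ms": 1700000000000, "lvl": "warn", "src": "edge-gw", "msg": "queue depth high"}
))
```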
If all of this sounds obvious, it is only so in retrospect. Hype about the latest hot topic is largely an exercise in creating the illusion of differentiation and uniqueness. There is little value in finding commonality. Returning to the question of monitoring vs observability, how you answer it depends on how you define terms. My view is that there are two broad use cases for telemetry data: (1) proactively watching for warning signs of trouble (e.g., attacks, bugs, failures, overload conditions) in a steady-state system; and (2) reactively taking a closer look to determine the root cause and resolve an issue (e.g., fix a bug, optimize performance, provision more resources, defend against an attack) once a potential problem is flagged. I tend to refer to the overall problem space as “monitoring”, but if we call the first use case monitoring and the second use case troubleshooting, then no one would suggest that monitoring without troubleshooting is a viable approach. Observability is simply a prerequisite for both.
One thing that does seem to be happening under the “observability banner” is a recognition that “always-on” tracing is an essential part of troubleshooting. This should not come as a surprise; a paper describing Google’s experience with the Dapper tracing tool dates back to 2010, and Microsoft described their Sherlock tool for debugging enterprise networks in a 2007 SIGCOMM paper. Tracing requests through a microservice architecture is in many ways the cloud’s version of profiling, a practice with a long history in both debugging and performance tuning. It is certainly the case that diagnostic scenarios that benefit from tracing and profiling are among the thorniest a DevOps team will ever encounter (he said from experience).
Having access to the right data is essential, and every attempt should be made to achieve observability, but it is also important to acknowledge that troubleshooting is an inherently human-intensive and context-dependent process. You often need to be an expert to know what questions to ask, which means it’s difficult to fully “platform-tize” the problem away. I suspect this also contributes to the reluctance of some developers to embrace a one-size-fits-all observability framework, as discussed in a recent tweet thread. If we’re going to try to convince developers to always use the best possible tools, we might just as well start by asking them to program in a strongly typed language like Standard ML or Haskell. Not going to happen. An equally sticky problem is that component developers are motivated to make sure their components are correct (get past the “it runs on my box” stage), but have less stake in end-to-end operational behavior, especially when it’s someone else’s deployment. There is a crack in the DevOps storyline, where incentives are not always aligned.
So beyond best-effort observability, what are our options? A couple of related ideas come to mind. One is service meshes, which are increasingly being positioned as a tool for observability. A large part of the appeal of service meshes is that they sit close enough to an application to see in detail what is coming in and out of it (e.g., API requests and responses) without requiring the application developer to insert additional code to achieve that visibility. Examples of this can be found in Istio and the recently announced Tetragon project.
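The core idea, observing the application's traffic without touching the application, can be sketched in a few lines. The toy sidecar proxy below is my own illustration of the pattern (it is not how Istio's Envoy sidecars or Tetragon are implemented): it sits in front of a service, forwards each request upstream, and records telemetry the application never had to emit itself.

```python
# Toy "sidecar" proxy illustrating how a service mesh gains visibility: the
# application behind UPSTREAM needs no instrumentation at all. A conceptual
# sketch only, not how Istio/Envoy or Tetragon actually work.
import logging
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://localhost:9000"   # the real application (hypothetical address)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

class SidecarProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        start = time.monotonic()
        with urllib.request.urlopen(UPSTREAM + self.path) as upstream:
            body = upstream.read()
            status = upstream.status
        # Telemetry observed from outside the application:
        logging.info("GET %s -> %d (%d bytes, %.1f ms)", self.path, status,
                     len(body), 1000 * (time.monotonic() - start))
        self.send_response(status)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), SidecarProxy).serve_forever()
```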
Another is to build on INT. Whereas network devices historically were only minimally instrumented (e.g., with packet counters), INT builds on the rise of programmable networking hardware to enable fine-grained and programmable instrumentation of the data plane.
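As a rough sketch of what the collector side of this might look like, suppose each packet accumulates a small metadata record from every switch along its path, say a switch ID, a queue depth, and a hop latency; an analysis pipeline can then unpack and inspect those records. The 12-byte layout below is invented for illustration and does not follow the actual INT specification.

```python
# Simplified sketch of unpacking per-hop telemetry carried in packets.
# The record layout is invented for illustration; the real INT spec defines
# its own headers and metadata encodings.
import struct

# switch_id (4 bytes), queue_depth (2), reserved (2), hop_latency_ns (4)
HOP_RECORD = struct.Struct("!IHHI")

def parse_hops(metadata: bytes):
    """Yield (switch_id, queue_depth, hop_latency_ns) for each hop record."""
    for offset in range(0, len(metadata), HOP_RECORD.size):
        switch_id, queue_depth, _, latency_ns = HOP_RECORD.unpack_from(metadata, offset)
        yield switch_id, queue_depth, latency_ns

# Two fabricated hop records: the packet visited switch 1, then switch 7.
sample = HOP_RECORD.pack(1, 12, 0, 4200) + HOP_RECORD.pack(7, 87, 0, 25100)
for sw, depth, latency in parse_hops(sample):
    print(f"switch {sw}: queue_depth={depth}, hop_latency={latency} ns")
```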
In both cases, the key insight is that it’s difficult to know in advance (a) what questions you will want to ask, and (b) what specific workflows to zoom in on. Being able to programmatically re-instrument the code path (data plane) is sometimes the only option. Supporting that kind of flexibility is consistent with the ultimate goal of observability, but knowing where to embed “programmable instruments” is the $64,000 question. While we still have a lot of work to do, it’s clear that Observability has achieved top billing with all the other desirable qualities of distributed systems, and the tools are rapidly evolving to keep up with the demand.
Following up on our earlier post about APIs, we sat down for a conversation with financial services CIO Tim Batten to discuss how API adoption proves challenging for businesses – here is the video. We also received a delightful foreword to our TCP Congestion Control book from the legendary computer scientist and textbook author Jim Kurose, which means we’re going to be producing ebook and print versions of that book very soon.