Applying a Systems Lens to Software Testing

What Should Our Acceptance Criteria Be?

Oct 30, 2023

A large chunk of the bandwidth of the Systems Approach team has been consumed in recent months by trying to increase the robustness and ease of use of Aether, an edge cloud platform for delivering private 5G service. And of course you can’t have a robust, usable platform without testing, which has proven to be something of a pain-point. So that has provided the inspiration for this week’s post, as we try to bring the Systems lens to testing.

Perhaps the single biggest aspect of systems building I’ve come to appreciate since shifting my focus from academic pursuits to open source software development is the importance of testing and test automation. In academia, it’s not much of an overstatement to say that we teach students about testing only insofar as we need test cases to evaluate their solutions, and we have our grad students run performance benchmarks to collect quantitative data for our research papers, but that’s pretty much it. There are certainly exceptions (e.g., Software Engineering focused curricula), but my experience is that the importance placed on testing in academia is misaligned with its importance in practice.

I said I appreciate the role of software testing, but I’m not sure that I understand it with enough clarity and depth to explain it to anyone else. As is the nature of our Systems Approach mindset, I’d like to have a better understanding of the “whys”, but mostly what I see and hear is a lot of jargon: unit tests, smoke tests, soak tests, regression tests, integration tests, and so on. The problem I have with this and similar terminology is that it’s more descriptive than prescriptive. Surely an integration test (for example) is a good thing, and I can see why you could reasonably claim that a particular test is an integration test, but I’m not sure I understand why it’s either necessary or sufficient in the grand scheme of things. (If that example is too obscure, here’s another example posted by a Reddit user.) The exception might be unit tests, where code coverage is a quantifiable metric, but even then, my experience is that more value is being put on the ability to measure progress than its actual contribution to producing quality code.

With that backdrop, I have recently found myself trying to perform triage on the 700+ QA jobs (incurring substantial monthly AWS charges) that have accumulated over the last five years on the Aether project. I don’t think the specific functionality is particularly important—Aether consists of four microservice-based subsystems, each deployed as a Kubernetes workload on an edge cloud—although it probably is relevant that the subsystems are managed as independent open source projects, each with its own team of developers. The projects do, however, share common tools (e.g., Jenkins) and feed into the same CI/CD pipeline, making it fairly representative of the practice of building systems from the integration of multiple upstream sources.

What is clear from my “case study” is that there are non-trivial tradeoffs involved, with competing requirements pulling in different directions. One is the tension between feature velocity and code quality, and that’s where test automation plays a key role: providing the tools to help engineering teams deliver both. The best practice (which Aether adopts) is the so-called Shift Left strategy: introducing tests as early as possible in the development cycle (i.e., towards the “left” end of the CI/CD pipeline). But Shift Left is easier in theory than in practice because testing comes at a cost, both in time (developers waiting for tests to run) and resources (virtual and physical machines needed to run the tests).

What Happens in Practice?

In practice, what I’ve seen is heavy dependency on developers manually running component-level functional tests. These are the tests most people think of when they think of testing (and when they post jokes about testing to Reddit), with independent QA engineers providing value by looking for issues that developers miss, yet still failing to anticipate critical edge cases. In the case of Aether, one of the key functional tests exercises how well developers have implemented the 3GPP protocol spec, a task so complex that the tests are commonly purchased from a third-party vendor. As for automated testing, the CI/CD pipeline performs mostly pro forma tests (e.g., does it build, does it have the appropriate copyright notice, has the developer signed the CLA) as the gate to merging a patch into the code base.

That puts a heavy burden on post-merge integration tests, where the key issue is to ensure sufficient “configuration coverage”, that is, validating that the independently developed subsystems are configured in a way that represents how they will be deployed as a coherent whole. Unit coverage is straightforward; whole-system coverage is not. For me, the realization that configuration management and testing efficacy are deeply intertwined is the key insight. (It is also why automating the CI/CD pipeline is so critical.)

To make this a little more tangible, let me use a specific example from Aether (which I don’t think is unique). To test a new feature—such as the ability to run multiple User Plane Functions (UPF), each serving a different partition (Slice) of wireless devices—it is necessary to deploy a combination of (a) the Mobile Core that implements the UPF, (b) the Runtime Controller that binds devices to UPF instances, and (c) a workload generator that sends meaningful traffic through each UPF. Each of the three components comes with its own “config file”, which the integration test has to align in a way that yields an end-to-end result. In a loosely coupled cloud-based system like Aether, integration equals coordinated configuration.
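To illustrate what “coordinated configuration” might mean in code, here is a minimal sketch of a pre-flight check an integration test could run before deploying anything. The dictionary layouts, key names (`upfs`, `slice_to_upf`, `traffic_slices`), and the `check_alignment` function are all hypothetical; they are not Aether’s actual config schema, just a stand-in for three independently maintained config files.

```python
def check_alignment(core_cfg, controller_cfg, generator_cfg):
    """Return a list of mismatches between the three (hypothetical) configs.

    An empty list means the configs are consistent enough to attempt
    an end-to-end test.
    """
    problems = []
    upfs = set(core_cfg["upfs"])
    bindings = controller_cfg["slice_to_upf"]

    # Every slice must be bound to a UPF that the Mobile Core actually runs.
    for slice_name, upf in bindings.items():
        if upf not in upfs:
            problems.append(f"slice {slice_name!r} bound to unknown UPF {upf!r}")

    # Every UPF should have at least one slice bound to it, or it goes untested.
    for upf in sorted(upfs - set(bindings.values())):
        problems.append(f"UPF {upf!r} has no slice bound to it")

    # Every slice should receive traffic from the workload generator.
    with_traffic = set(generator_cfg["traffic_slices"])
    for slice_name in sorted(set(bindings) - with_traffic):
        problems.append(f"slice {slice_name!r} receives no generated traffic")

    return problems


if __name__ == "__main__":
    core = {"upfs": ["upf-1", "upf-2"]}
    controller = {"slice_to_upf": {"slice-a": "upf-1", "slice-b": "upf-2"}}
    generator = {"traffic_slices": ["slice-a"]}
    for problem in check_alignment(core, controller, generator):
        print(problem)
```

The point of the sketch is that none of the three configs is wrong on its own; a mismatch only becomes visible when they are checked against each other.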

Now imagine doing that for each new feature that gets introduced: either the number of unique configurations explodes, or you figure out how to get adequate feature coverage by selectively deciding which combinations to test and which not to test. I don’t have a good answer as to how to do this, but I do know that it requires both insider knowledge and judgment. Experience also shows that many bugs will only be discovered through usage, which says to me that pre-release testing and post-release observability are closely related. Treating release management (including staged deployments) as just another stage of the testing strategy is a sensible holistic approach.
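One well-known heuristic for taming that explosion is pairwise (all-pairs) testing: rather than running every combination, pick a smaller set in which every pair of option values appears together at least once. The sketch below is a simple greedy version, offered as an illustration rather than as what Aether does; the feature names are made up, and a real selection would still need the insider knowledge mentioned above to decide which interactions matter.

```python
from itertools import combinations, product


def all_configs(options):
    """Enumerate every combination of option values as a list of dicts."""
    names = sorted(options)
    value_lists = [options[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


def pairs_of(config):
    """Return the set of (option, value) pairs-of-pairs a config exercises."""
    return set(combinations(sorted(config.items()), 2))


def pairwise_subset(options):
    """Greedily choose configs until every value pair is covered once."""
    candidates = all_configs(options)
    uncovered = set()
    for config in candidates:
        uncovered |= pairs_of(config)

    chosen = []
    while uncovered:
        # Pick the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen


if __name__ == "__main__":
    # Hypothetical feature toggles, purely for illustration.
    options = {
        "multi_upf": [False, True],
        "slicing": [False, True],
        "qos": [False, True],
        "roaming": [False, True],
    }
    print("exhaustive:", len(all_configs(options)))  # 2**4 = 16
    print("pairwise:  ", len(pairwise_subset(options)))
```

The greedy loop enumerates the full product first, so it only works for small option spaces, and it gives up coverage of three-way and higher interactions. That is exactly the kind of tradeoff that calls for judgment.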

Going back to where I started—trying to understand software testing through the systems lens—I don’t think I’ve satisfied my personal acceptance criteria. There are a handful of design principles to work with, but the task still feels like equal parts art and engineering. As for the triage I set out to perform on the set of Aether QA jobs, I’m still struggling to separate the wheat from the chaff. That’s a natural consequence of an evolving system, without a clear plan to disable obsolete tests. This remains a work-in-progress, but one clear takeaway is that both students and practitioners would be well-served by having a rigorous foundation in software testing.

With energy costs comprising up to 10% of their operating costs, it should be no surprise that network operators are looking for ways to save energy. The SMaRT-5G Project at ONF has recently published a whitepaper laying out a broad research agenda to address the challenge.

Preview photo this week by Bruce Davie
