TCP x 5G: Mind the Gap
While we’ve been working on both TCP congestion control and wireless networking for some time, only recently did we start to look at the interactions between the two. This week’s newsletter looks at this intersection, which is a prime example of how systems thinking can be applied.
As we tie up the loose ends in the TCP Congestion Control book, I found myself reviewing the literature on optimizing TCP for wireless networks, and wondering if 5G is going to change the equation. The short answer, I believe, is yes, but the reason for that (and the history behind that reason) is fascinating. The place you have to start is to recognize that the mobile cellular network and the Internet evolved in parallel, with each treating the other as an opaque box. Each generation of the cellular network admitted a few more Internet-based mechanisms into its internal structure, but 5G is now embracing some of the best practices in building scalable Internet services. That puts us on the cusp of finally being able to bridge the gap between these two global networks. By applying a systems view to the increasingly common case where TCP runs over wireless links, we can stop treating the Internet and 5G as distinct worlds, and improve the end-to-end performance of the overall system.
Both networks provide global connectivity, with the Internet traditionally treating the cellular network as an opaque last-hop technology, and the global cellular network using the Internet as a backbone interconnecting RAN aggregation points around the world. From the perspective of the TCP congestion control algorithm trying to find the available end-to-end bandwidth, the RAN has always been problematic. This is for three main reasons: (1) the wireless link has typically been the bottleneck in the end-to-end path due to the scarcity of radio spectrum, (2) the bandwidth available in the RAN can be highly variable due to a combination of device mobility and environmental factors, and (3) this environmental variability can also lead to the basestation retransmitting corrupted segments, which results in variable latency.
To further complicate matters, the internals of the RAN have been largely closed and proprietary, with vendors treating their radio scheduling algorithms as a critical piece of their intellectual property. Researchers have experimentally observed that there is significant buffering at the edge, presumably to absorb the expected contention for the radio link while still keeping sufficient work "close by" for whenever capacity does open up. (For example, see a paper by Haiqing Jiang and colleagues in the 2012 CellNet Workshop.) This large buffer is an instance of bufferbloat, which is problematic for TCP because it causes the sender to overshoot the actual bandwidth available on the radio link, and in the process, introduces significant delay and jitter.
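To get a feel for why a deep buffer at the edge hurts, it helps to work out how long a packet waits behind a full queue. The sketch below is just arithmetic: the 1 MB buffer and the link rates are illustrative assumptions of mine, not measurements from the paper cited above.

```python
def queueing_delay_ms(buffer_bytes: int, link_rate_bps: float) -> float:
    """Time needed to drain a full buffer at the given link rate, in milliseconds."""
    return buffer_bytes * 8 / link_rate_bps * 1000.0


# Hypothetical values: a 1 MB per-device buffer and a radio link
# whose rate drops as conditions get worse.
BUFFER_BYTES = 1_000_000

for rate_mbps in (50, 10, 2):
    delay = queueing_delay_ms(BUFFER_BYTES, rate_mbps * 1e6)
    print(f"{rate_mbps:>3} Mbps link: {delay:.0f} ms of queueing delay when the buffer is full")
```

With these assumed numbers, a full buffer adds roughly 800 ms at 10 Mbps. The same buffer gets more expensive exactly when the radio link gets slower, which is why variable bandwidth and deep queues are such an unfortunate combination.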
Other researchers, notably the authors of the BBR congestion control algorithm, have observed that the scheduler for the wireless link actually uses the number of queued packets for a given client as an input to its scheduling algorithm, rewarding senders for building up a queue by increasing the bandwidth they receive. BBR attempts to take advantage of this incentive in its design by being aggressive enough to queue at least some packets in the buffers of wireless links (but that aggression is not universally fair).
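The incentive described above can be illustrated with a toy model. Real RAN schedulers are proprietary and far more sophisticated, so the function below is purely hypothetical: `allocate` and its backlog-proportional rule are my own stand-ins, chosen only to show what happens when queue depth is an input to the scheduling decision.

```python
def allocate(capacity_bps: float, queued_bytes: dict[str, int]) -> dict[str, float]:
    """Hypothetical scheduler: split link capacity in proportion to each client's backlog.

    This is NOT any vendor's algorithm; it only models "more queued data,
    more bandwidth" in the simplest possible way.
    """
    total = sum(queued_bytes.values())
    if total == 0:
        return {client: 0.0 for client in queued_bytes}
    return {client: capacity_bps * q / total for client, q in queued_bytes.items()}


# Two senders share a 30 Mbps link. One paces carefully and keeps a small
# backlog; the other is aggressive enough to keep a deeper queue.
shares = allocate(30e6, {"paced": 20_000, "aggressive": 100_000})
for client, bps in shares.items():
    print(f"{client}: {bps / 1e6:.1f} Mbps")
```

Under this assumed rule the sender with the deeper queue ends up with most of the link, which is the behavior a congestion control algorithm can be tempted to exploit.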
Given this fundamental tension between the RAN scheduler needing to keep the packet buffer full so it can exploit capacity that becomes available (as part of a ~1ms control loop) and the TCP congestion control mechanism pacing transmissions to match the available end-to-end bandwidth (as part of a ~100ms control loop), an optimal solution is likely not attainable. But is there an opportunity to do better? I believe the answer is yes.
There are three specific reasons for that optimism (which I’ll get to in a moment), but they all hinge on acknowledging that the opaque wall between the Internet and the mobile cellular network is an anachronism. In particular, with TCP traffic now being such an important use case for 5G, the focus needs to shift to delivering end-to-end goodput and maximizing the throughput/latency ratio, rather than focusing solely on maximizing the utilization of the radio spectrum. The latter is, of course, important, but it is a means to an end, not the end itself. Now to the three specific changes on the horizon.
First, with open software-based implementations of the RAN becoming a reality (see our companion 5G and SDN books for more details), it will soon be possible to take a cross-layer approach to congestion control, whereby the RAN provides an interface that gives higher layers of the protocol stack visibility into what goes on inside its queues and scheduler. Incumbent vendors will likely be hesitant to reveal too much information (which I discuss in a related post), but high-level signals that support Active Queue Management should be sufficient to make a difference.
Second, 5G deployments are promising to support network slicing, a mechanism that isolates different classes of traffic. Each slice will have its own queue, and that queue can be sized and serviced in traffic-specific ways. As a consequence, long-distance TCP flows need not be scheduled in the same way as, say, local IoT traffic. Moreover, assuming there is sufficient opportunistic traffic to keep spectrum utilization high, it should be possible to reduce the variability in the bandwidth set aside for TCP traffic, especially those flows with high round-trip times. One caveat is that support for slicing is still mostly aspirational, but this use case is an example of how it might provide value.
Third, it will become increasingly common for 5G-connected devices to be served from a nearby edge cloud rather than from the other side of the Internet. This means TCP connections will likely have much shorter round-trip times, which will make the congestion control algorithm more responsive to changes in the available capacity in the RAN. As we discussed in a recent post, this “segmentation” of the end-to-end path is already starting to happen, but we expect it to become even more pronounced as on-premises edge clouds start to host both the RAN’s central elements and application end-points. (See Aether for an example of a 5G-enabled edge cloud.)
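To see why a shorter round-trip time matters so much, consider a back-of-the-envelope sketch. It assumes a classic Reno-style sender (halve the window on loss, then grow it by one segment per RTT), a 1460-byte segment size, and a 50 Mbps radio link; all three are my assumptions for illustration, and algorithms such as CUBIC or BBR ramp up differently.

```python
def aimd_recovery_time_s(rate_bps: float, rtt_s: float, mss_bytes: int = 1460) -> float:
    """Seconds for a Reno-style sender to grow from half the BDP back to the full BDP.

    Additive increase adds one segment per RTT, so recovering half a window
    of W segments takes W/2 round trips.
    """
    bdp_segments = rate_bps * rtt_s / (8 * mss_bytes)
    rtts_needed = bdp_segments / 2
    return rtts_needed * rtt_s


# Same assumed 50 Mbps link, served from across the Internet vs. a nearby edge cloud.
for rtt_ms in (100, 10):
    t = aimd_recovery_time_s(50e6, rtt_ms / 1000)
    print(f"RTT {rtt_ms:>3} ms: about {t:.2f} s to recover after a loss")
```

Because both the window size and the time per step shrink with the RTT, recovery time in this model scales with the square of the RTT: cutting the round trip from 100 ms to 10 ms takes the recovery from roughly 21 seconds to roughly 0.2 seconds.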
There are no guarantees, of course, and there will be other variables in play (e.g., small cells will likely reduce the number of flows competing for bandwidth at a given basestation), but these factors point to there being ample opportunities to tune congestion control algorithms well into the future. The big takeaway, though, is that these opportunities are the consequence of taking a big-picture (systems) approach to the problem, working across traditional boundaries instead of myopically trying to optimize one narrow aspect of the system as a whole.
We’ve now been producing this newsletter every two weeks (fortnightly!) since February, and we are going to take a little time off over the holiday period. We’ll be back in January with fresh content. If you need some holiday reading, perhaps you would like to review our latest books that are still in the draft stage: Operating an Edge Cloud: A Systems Approach, and TCP Congestion Control: A Systems Approach. You can raise issues, submit PRs, or check out the TODO lists on GitHub (ops, tcpcc). Or check out our older posts which seem to keep finding fresh relevance, such as “What can we learn from Internet Outages” or “The Accidental SmartNIC”, which inspired Bruce’s recent keynote at the Euro P4 Workshop. See you in the New Year.