RPC vs TCP (Redux)
We’ve given serious thought to the question of whether there is an entire book to be written about RPC from the perspective of networking. We have been writing about the possibilities for a request/response transport paradigm (as an alternative to TCP’s reliable byte stream) since the 1990s, and yet we seem to keep finding new angles to approach the problem. We’re not sure yet if there is a whole book in this topic, but we certainly have enough material for this week’s newsletter and probably more to come.
Reposts of our recent RPC vs TCP article generated considerable commentary (for example, see Mastodon), and some of the feedback was quite helpful. (Some comments on other sites reminded us why we don’t spend a lot of time reading comment threads.) The considerable interest in the topic convinced me that a follow-up post would be worth doing. I have three observations, which, for your reading convenience, are filtered to exclude all ad hominem flamethrowing.
The first observation is that there has long been a “parallel universe” in which the High Performance Computing (HPC) community has created their own networking substrate—from communication hardware (InfiniBand) to end-to-end software (MPI, Active Messages)—without being overly concerned about broad interoperability (which is the hallmark of Internet technology). The goal is simple: maximize throughput and minimize latency, under the simplifying assumption that you have full control over both sides of the communication. This approach started on purpose-built multiprocessors, but as those architectures gave way to cloud-based commodity hardware, solutions like RDMA (Remote DMA) over Converged Ethernet (RoCE) started to get traction.
RDMA gives the sender (caller) the ability to directly address memory on the receiver (callee). This requires tight coupling between the two parties, as would be the case in a parallel program, but is less generally applicable when building distributed systems since (a) someone else is likely responsible for the service you’re calling, and (b) you can’t be sure if that service runs on the same server, another server in the same datacenter, or in a datacenter across the country. RDMA was originally included as part of InfiniBand; RoCE is a variant of that idea suitable for running over commodity Ethernet. Its routable version, RoCEv2, runs on top of UDP, sacrificing performance (compared to InfiniBand) in return for supporting commodity cloud deployments.
The second observation is that there is a feature of QUIC that I had not appreciated. Christian Huitema pointed out that QUIC can be used without HTTP/3, and as a consequence, could serve as a general-purpose request/reply protocol underpinning any RPC framework. It is reassuring to see the request/reply protocol decoupled from the application domain. That’s a major step forward, and caused me to wonder if it might be possible to fold some of Homa’s latency optimizations into QUIC, giving Homa an alternative path to wider adoption. But I see two issues that will need to be addressed.
The first is that a claimed advantage of QUIC is that it runs in user space, and if that remains its dominant deployment scenario, it negates some of Homa’s latency improvements, namely those that are due to being kernel-resident. While QUIC can also be implemented in the kernel (for example, see the work of Peng Wang et al.), it’s not clear why retrofitting Homa-inspired techniques into QUIC is an improvement over natively running Homa in the kernel. There’s more to say on this topic, but I’ll save it for my third observation. The second issue centers around Homa’s approach to congestion control, which I consider its main contribution. Debates about congestion control seldom turn out well for the new kid on the block, but QUIC does do a good job of modularizing congestion control (particularly when it runs in user space), and so it may offer a viable deployment path.
Saving the best for last, my third observation is the consequence of an exchange between David Reed and John Ousterhout. David argued that “RPC was one big reason for creating UDP as a datagram ‘process addressing’ support layer” and that “...argument passing and waiting for a return was way too high level for a transport protocol.” John’s response was to call out the distinction between “RPC transport” and things like “argument processing [that] fit pretty naturally at the application level.” These two positions are consistent with the overall framing I laid out in my original post, and worth a closer look.
On the one hand, both the End-to-End Argument and Application-Level Framing (ALF)—two stalwart Internet design principles—point to the end-point (with the application process being the ultimate end-point) as knowing best. On the other hand, good system design is always looking for opportunities to carve out common functionality that can be packaged as a general-purpose tool and pushed down into the underlying platform. (The e2e paper acknowledges this tension, albeit from a performance perspective.) Doing so both frees applications from having to reinvent the wheel, and perhaps more importantly, makes sure complex functionality is implemented correctly.
Let’s apply that tradeoff to TCP and RPC. No one would argue that implementing a reliable byte stream service, bundled with a fair congestion control algorithm, should be left to the application. (Unless, of course, the application is real-time multimedia, which is exactly the use case that motivated ALF.) So why is an RPC transport any different? I don’t see how it is: Its complexity is on par with TCP’s, ensuring (enforcing) well-behaved congestion control is important, and the set of applications it supports is substantial. It’s a nearly perfect example of an end-to-end networking substrate that your OS should provide. Of course an RPC transport can always be implemented on top of UDP, but the same is true of TCP.
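To make the "on top of UDP" point concrete, here is a minimal sketch of a request/reply exchange built directly on datagrams. It is purely illustrative and is not how Homa or QUIC work: the 4-byte request-ID header, the function names (serve_once, call, demo), and the timeout and retry values are all assumptions made for this example.

```python
import socket
import struct
import threading

# Hypothetical wire format: a 4-byte request ID followed by the payload.
HEADER = struct.Struct("!I")

def serve_once(sock, handler):
    """Receive one request datagram and send back one reply datagram."""
    data, addr = sock.recvfrom(65535)
    (req_id,) = HEADER.unpack_from(data)
    reply = handler(data[HEADER.size:])
    sock.sendto(HEADER.pack(req_id) + reply, addr)

def call(sock, server_addr, req_id, payload, timeout=0.5, retries=3):
    """Send a request and wait for the matching reply, retransmitting on timeout."""
    sock.settimeout(timeout)
    message = HEADER.pack(req_id) + payload
    for _ in range(retries):
        sock.sendto(message, server_addr)
        try:
            while True:
                data, _ = sock.recvfrom(65535)
                (got_id,) = HEADER.unpack_from(data)
                if got_id == req_id:  # ignore replies to other requests
                    return data[HEADER.size:]
        except socket.timeout:
            continue  # no reply in time: retransmit
    raise TimeoutError("no reply after %d attempts" % retries)

def demo():
    """Run one request/reply exchange over loopback and return the reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # let the OS pick a free port
    worker = threading.Thread(
        target=serve_once, args=(server, bytes.upper), daemon=True
    )
    worker.start()
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        return call(client, server.getsockname(), 7, b"ping")
    finally:
        worker.join(timeout=1)
        client.close()
        server.close()
```

What the sketch leaves out is the point of the argument: it has no congestion control, no fragmentation for large messages, and no duplicate suppression on the server (a retransmitted request would simply be executed again). Filling in those gaps correctly is where the complexity on par with TCP’s comes from.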
If we weren’t assuming a monolithic kernel, we might be able to have a different discussion. For example, an Exokernel would let me run Homa in my LibraryOS, TCP in your LibraryOS, and perhaps RoCE or QUIC in yet another LibraryOS. But that’s not the world we live in. Someone has to decide what functionality does and does not get to run in privileged mode (and by extension, in the SmartNIC or IPU), and that decision impacts performance—especially latency—which is at the heart of the case for RPC in the datacenter. The HPC community realized this years ago, and deviated from Internet standards in response. My view of Homa is that it tries to achieve similar performance in low-latency environments, but in a more interoperable way. After 40 years, it makes sense that a second design point—a request/reply transport protocol, perhaps some variant of Homa/QUIC—sits side-by-side with TCP in the kernel.
But that’s just my opinion, and I don’t have a vote. What I find fascinating is questions about how systems evolve over time, and the Internet is fertile ground for such an archeological dig (especially as it relates to the OS kernel). Given all the competition TCP/IP has faced over the years—in a battle for survival of the fittest—I now better appreciate the symbiotic role UDP played in its success. If you’re going to promote a take-over-the-world substrate, pair it with a minimal side-kick as a way of postponing decisions about alternative designs.
Somewhat belatedly, we came across Scott Shenker’s position paper suggesting some significant changes to the SIGCOMM conference (starting with much higher acceptance rates). We think the proposed changes warrant consideration; you can join the SIGCOMM Slack channel to discuss them, and also to get access to the paper, which should be freely available but somehow ended up paywalled. We found Cory Doctorow was on target (as usual) with his latest on Large Language Models, which has a fair bit in common with our recent post but is longer and funnier.