Hi everyone,

We would like to propose a discussion on kernel support for new virtual memory architectures, in the context of RISC-V, and on how we can make progress in the presence of a chicken-and-egg dilemma between hardware and OS. We're not sure whether LSF/MM/BPF is the right venue – if anyone knows a better platform, please do let us know.

TL;DR
---------

Virtual memory is a major performance bottleneck for memory-intensive datacenter workloads, due to the overhead of page-table walks. Many fast virtual memory architectures have been designed, but none has made it into real hardware. The root cause is the chicken-and-egg dilemma between hardware vendors and kernel maintainers (virtual memory crosses the hardware-software boundary). Hardware vendors (including Intel and ARM, as well as RISC-V) are reluctant to invest without "approval" from Linux, while the Linux community seems to have little interest in futuristic hardware architectures. As a result, it is hard to innovate beyond incremental patches.

We hope to discuss:

* What would be the right protocol for solving the chicken-and-egg problem? How do we get a thumbs-up from Linux that can convince hardware vendors?

* What interfaces need to be in place to promote architectural innovation? A known problem is that Linux's mm code makes hard assumptions about the MMU's translation scheme.

Why now?
--------------

The overhead of virtual memory translation is not new, but it is becoming more pronounced for the following reasons:

* The significant growth in memory capacity, both of DRAM (terabytes of DRAM is no longer uncommon) and of expanded memory (e.g., CXL is coming).

* TLBs and MMU caches are starting to show low locality, due to (1) the gap with ever-growing memory capacity and (2) the increasingly irregular memory access patterns of modern workloads (think bioinformatics and even graph workloads).

* Nested translation in virtualized environments multiplies the translation overhead.

* Huge pages and better MMU caches help, but are fundamentally limited, as recent research has repeatedly shown.

Context
-----------

We have been designing new virtual memory architectures that can significantly accelerate memory translation, such as DMT (https://dl.acm.org/doi/10.1145/3620665.3640358) and ECPT
(https://dl.acm.org/doi/10.1145/3373376.3378493). These designs show very exciting results in research papers; however, bringing them into practice falls short due to the chicken-and-egg dilemma. When we talk to hardware vendors like Intel and, more recently, RISC-V, the main concern is always OS support (often specifically Linux support, as we target datacenter workloads) – we are asked to "get a nod from Linux kernel folks". However, we are also told that the Linux community is not interested in futuristic hardware, especially since Linux's mm code is pretty much hardwired to tree-based translation architectures.

Let me make things more concrete using the DMT architecture (which we are proposing to RISC-V). DMT is a design that is fully compatible with x86-64. The basic idea is to regulate the layout of x86-64 page tables in physical memory in a way that allows the MMU to skip the intermediate page-table entries and directly fetch the last-level entries. In this way, DMT can achieve address translation in one memory access on bare metal, and in two for virtual machines. As its key principle, DMT relies on the OS to allocate and manage the page tables so as to enable simple, fast hardware translation. Would Linux embrace the support needed by DMT (which may make mm a bit more complex)? If so, how do we get a nod, and through which protocol?

Forward looking
----------------------

DMT takes x86-64 compatibility as a first-class design principle, in the hope of adoption; the OS support it needs is nontrivial but modest. Other advanced virtual memory architectures require disruptive OS changes, such as ECPT, which uses elastic cuckoo hashing to organize process-private page tables. Today, supporting such designs is not possible without major
surgery on the Linux code. That is the reason we failed to bring ECPT into practice after talking to hardware vendors like Intel. In a recent research project, we developed a new interface (named EMT) that refactors Linux into a form that treats mm code as a driver for the MMU, and can thus support different memory translation architectures with modularized effort – think of EMT as something similar to VFS, but for virtual memory. What is exciting is that EMT incurs negligible performance overhead (on both micro- and macro-benchmarks) through careful design and engineering. If you are interested in EMT, we are happy to discuss that as well.

Certainly, our main goal is still to find the right protocol (which is likely less technical and perhaps more political) for solving the chicken-and-egg dilemma, so that we can enable innovation on faster, more efficient memory architectures to benefit everyone.

Thanks,

Jiyuan Zhang
Tianyin Xu

Department of Computer Science
University of Illinois Urbana-Champaign