[LSF/MM/BPF TOPIC] page table reclaim

David Hildenbrand <david@xxxxxxxxxx> · Tue, 22 Feb 2022 09:56:20 +0100

Hi all,

we are aware of workloads that can trigger allocation of a lot of page
tables that are essentially unnecessary. The obvious candidates are
processes that dynamically manage memory consumption in large, sparse
memory mappings, e.g., via madvise(MADV_DONTNEED): hypervisors that
implement memory ballooning or virtio-mem, and memory allocators.

In fact, it's easy to have a process that consumes almost exclusively
page tables, and it's hard to distinguish between "malicious" and
"sane" workloads when just looking at the page table consumption. I have
quite a few neat examples that I can present.
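
As a rough illustration of such a process (a minimal sketch, not one of
the examples mentioned above): it touches one page per PTE table in a
sparse anonymous mapping and then zaps everything again with
madvise(MADV_DONTNEED). The helper names status_kb(), pte_table_span()
and leave_empty_page_tables() are made up for this sketch, and it
assumes Linux with 8-byte page table entries (e.g., x86-64) and a
readable /proc/self/status. MADV_NOHUGEPAGE is requested so that THP
doesn't hide the effect.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Read a "Key:  <n> kB" field from /proc/self/status; -1 if unavailable. */
long status_kb(const char *key)
{
    FILE *f = fopen("/proc/self/status", "r");
    size_t klen = strlen(key);
    char line[256];
    long val = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            if (sscanf(line + klen + 1, "%ld", &val) != 1)
                val = -1;
            break;
        }
    }
    fclose(f);
    return val;
}

/* Bytes of address space covered by one PTE table (assumes 8-byte PTEs). */
size_t pte_table_span(size_t page_size)
{
    return (page_size / 8) * page_size;
}

/*
 * Touch one page in each of `tables` PTE-table-sized windows, then zap
 * all pages again. Returns the VmPTE growth in kB measured while the
 * mapping still exists, or -1 on error.
 */
long leave_empty_page_tables(size_t tables)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t span = pte_table_span(page);
    size_t len = tables * span;
    long before, after;
    size_t i;
    char *p;

    assert(tables > 0);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    /* Best effort: may fail if THP is not configured. */
    (void)madvise(p, len, MADV_NOHUGEPAGE);

    before = status_kb("VmPTE");
    for (i = 0; i < tables; i++)
        p[i * span] = 1;                /* faults in one page per window */
    if (madvise(p, len, MADV_DONTNEED)) {   /* frees the data pages only */
        munmap(p, len);
        return -1;
    }
    after = status_kb("VmPTE");

    munmap(p, len);                     /* only now are the tables freed */
    if (before < 0 || after < 0)
        return -1;
    return after - before;
}
```

Calling leave_empty_page_tables(64) from a small test program and
printing the result should show VmPTE still elevated after the
MADV_DONTNEED call, although no data pages remain mapped at that point.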

Page tables are unmovable in memory and cannot get swapped out. So heavy
page table consumption isn't only problematic because we end up wasting
system RAM and fragmenting it with unmovable allocations: it's also a
problem when big portions of system RAM are managed by CMA/ZONE_MOVABLE,
where we can simply run out of system RAM available for unmovable
allocations and eventually harm the system / other workloads on the
same machine.

One approach I'd like to discuss is page table reclaim: reclaiming
unnecessary page tables, which involves a lot of challenges.

1. Efficient page table reclaim

"Ripping out" a page table is an expensive and highly complicated
operation: just take a look at khugepaged. We have to block all page
table walkers, which requires the mmap_lock in write mode, the rmap
lock, and proper synchronization with GUP-fast.

In the simplest approach, we'd scan for candidate page tables and then
rip them out. But:
* How to scan for candidate page tables efficiently?
* How to avoid the mmap_lock in write mode when removing a page table?
* How to avoid the rmap lock (just imagine a page table spanning
  multiple rmaps)?

But also: how to make the implementation simple and appealing enough to
get merged upstream? For example, the last attempt [1] to reclaim empty
PTE page tables automatically once the last PTE was zapped has not been
merged yet, because it certainly adds complexity. How to avoid that
complexity?
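
To make the "free the table once the last PTE is zapped" idea concrete,
here is a toy userspace model. It is not the code from [1] and not
kernel code: struct toy_pmd, struct toy_pte_table, toy_map() and
toy_zap() are hypothetical names, and the model deliberately ignores
all locking and concurrent walkers, which is exactly where the
complexity discussed above comes from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Entries per table, as with 4 KiB pages and 8-byte entries. */
#define PTRS_PER_TABLE 512

struct toy_pte_table {
    uint64_t entry[PTRS_PER_TABLE];    /* 0 means "none" */
    unsigned int present;              /* number of non-zero entries */
};

struct toy_pmd {
    struct toy_pte_table *table;       /* NULL: no PTE table installed */
};

/* Install an entry, allocating the PTE table on demand. */
bool toy_map(struct toy_pmd *pmd, unsigned int idx, uint64_t val)
{
    assert(idx < PTRS_PER_TABLE && val != 0);
    if (!pmd->table) {
        pmd->table = calloc(1, sizeof(*pmd->table));
        if (!pmd->table)
            return false;
    }
    if (pmd->table->entry[idx] == 0)
        pmd->table->present++;         /* count only newly populated slots */
    pmd->table->entry[idx] = val;
    return true;
}

/*
 * Zap one entry. When the last populated entry goes away, the table is
 * freed right away. Returns true if the table was freed by this call.
 */
bool toy_zap(struct toy_pmd *pmd, unsigned int idx)
{
    assert(idx < PTRS_PER_TABLE);
    if (!pmd->table || pmd->table->entry[idx] == 0)
        return false;
    pmd->table->entry[idx] = 0;
    if (--pmd->table->present > 0)
        return false;
    free(pmd->table);
    pmd->table = NULL;
    return true;
}
```

In this model the per-table counter makes scanning unnecessary, but a
real implementation would have to keep that counter consistent against
every page table walker.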

2. Who triggers reclaim and when?

Letting an application trigger reclaim of page tables is the "easiest
solution": let's imagine madvise(MADV_RECLAIM_PGTABLES). However, this
doesn't take care of malicious workloads and gets more problematic when
sparse files are mapped into multiple processes. Further, there is no
need to reclaim if we're not under memory pressure.

Letting the system do this automatically looks "cleaner". But, when to
start reclaiming? How to detect and handle malicious processes (do we
care?)? How to set an adequate soft/hard limit?

3. Which page tables to reclaim?

While the obvious candidates are empty page tables, we can easily have
page tables all filled with the shared zeropage instead. Once again,
there are sane and malicious use cases. A sane use case is a simple VM
having a balloon inflated and triggering a memory dump like kdump: we'll
populate the shared zeropage everywhere and have plenty of page tables
we don't even care about.
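
The effect can be sketched from userspace by read-faulting every page
of a private anonymous mapping. This is only an illustration under
assumptions: Linux, a readable /proc/self/status, and read faults on
untouched anonymous memory being served by the shared zeropage (the
usual behavior, with MADV_NOHUGEPAGE requested to keep the huge
zeropage out of the picture). The helpers status_kb() and
fill_with_zeropage() are names made up for this sketch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Read a "Key:  <n> kB" field from /proc/self/status; -1 if unavailable. */
long status_kb(const char *key)
{
    FILE *f = fopen("/proc/self/status", "r");
    size_t klen = strlen(key);
    char line[256];
    long val = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            if (sscanf(line + klen + 1, "%ld", &val) != 1)
                val = -1;
            break;
        }
    }
    fclose(f);
    return val;
}

/*
 * Read one byte from every page of a fresh `len`-byte mapping.
 * Stores the VmPTE and VmRSS deltas (kB) and returns the number of
 * pages that read back as zero, or 0 if the mapping failed. If
 * /proc/self/status is unreadable, both samples are -1 and the
 * deltas are 0.
 */
size_t fill_with_zeropage(size_t len, long *pte_kb, long *rss_kb)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t off, faulted = 0;
    volatile const char *p;
    long pte0, rss0;
    void *map;

    assert(pte_kb && rss_kb);
    map = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (map == MAP_FAILED)
        return 0;
    /* Best effort: may fail if THP is not configured. */
    (void)madvise(map, len, MADV_NOHUGEPAGE);
    p = map;

    pte0 = status_kb("VmPTE");
    rss0 = status_kb("VmRSS");
    for (off = 0; off + page <= len; off += page) {
        if (p[off] != 0)               /* read fault, never a write */
            break;
        faulted++;
    }
    *pte_kb = status_kb("VmPTE") - pte0;
    *rss_kb = status_kb("VmRSS") - rss0;

    munmap(map, len);
    return faulted;
}
```

Under the assumptions above, I'd expect the VmPTE delta to grow with
the size of the mapping while the VmRSS delta stays roughly flat: the
process ends up holding page tables that map nothing but the zeropage.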

But once we talk about reclaiming page tables that are still populated
with the shared zeropage, why not reclaim page tables that are
"reconstructable", for example, because they don't map anonymous pages
and don't require special fault handling (userfaultfd?)?

While I do have answers to some of the questions and various ideas, it's
certainly an interesting topic to discuss and brainstorm.

[1] https://lkml.kernel.org/r/20211110105428.32458-1-zhengqi.arch@xxxxxxxxxxxxx

-- 
Thanks,

David / dhildenb