Hi all,

we are aware of workloads that can trigger the allocation of a lot of page tables that are essentially unnecessary. The obvious candidates are processes that dynamically manage memory consumption in large, sparse memory mappings, e.g., via madvise(MADV_DONTNEED): hypervisors that implement memory ballooning or virtio-mem, and memory allocators.

In fact, it's easy to have a process that consumes almost exclusively page tables, and it's hard to distinguish between a "malicious" and a "sane" workload when just looking at the page table consumption. I have quite a few neat examples that I can present.

Page tables are unmovable in memory and cannot get swapped out. So heavy page table consumption isn't only problematic because we end up wasting system RAM and fragmenting it with unmovable allocations; it's also a problem when big portions of system RAM are managed by CMA/ZONE_MOVABLE, where we can simply run out of system RAM available for unmovable allocations and eventually harm the system / other workloads on the same machine.

One approach I'd like to discuss is page table reclaim: reclaiming unnecessary page tables, which involves a lot of challenges.

1. Efficient page table reclaim

"Ripping out" a page table is an expensive and highly complicated operation: just take a look at khugepaged. We have to block all page table walkers, which requires the mmap_lock in write mode, the rmap lock, and proper synchronization with GUP-fast.

In the simplest approach, we'd scan for candidate page tables and then rip them out. But:

* How to scan for candidate page tables efficiently?
* How to avoid the mmap_lock in write mode when removing a page table?
* How to avoid the rmap lock (just imagine a page table spanning multiple rmaps)?

But also: how to make the implementation simple and appealing to get merged upstream?
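To make "candidate page tables" a bit more concrete, here is a minimal user-space sketch (not kernel code, and not one of the examples I mentioned above) that creates them: it touches one page per PTE table in a sparse anonymous mapping, zaps everything again via madvise(MADV_DONTNEED), and reads VmPTE from /proc/self/status at each step. The 4 KiB page size, the 512 entries per PTE table, the 1 GiB mapping size and all function names are assumptions for illustration; it needs Linux and Python 3.8+.

```python
import mmap
import re

PAGE_SIZE = 4096                              # assumption: 4 KiB base pages (e.g. x86-64)
PTES_PER_TABLE = 512                          # assumption: 512 entries per PTE table
PTE_TABLE_SPAN = PAGE_SIZE * PTES_PER_TABLE   # 2 MiB of address space per PTE table


def pte_table_bytes(mapping_size: int) -> int:
    """Bytes of PTE tables needed if every PTE table in the range is populated."""
    tables = -(-mapping_size // PTE_TABLE_SPAN)   # ceiling division
    return tables * PAGE_SIZE


def vm_pte_kib() -> int:
    """Page table memory of this process in kB, as reported in /proc/self/status."""
    with open("/proc/self/status") as f:
        match = re.search(r"^VmPTE:\s+(\d+) kB", f.read(), re.MULTILINE)
    if match is None:
        raise RuntimeError("VmPTE not found; is this Linux with procfs?")
    return int(match.group(1))


def demo(size: int = 1 << 30) -> None:
    # Private anonymous mapping; passing fileno -1 makes it anonymous.
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE)
    try:
        # Keep THP out of the picture so each touch populates a PTE table.
        region.madvise(mmap.MADV_NOHUGEPAGE)
    except OSError:
        pass  # kernel without THP support: nothing to disable

    base = vm_pte_kib()
    for offset in range(0, size, PTE_TABLE_SPAN):
        region[offset] = 1                    # one write fault per PTE table
    populated = vm_pte_kib()

    region.madvise(mmap.MADV_DONTNEED)        # zap all pages in the mapping
    zapped = vm_pte_kib()

    print(f"estimated PTE tables: {pte_table_bytes(size) // 1024} kB")
    print(f"VmPTE: base={base} kB, populated={populated} kB, "
          f"after MADV_DONTNEED={zapped} kB")
    region.close()


if __name__ == "__main__":
    demo()
```

If the kernel does not free page tables when zapping, the last VmPTE value should stay close to the "populated" one although nothing is mapped anymore; those leftover tables are the candidates a scanner would have to find. The exact numbers depend on kernel version and configuration.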
Regarding upstream acceptance: the last attempt to reclaim empty PTE page tables automatically once the last PTE was zapped [1] has not been merged yet because it certainly adds complexity. How to avoid that complexity?

2. Who triggers reclaim and when?

Letting an application trigger reclaim of page tables is the "easiest solution": let's imagine madvise(MADV_RECLAIM_PGTABLES). However, this doesn't take care of malicious workloads and is more problematic when sparse files are mapped into multiple processes. Further, there is no need to reclaim if we're not under memory pressure.

Letting the system do this automatically looks "cleaner". But when to start reclaiming? How to detect and handle malicious processes (do we care?)? How to set an adequate soft/hard limit?

3. Which page tables to reclaim?

While the obvious candidates are empty page tables, we can easily have page tables entirely filled with the shared zeropage instead. Once again, there are sane and malicious use cases. A sane use case is a simple VM that has a balloon inflated and triggers a memory dump like kdump: we'll populate the shared zeropage everywhere and have plenty of page tables we don't even care about.

But once we talk about reclaiming page tables that are still populated with the shared zeropage, why not reclaim page tables that are "reconstructable", for example, because they don't map anonymous pages and don't require special fault handling (userfaultfd?)?

While I do have answers to some of the questions and various ideas, it's certainly an interesting topic to discuss and brainstorm.

[1] https://lkml.kernel.org/r/20211110105428.32458-1-zhengqi.arch@xxxxxxxxxxxxx

--
Thanks,

David / dhildenb