On 2024/12/5 06:49, Andrew Morton wrote:
> On Wed, 4 Dec 2024 19:09:40 +0800 Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote:
>
> > ...
> > Previously, we tried to use a completely asynchronous method to reclaim
> > empty user PTE pages [1]. After discussing with David Hildenbrand, we
> > decided to implement synchronous reclamation in the madvise(MADV_DONTNEED)
> > case as the first step.
>
> Please help us understand what the other steps are, because we don't
> want to commit to a particular partial implementation only to later
> discover that completing that implementation causes us problems.
Although it is the first step, it is relatively independent: it solves
the problem (huge PTE page memory usage) in the madvise(MADV_DONTNEED)
case, while the later steps address the same problem in other cases.

Let me briefly describe all the plans I have in mind:
First step
==========
I plan to implement synchronous reclamation of empty user PTE pages in
the madvise(MADV_DONTNEED) case, for the following reasons:
1. It covers most of the known cases. (On ByteDance servers, all of the
observed huge PTE memory usage problems fall into this case.)
2. It helps verify the lock protection scheme and other infrastructure.
This is what this patch does (it only supports x86 so far). Once this is
done, support for more architectures will be added.
Second step
===========
I plan to implement asynchronous reclamation for madvise(MADV_FREE) and
other cases. The initial idea is to mark the vma first, then add the
corresponding mm to a global linked list, and then perform asynchronous
scanning and reclamation as part of the memory reclaim process.
Third step
==========
Based on the above infrastructure, we may try to reclaim all full-zero
PTE pages (PTE pages in which every present entry maps the shared
zeropage), which would benefit the memory balloon case mentioned by
David Hildenbrand.
Another plan
============
Currently, page table modifications are protected by the page table
locks (page_table_lock or the split pmd/pte locks), while the life
cycle of page table pages is protected by mmap_lock (and the vma lock).
For more details, please refer to the newly added
Documentation/mm/process_addrs.rst file.

When CONFIG_PT_RECLAIM is enabled, we already free the PTE pages
through RCU. In that case, we no longer need to hold mmap_lock for
read/write operations on the PTE pages. So maybe we can remove the page
tables from the protection of the mmap lock (which is too
coarse-grained), like this:
1. Free all levels of page table pages via RCU, not just PTE pages but
also PMD, PUD, etc.
2. Similar to pte_offset_map/pte_unmap, add
[pmd|pud]_offset_map/[pmd|pud]_unmap, make them all contain
rcu_read_lock/rcu_read_unlock, and make their callers accept failure.
In this way, we would no longer need the mmap lock. Readers, such as
page table walkers, would already be inside an RCU critical section;
writers would only need to hold the page table lock.
But there is a difficulty here: an RCU critical section is not allowed
to sleep, yet it is possible to sleep in the .pmd_entry callback, for
example in mmu_notifier_invalidate_range_start().
Use SRCU instead? Or an RCU + refcount approach? I'm not sure, but I
think it's an interesting thing to try.
Thanks!