Re: [RFC PATCH 00/18] Try to free user PTE page table pages

Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> · Tue, 17 May 2022 16:30:35 +0800

On 2022/4/29 9:35 PM, Qi Zheng wrote:
Hi,

This patch series tries to free user PTE page table pages when no one is
using them.

The beginning of this story is that some malloc libraries (e.g. jemalloc or
tcmalloc) usually allocate large ranges of VAs with mmap() and do not unmap
them. They use madvise(MADV_DONTNEED) to free physical memory when they want
to. But the page tables are not freed by madvise(), so a process can end up
with many page tables when it touches an enormous virtual address space.
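
The behaviour described above can be observed from userspace. The following
is a minimal sketch, not part of the patch series: read_vmpte_kb() and
touch_then_dontneed() are hypothetical helper names, and the 2 MiB stride
assumes x86-64 with 4 KiB pages, where one PTE page maps 2 MiB. Transparent
huge pages or a different architecture can change the numbers.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Return this process's VmPTE in kB from /proc/self/status, or -1 if unavailable. */
static long read_vmpte_kb(void)
{
	char line[256];
	long kb = -1;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "VmPTE: %ld kB", &kb) == 1)
			break;
	}
	fclose(f);
	return kb;
}

/*
 * Map len bytes, touch one page in every 2 MiB, then MADV_DONTNEED the range.
 * Stores VmPTE (kB) after the touch and after madvise(). Returns 0 on success.
 */
static int touch_then_dontneed(size_t len, long *touched_kb, long *dontneed_kb)
{
	/* Assumed span of one PTE page: x86-64, 4 KiB pages. */
	const size_t stride = 2UL << 20;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -1;
	for (size_t off = 0; off < len; off += stride)
		p[off] = 1;	/* fault in one page per PTE table */
	*touched_kb = read_vmpte_kb();
	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}
	*dontneed_kb = read_vmpte_kb();
	munmap(p, len);
	return 0;
}
```

On a kernel without this series, one would expect the second VmPTE reading
to stay close to the first, even though the physical pages are gone.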

The following figures are a memory usage snapshot of one process, taken from
one of our servers:

         VIRT:  55t
         RES:   590g
         VmPTE: 110g

As we can see, the PTE page tables take up 110g, while the RES is 590g. In
theory, the process only needs about 1.2g of PTE page tables to map that
physical memory. The reason why the PTE page tables occupy so much memory is
that madvise(MADV_DONTNEED) only empties the PTEs and frees the physical
memory, but doesn't free the PTE page table pages. So we can free those empty
PTE page tables to save memory. In the above case, we can save about 108g
(best case). And the larger the difference between VIRT and RES, the more
memory we save.
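
The figures above can be sanity-checked with a little arithmetic. The sketch
below assumes x86-64 style paging (4 KiB pages and 8-byte PTEs, so one PTE
page holds 512 entries and maps 2 MiB) and densely packed mappings;
pte_table_bytes() is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/*
 * Bytes of PTE page table pages needed to map mapped_bytes, assuming the
 * mappings are densely packed (best case).
 */
static uint64_t pte_table_bytes(uint64_t mapped_bytes, uint64_t page_size,
				uint64_t pte_size)
{
	/* Entries in one PTE page: 512 under the assumptions above. */
	uint64_t entries_per_table = page_size / pte_size;
	/* Address range covered by one PTE page: 2 MiB under the same assumptions. */
	uint64_t span_per_table = entries_per_table * page_size;
	/* Round up: a partially used PTE page still costs a full page. */
	uint64_t tables = (mapped_bytes + span_per_table - 1) / span_per_table;

	return tables * page_size;
}
```

Under these assumptions, 55t of touched VA needs 110g of PTE pages, which
matches the VmPTE above, and 590g of densely mapped memory needs about 1.15g,
in line with the 1.2g estimate.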

In this patch series, we add a pte_ref field to the struct page of a page
table page to track how many users the user PTE page table has. Similar to the
page refcount mechanism, a user of a PTE page table should hold a reference to
it before accessing it. The user PTE page table page may be freed when the
last reference is dropped.

Different from the idea of an earlier patchset of mine[1], the pte_ref
becomes a struct percpu_ref, and we switch it to atomic mode only in cases
such as MADV_DONTNEED and MADV_FREE that may clear the user PTE page table
entries, and then release the user PTE page table page once we see that
pte_ref is 0. The advantage of this is that there is basically no performance
overhead in percpu mode, but we can still free the empty PTE page tables. In
addition, the code of this patchset is much simpler and more portable than
that of the other patchset[1].

Hi David,

I learned from the LWN article[1] that you led a session at LSFMM on
the problems posed by the lack of page-table reclaim (and thank you very
much for mentioning some of my work in this direction). So I would like
to know: what are the community's further plans for this problem?

For the approach of adding a pte_ref to each PTE page table page, I have
posted two versions so far: an atomic count version[2] and a percpu_ref
version (this patchset).

For the atomic count version:
- Advantage: PTE pages can be freed as soon as the reference count drops
             to 0.
- Disadvantage: The addition and subtraction of pte_ref are atomic
                operations, which have a certain performance overhead,
                but should not become a performance bottleneck until the
                mmap_lock contention problem is resolved.

For the percpu_ref version:
- Advantage: In percpu mode, the addition and subtraction of the
             pte_ref are all operations on local cpu variables, so there
             is basically no performance overhead.
- Disadvantage: Need to explicitly convert the pte_ref to atomic mode so
                that the unused PTE pages can be freed.
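
To make the trade-off concrete, here is a single-threaded toy model of a
two-mode reference count. It is only an illustration: struct toy_pte_ref and
the toy_ref_*() helpers are hypothetical names, the model uses no real
atomics or per-CPU data, and it is neither the kernel's struct percpu_ref
nor the code in either patchset.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_NR_CPUS 4	/* hypothetical CPU count for the model */

/* Toy stand-in for a two-mode reference count on one PTE page. */
struct toy_pte_ref {
	long percpu[TOY_NR_CPUS];	/* per-CPU deltas, used in percpu mode */
	long count;			/* shared count, used in atomic mode */
	bool atomic_mode;
};

static void toy_ref_init(struct toy_pte_ref *ref)
{
	memset(ref, 0, sizeof(*ref));
}

/* Take a reference from the given CPU. */
static void toy_ref_get(struct toy_pte_ref *ref, int cpu)
{
	if (ref->atomic_mode)
		ref->count++;		/* would be an atomic RMW in real code */
	else
		ref->percpu[cpu]++;	/* CPU-local, no shared cache line */
}

/* Drop a reference; returns true if the caller may free the PTE page. */
static bool toy_ref_put(struct toy_pte_ref *ref, int cpu)
{
	if (ref->atomic_mode)
		return --ref->count == 0;
	ref->percpu[cpu]--;		/* zero cannot be detected in percpu mode */
	return false;
}

/* Fold the per-CPU deltas into the shared count; true if it is already zero. */
static bool toy_ref_switch_to_atomic(struct toy_pte_ref *ref)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		ref->count += ref->percpu[cpu];
		ref->percpu[cpu] = 0;
	}
	ref->atomic_mode = true;
	return ref->count == 0;
}
```

In this model a put in percpu mode never reports zero, because a single
per-CPU delta says nothing about the total. Only after the explicit switch
can the last put, or the switch itself, find that the page is unused.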

There is still a lot of room to optimize the implementation of both
versions. But before I do further work, I would like to hear your
and the community's views and suggestions on these two versions.

Thanks,
Qi

[1]: https://lwn.net/Articles/893726 (Ways to reclaim unused page-table pages)
[2]: https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@xxxxxxxxxxxxx/

--
Thanks,
Qi