On 19.08.21 05:18, Qi Zheng wrote:
Some malloc libraries (e.g. jemalloc or tcmalloc) usually
allocate large ranges of VAs with mmap() and do not unmap
those VAs. They use madvise(MADV_DONTNEED) to free physical
memory when they want to. But madvise() does not free the
page tables, so a process that touches an enormous virtual
address space can accumulate many page tables.
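The pattern described above can be reproduced in user space with a
small sketch. It assumes Linux with a readable /proc/self/status; the
helper names read_vmpte_kb() and sparse_touch_then_dontneed() are made
up for illustration and are not part of the series. The sketch touches
one byte per stride in a sparse anonymous mapping, drops the pages with
MADV_DONTNEED, and samples VmPTE at both points so the two values can
be compared.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

/* Returns VmPTE in kB from /proc/self/status, or -1 if it cannot be read. */
static long read_vmpte_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        /* Non-matching lines leave kb untouched. */
        if (sscanf(line, "VmPTE: %ld kB", &kb) == 1)
            break;
    }
    fclose(f);
    return kb;
}

/*
 * Hypothetical helper: map `len` bytes, write one byte every `stride`
 * bytes (stride must be > 0), then drop the pages with MADV_DONTNEED.
 * VmPTE is sampled after the touch loop and again after madvise().
 * Returns 0 on success, -1 if mmap() or madvise() fails.
 */
static int sparse_touch_then_dontneed(size_t len, size_t stride,
                                      long *after_touch_kb,
                                      long *after_dontneed_kb)
{
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                            -1, 0);

    if (p == MAP_FAILED)
        return -1;

    for (size_t off = 0; off < len; off += stride)
        p[off] = 1;
    *after_touch_kb = read_vmpte_kb();

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }
    *after_dontneed_kb = read_vmpte_kb();

    munmap(p, len);
    return 0;
}
```

Calling it with, say, a 64 MiB length and a 2 MiB stride and printing
both samples shows whether VmPTE shrinks after madvise() on a given
kernel. The exact numbers depend on the kernel configuration (THP
settings in particular), so none are quoted here.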
The following figures are a memory usage snapshot of one
process, taken on one of our servers:
VIRT: 55t
RES: 590g
VmPTE: 110g
As we can see, the PTE page tables occupy 110g, while the
RES is 590g. In theory, the process only needs 1.2g of PTE page
tables to map that physical memory. The PTE page tables occupy
so much memory because madvise(MADV_DONTNEED) only empties the
PTEs and frees the physical memory, but does not free the PTE
page table pages. So we can free those empty PTE page tables to
save memory. In the above case, we can save about 108g of memory
(best case). And the larger the difference between VIRT and RES,
the more memory we save.
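The 1.2g and 108g figures can be checked with a short calculation.
The sketch below assumes an x86-64 style layout (4 KiB pages, 8-byte
PTEs, hence 512 PTEs per page table page) and resident pages packed
densely into fully used tables. Neither assumption is stated in the
description, and min_pte_table_bytes() is a made-up name.

```c
#include <stdint.h>

/* Assumed layout: 4 KiB pages, 8-byte PTEs, 512 PTEs per table page. */
#define PAGE_SIZE_BYTES 4096ULL
#define PTE_SIZE_BYTES  8ULL

/*
 * Minimum bytes of PTE page tables needed to map res_bytes of resident
 * memory, assuming the resident pages fill their tables completely.
 */
static uint64_t min_pte_table_bytes(uint64_t res_bytes)
{
    uint64_t pages = (res_bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
    uint64_t ptes_per_table = PAGE_SIZE_BYTES / PTE_SIZE_BYTES;
    uint64_t tables = (pages + ptes_per_table - 1) / ptes_per_table;

    return tables * PAGE_SIZE_BYTES;
}
```

Under these assumptions, min_pte_table_bytes(590ULL << 30) is
1237319680 bytes, i.e. about 1.15 GiB, which matches the rounded 1.2g.
Subtracting that from 110g leaves a bit under 109g, consistent with
the "about 108g (best case)" estimate.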
In this patch series, we add a pte_refcount field to the
struct page of a page table to track how many users the PTE
page table has. Similar to the page refcount mechanism, a user
of a PTE page table should hold a refcount on it before
accessing it. The PTE page table page is freed when the last
refcount is dropped.
While we access ->pte_refcount of a PTE page table, any of the
following ensures the stability of the pmd entry corresponding
to that PTE page table:
- mmap_lock
- anon_lock
- i_mmap_lock
- parallel threads are excluded by other means which
can make ->pmd stable (e.g. the gup case)
This patch does not support THP yet; support will be added
in the next patch.
Can you clarify (and document here) who exactly takes a reference on the
page table? Do I understand correctly that
a) each !pte_none() entry inside a page table takes a reference on the
page it's contained in.
b) each page table walker temporarily grabs a page table reference
c) The PMD table the PTE table is referenced from (-> currently only
ever a single one) does *not* take a reference.
So if there are no PTE entries left and nobody walks the page tables,
you can remove it? You should really extend the
description/documentation to make it clearer how exactly it's supposed
to work.
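To make the reading in a) to c) concrete, here is a single-threaded
user-space toy model. It is only a sketch of that interpretation, not
the code from the series: all toy_* names are hypothetical, there are
no atomics or locking, and "freeing" is recorded in a flag so the
state can be inspected afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PTRS_PER_PTE 512

/* Hypothetical stand-in for a PTE page table page. */
struct toy_pte_table {
    uint64_t entries[TOY_PTRS_PER_PTE]; /* 0 plays the role of pte_none() */
    unsigned int refcount;              /* stands in for ->pte_refcount */
    bool freed;                         /* set instead of really freeing */
};

/* b) a walker pins the table before touching it. */
static void toy_pte_get(struct toy_pte_table *t)
{
    assert(!t->freed);
    t->refcount++;
}

/* Dropping the last reference "frees" the table. */
static void toy_pte_put(struct toy_pte_table *t)
{
    assert(t->refcount > 0);
    if (--t->refcount == 0)
        t->freed = true;
}

/* a) installing a non-none entry over a none entry takes a reference. */
static void toy_set_pte(struct toy_pte_table *t, size_t idx, uint64_t val)
{
    assert(idx < TOY_PTRS_PER_PTE && val != 0);
    if (t->entries[idx] == 0)
        toy_pte_get(t);
    t->entries[idx] = val;
}

/* a) clearing a non-none entry drops the reference it held. */
static void toy_clear_pte(struct toy_pte_table *t, size_t idx)
{
    assert(idx < TOY_PTRS_PER_PTE);
    if (t->entries[idx] != 0) {
        t->entries[idx] = 0;
        toy_pte_put(t);
    }
}

/* c) nothing here models the PMD entry, so it holds no reference. */
```

In this model a caller is expected to hold its own toy_pte_get()
reference while calling toy_set_pte() or toy_clear_pte(). With two
entries installed and no walker active the count is 2; once both
entries are cleared and the last walker calls toy_pte_put(), the count
reaches 0 and the table is marked freed.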
It feels kind of strange to not introduce the CONFIG_FREE_USER_PTE
Kconfig option in this patch. At least it took me a while to identify it
in the previous patch.
Maybe you should introduce the empty stubs and use them in a separate
patch, and then have this patch just introduce CONFIG_FREE_USER_PTE
along with the actual refcounting magic inside the !stub implementation.
--
Thanks,
David / dhildenb