Hi David,
On 2024/8/6 22:40, David Hildenbrand wrote:
On 05.08.24 14:55, Qi Zheng wrote:
To achieve high performance, applications now mostly use high-performance
user-mode memory allocators such as jemalloc or tcmalloc. These allocators
use madvise(MADV_DONTNEED or MADV_FREE) to release physical memory, but
neither MADV_DONTNEED nor MADV_FREE releases page table memory, which can
lead to huge page table memory usage.
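One way to observe this from user space is to compare VmPTE in /proc/self/status before and after MADV_DONTNEED. The following is only a minimal sketch: the helper names read_vmpte_kib() and demo_vmpte_after_dontneed() are made up for illustration, and it assumes a Linux system with /proc mounted and 2 MiB of address space per PTE page (4 KiB pages, 512 entries).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Hypothetical helper: this process's VmPTE in KiB, or -1 if unavailable. */
long read_vmpte_kib(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kib = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "VmPTE: %ld kB", &kib) == 1)
            break;
    }
    fclose(f);
    return kib;
}

/*
 * Hypothetical demo: map `len` bytes, touch one page per 2 MiB so that each
 * touch populates a different PTE page (assumed geometry), then call
 * madvise(MADV_DONTNEED). out_kib[] receives VmPTE before touching, after
 * touching, and after MADV_DONTNEED. Returns 0 on success, -1 on failure.
 */
int demo_vmpte_after_dontneed(size_t len, long out_kib[3])
{
    const size_t step = (size_t)2 << 20;
    char *p;

    assert(out_kib != NULL && len >= step);

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED)
        return -1;
#ifdef MADV_NOHUGEPAGE
    /* Best effort: keep THP from mapping the range without PTE pages. */
    madvise(p, len, MADV_NOHUGEPAGE);
#endif
    out_kib[0] = read_vmpte_kib();
    for (size_t off = 0; off < len; off += step)
        p[off] = 1;
    out_kib[1] = read_vmpte_kib();
    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }
    out_kib[2] = read_vmpte_kib();
    munmap(p, len);
    return 0;
}
```

If the behaviour described above holds on the running kernel, the third reading should stay close to the second rather than dropping back to the first. The exact values depend on the kernel version and configuration.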
The following is a memory usage snapshot of one process, taken from one
of our servers:
VIRT: 55t
RES: 590g
VmPTE: 110g
In this case, most of the page table entries are empty. For such a PTE
page where all entries are empty, we can actually free it back to the
system for others to use.
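As a rough sanity check on the snapshot, assume 4 KiB pages and 8-byte PTEs (512 entries per PTE page, as on x86-64; the thread does not state the architecture). Under that assumption, 110g of PTE pages is enough to map about 55t of virtual address space, while 590g of resident memory would need only around 1.15g of PTE pages if it were densely packed. The helper names below are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions (not stated in the thread): 4 KiB pages, 8-byte PTEs. */
#define PAGE_SIZE_BYTES   4096ULL
#define PTE_SIZE_BYTES    8ULL
#define PTRS_PER_PTE_PAGE (PAGE_SIZE_BYTES / PTE_SIZE_BYTES)     /* 512   */
#define PTE_PAGE_COVERAGE (PTRS_PER_PTE_PAGE * PAGE_SIZE_BYTES)  /* 2 MiB */

#define GiB (1ULL << 30)
#define TiB (1ULL << 40)

/* Virtual address space that `pte_bytes` worth of PTE pages can map. */
uint64_t va_covered_by_pte_bytes(uint64_t pte_bytes)
{
    assert(pte_bytes % PAGE_SIZE_BYTES == 0);
    return (pte_bytes / PAGE_SIZE_BYTES) * PTE_PAGE_COVERAGE;
}

/* Fewest PTE-page bytes needed to map `resident_bytes` if densely packed. */
uint64_t min_pte_bytes_for_resident(uint64_t resident_bytes)
{
    uint64_t tables =
        (resident_bytes + PTE_PAGE_COVERAGE - 1) / PTE_PAGE_COVERAGE;

    return tables * PAGE_SIZE_BYTES;
}
```

With these definitions, va_covered_by_pte_bytes(110 * GiB) equals 55 * TiB, which matches the VIRT figure, and min_pte_bytes_for_resident(590 * GiB) is 302080 PTE pages of 4 KiB each. The snapshot values are rounded, so this is an order-of-magnitude check only.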
As a first step, this commit attempts to synchronously free the empty PTE
pages in zap_page_range_single() (MADV_DONTNEED etc. will invoke this). In
order to reduce overhead, we only handle the cases with a high probability
of generating empty PTE pages, and other cases will be filtered out, such
as:
It doesn't make particular sense during munmap() where we will just
remove the page tables manually directly afterwards. We should limit it
to the !munmap case -- in particular MADV_DONTNEED.
munmap directly calls unmap_single_vma() instead of
zap_page_range_single(), so the munmap case has already been excluded
here. On the other hand, if we try to reclaim in zap_pte_range(), we
need to identify the munmap case.
Of course, we could just modify the MADV_DONTNEED case instead of all
the callers of zap_page_range_single(). Perhaps we could add a new
parameter to identify the MADV_DONTNEED case?
To minimize the added overhead, I further suggest to only try reclaim
asynchronously if we know that likely all ptes will be none, that is,
asynchronously? What you probably mean to say is synchronously, right?
when we just zapped *all* ptes of a PTE page table -- our range spans
the complete PTE page table.
Just imagine someone zaps a single PTE: we really don't want to start
scanning page tables and involve a (rather expensive) walk_page_range()
just to find out that there is still something mapped.
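The "range spans the complete PTE page table" condition can be sketched as a simple alignment check. This is a user-space model, not kernel code: the MODEL_* constants assume a PTE page covers 2 MiB (4 KiB pages, 512 entries), and zap_covers_whole_pte_table() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry: one PTE page maps 2 MiB of virtual address space. */
#define MODEL_PMD_SIZE (1ULL << 21)
#define MODEL_PMD_MASK (~(MODEL_PMD_SIZE - 1))

/*
 * Returns true if the zapped range [start, end) covers every entry of the
 * PTE page that maps `addr`, so all of its PTEs are known to be none
 * afterwards and no scan is needed to decide whether to try reclaim.
 */
bool zap_covers_whole_pte_table(uint64_t start, uint64_t end, uint64_t addr)
{
    uint64_t table_start = addr & MODEL_PMD_MASK;

    assert(start <= addr && addr < end);
    /* end > table_start holds here, so the subtraction cannot underflow. */
    return start <= table_start && end - table_start >= MODEL_PMD_SIZE;
}
```

For example, zapping [0x200000, 0x400000) covers the whole PTE page at 0x200000, while zapping a single 4 KiB page inside it does not, so the second case would skip reclaim entirely.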
In the munmap path, we first execute unmap and then reclaim the page
tables:
unmap_vmas
free_pgtables
Therefore, I think doing something similar in zap_page_range_single()
would be more consistent:
unmap_single_vma
try_to_reclaim_pgtables
And I think that the main overhead should be in flushing the TLB and
freeing the pages. Of course, I will do some performance testing to see
the actual impact.
Last but not least, would there be a way to avoid the walk_page_range()
and simply trigger it from zap_pte_range(), possibly still while holding
the PTE table lock?
I've tried doing it that way before, but ultimately I did not choose to
do it that way because of the following reasons:
1. need to identify the munmap case
2. trying to record the count of pte_none() within the original
zap_pte_range() loop is not very convenient. The most convenient
approach is still to loop 512 times to scan the PTE page.
3. still need to release the pte lock, and then re-acquire the pmd lock
and pte lock.
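The "loop 512 times" scan from point 2 can be modelled in user space as follows. This is only a sketch: model_pte_t and model_pte_table_is_empty() are hypothetical names, a zero entry stands in for pte_none(), and 512 entries per PTE page is assumed. Locking, which point 3 is about, is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PTRS_PER_PTE 512  /* assumed entries per PTE page */

typedef uint64_t model_pte_t;   /* stand-in for the kernel's pte_t */

/* Returns true if every entry in the modelled PTE page is empty. */
bool model_pte_table_is_empty(const model_pte_t *table)
{
    assert(table != NULL);
    for (size_t i = 0; i < MODEL_PTRS_PER_PTE; i++) {
        if (table[i] != 0)      /* stands in for !pte_none() */
            return false;       /* something is still mapped */
    }
    return true;
}
```

The scan can stop at the first populated entry, but in the case that matters most (a fully empty PTE page) it has to visit all 512 entries.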
We might have to trylock the PMD, but that should be doable.
Yes, it's doable.
Thanks,
Qi