On 07.08.24 05:58, Qi Zheng wrote:
Hi David,
Really sorry for the slow replies, I'm struggling with a mixture of
public holidays, holiday and too many different discussions (well, and
some stuff I have to finish myself).
On 2024/8/6 22:40, David Hildenbrand wrote:
On 05.08.24 14:55, Qi Zheng wrote:
Now in order to pursue high performance, applications mostly use some
high-performance user-mode memory allocators, such as jemalloc or
tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
release page table memory, which may cause huge page table memory usage.
The following are a memory usage snapshot of one process which actually
happened on our server:
VIRT: 55t
RES: 590g
VmPTE: 110g
In this case, most of the page table entries are empty. For such a PTE
page where all entries are empty, we can actually free it back to the
system for others to use.
As a first step, this commit attempts to synchronously free the empty PTE
pages in zap_page_range_single() (MADV_DONTNEED etc will invoke this). In
order to reduce overhead, we only handle the cases with a high
probability
of generating empty PTE pages, and other cases will be filtered out, such
as:
It doesn't make particular sense during munmap() where we will just
remove the page tables manually directly afterwards. We should limit it
to the !munmap case -- in particular MADV_DONTNEED.
munmap directly calls unmap_single_vma() instead of
zap_page_range_single(), so the munmap case has already been excluded
here. On the other hand, if we try to reclaim in zap_pte_range(), we
need to identify the munmap case.
Of course, we could just modify the MADV_DONTNEED case instead of all
the callers of zap_page_range_single(), perhaps we could add a new
parameter to identify the MADV_DONTNEED case?
See below, zap_details might come in handy.
To minimze the added overhead, I further suggest to only try reclaim
asynchronously if we know that likely all ptes will be none, that is,
asynchronously? What you probably mean to say is synchronously, right?
when we just zapped *all* ptes of a PTE page table -- our range spans
the complete PTE page table.
Just imagine someone zaps a single PTE, we really don't want to start
scanning page tables and involve an (rather expensive) walk_page_range
just to find out that there is still something mapped.
In the munmap path, we first execute unmap and then reclaim the page
tables:
unmap_vmas
free_pgtables
Therefore, I think doing something similar in zap_page_range_single()
would be more consistent:
unmap_single_vma
try_to_reclaim_pgtables
And I think that the main overhead should be in flushing TLB and freeing
the pages. Of course, I will do some performance testing to see the
actual impact.
Last but not least, would there be a way to avoid the walk_page_range()
and simply trigger it from zap_pte_range(), possibly still while holding
the PTE table lock?
I've tried doing it that way before, but ultimately I did not choose to
do it that way because of the following reasons:
I think we really should avoid another page table walk if possible.
1. need to identify the munmap case
We already have "struct zap_details". Maybe we can extend that to
specify what our intention are (either where we come from or whether we
want to try ripping out apge tables directly).
2. trying to record the count of pte_none() within the original
zap_pte_range() loop is not very convenient. The most convenient
approach is still to loop 512 times to scan the PTE page.
Right, the code might need some reshuffling. As we might temporary drop
the PTL (break case), fully relying on everything being pte_none()
doesn't always work.
We could either handle it in zap_pmd_range(), after we processed a full
PMD range. zap_pmd_range() knows for sure whether the full PMD range was
covered, even if multiple zap_pte_range() calls were required.
Or we could indicate to zap_pte_range() the original range. Or we could
make zap_pte_range() simply handle the retrying itself, and not get
called multiple times for a single PMD range.
So the key points are:
(a) zap_pmd_range() should know for sure whether the full range is
covered by the zap.
(b) zap_pte_range() knows whether it left any entries being (IOW, it n
never ran into the "!should_zap_folio" case)
(c) we know whether we temporarily had to drop the PTL and someone might
have converted pte_none() to something else.
Teaching zap_pte_range() to handle a full within-PMD range itself sounds
cleanest.
Then we can handle it fully in zap_pte_range():
(a) if we had to leave entries behind (!pte_none()), no need to try
ripping out the page table.
(b) if we didn't have to drop the PTL, we can remove the page table
without even re-verifying whether the entries are pte_none(). We
know they are. If we had to drop the PTL, we have to re-verify at
least the PTEs that were not zapped in the last iteration.
So there is the chance to avoid pte_none() checks completely, or minimze
them if we had to drop the PTL.
Anything I am missing? Please let me know if anything is unclear.
Reworking the retry logic for zap_pte_range(), to be called for a single
PMD only once is likely the first step.
3. still need to release the pte lock, and then re-acquire the pmd lock
and pte lock.
Yes, if try-locking the PMD fails.
--
Cheers,
David / dhildenb