On 05.08.24 14:55, Qi Zheng wrote:
> Now in order to pursue high performance, applications mostly use some high-performance user-mode memory allocators, such as jemalloc or tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will release page table memory, which may cause huge page table memory usage.
>
> The following are a memory usage snapshot of one process which actually happened on our server:
>
>         VIRT:  55t
>         RES:   590g
>         VmPTE: 110g
>
> In this case, most of the page table entries are empty. For such a PTE page where all entries are empty, we can actually free it back to the system for others to use.
>
> As a first step, this commit attempts to synchronously free the empty PTE pages in zap_page_range_single() (MADV_DONTNEED etc will invoke this). In order to reduce overhead, we only handle the cases with a high probability of generating empty PTE pages, and other cases will be filtered out, such as:
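To put the quoted snapshot in perspective, here is a rough back-of-the-envelope sketch. It assumes x86-64 with 4 KiB base pages (512 eight-byte entries per PTE page, so one PTE page maps 2 MiB) and reads the "t"/"g" suffixes as binary units; the helper names are made up for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: x86-64, 4 KiB base pages, 512 entries per PTE page. */
#define PAGE_SIZE     4096ULL
#define PTRS_PER_PTE  512ULL
#define PTE_PAGE_SPAN (PAGE_SIZE * PTRS_PER_PTE)  /* bytes mapped by one PTE page */
#define GiB           (1ULL << 30)

static_assert(PTE_PAGE_SPAN == (2ULL << 20), "one PTE page maps 2 MiB");

/* Hypothetical helper: address space that pte_bytes of PTE pages can map. */
static uint64_t span_of_pte_memory(uint64_t pte_bytes)
{
    return pte_bytes / PAGE_SIZE * PTE_PAGE_SPAN;
}

/*
 * Hypothetical helper: lower bound on PTE memory for resident_bytes,
 * i.e. what it would cost if every PTE page in use were completely full.
 */
static uint64_t min_pte_memory(uint64_t resident_bytes)
{
    uint64_t tables = (resident_bytes + PTE_PAGE_SPAN - 1) / PTE_PAGE_SPAN;

    return tables * PAGE_SIZE;
}
```

Under those assumptions, span_of_pte_memory(110 * GiB) is 55 TiB, which lines up with VIRT, while min_pte_memory(590 * GiB) is 1180 MiB. So roughly 1% of the PTE pages would be enough for the resident set if they were fully populated.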
It doesn't make particular sense during munmap() where we will just remove the page tables manually directly afterwards. We should limit it to the !munmap case -- in particular MADV_DONTNEED.
To minimize the added overhead, I further suggest to only try reclaim asynchronously if we know that likely all ptes will be none, that is, when we just zapped *all* ptes of a PTE page table -- our range spans the complete PTE page table.
Just imagine someone zaps a single PTE, we really don't want to start scanning page tables and involve a (rather expensive) walk_page_range() just to find out that there is still something mapped.
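The "range spans the complete PTE page table" condition could be checked along these lines. This is only a user-space sketch: the PMD_* macros mirror the kernel names but are redefined here assuming x86-64 with 4 KiB pages, and zap_covers_whole_pte_table() is a hypothetical helper, not an existing kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: x86-64 with 4 KiB pages, so one PTE page table maps 2 MiB. */
#define PMD_SHIFT 21
#define PMD_SIZE  (1ULL << PMD_SHIFT)
#define PMD_MASK  (~(PMD_SIZE - 1))

/*
 * Hypothetical helper: true if [start, end) covers every entry of one
 * PTE page table. The caller is expected to pass a range no larger than
 * what a single PMD entry maps, as a per-table zap loop would.
 */
static bool zap_covers_whole_pte_table(uint64_t start, uint64_t end)
{
    assert(start < end && end - start <= PMD_SIZE);

    /* Start must sit on a PMD boundary and the length must be a full PMD. */
    return (start & ~PMD_MASK) == 0 && end - start == PMD_SIZE;
}
```

A single-PTE zap such as [0x200000, 0x201000) fails the check immediately, so no scan or page table walk would be started for it.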
Last but not least, would there be a way to avoid the walk_page_range() and simply trigger it from zap_pte_range(), possibly still while holding the PTE table lock?
We might have to trylock the PMD, but that should be doable.

--
Cheers,

David / dhildenb
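For illustration, the trylock idea might look roughly like the toy model below. Everything in it is a user-space stand-in (the toy_* types and functions are invented here, and C11 atomics replace the kernel spinlocks); it only shows the control flow of "zap under the PTE lock, then opportunistically take the PMD lock", assuming that blocking on the PMD lock at that point is not an option. It ignores TLB flushing and deferred freeing of the table, which real code would have to handle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define PTRS_PER_PTE 512

/* Toy stand-ins; none of these are kernel types. */
struct toy_pte_table {
    atomic_bool lock;                 /* models the PTE table lock */
    unsigned long pte[PTRS_PER_PTE];  /* 0 models an empty (none) entry */
};

struct toy_pmd {
    atomic_bool lock;                 /* models the PMD lock */
    struct toy_pte_table *table;      /* NULL models an empty PMD entry */
};

static bool toy_trylock(atomic_bool *l)
{
    return !atomic_exchange_explicit(l, true, memory_order_acquire);
}

static void toy_unlock(atomic_bool *l)
{
    atomic_store_explicit(l, false, memory_order_release);
}

/*
 * Zap every entry of the table under its lock, then try to detach the
 * now-empty table from the PMD. Returns true if the table was detached.
 */
static bool toy_zap_and_try_reclaim(struct toy_pmd *pmd)
{
    struct toy_pte_table *t = pmd->table;
    bool reclaimed = false;

    while (!toy_trylock(&t->lock))
        ;  /* spin: models taking the PTE table lock */

    for (size_t i = 0; i < PTRS_PER_PTE; i++)
        t->pte[i] = 0;

    /* Still holding the PTE lock: only try the PMD lock, never wait. */
    if (toy_trylock(&pmd->lock)) {
        pmd->table = NULL;
        reclaimed = true;
        toy_unlock(&pmd->lock);
    }

    toy_unlock(&t->lock);
    return reclaimed;
}
```

If the PMD lock is contended, the function simply leaves the empty table in place, which costs nothing beyond the zap that had to happen anyway.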