On 14.04.23 15:02, Ryan Roberts wrote:
> Hi All,
>
> This is a second RFC and my first proper attempt at implementing variable order, large folios for anonymous memory. The first RFC [1] was a partial implementation and a plea for help in debugging an issue I was hitting; thanks to Yin Fengwei and Matthew Wilcox for their advice in solving that!
>
> The objective of variable order anonymous folios is to improve performance by allocating larger chunks of memory during anonymous page faults:
>
> - Since SW (the kernel) is dealing with larger chunks of memory than base pages, there are efficiency savings to be had; fewer page faults, batched PTE and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel overhead. This should benefit all architectures.
> - Since we are now mapping physically contiguous chunks of memory, we can take advantage of HW TLB compression techniques. A reduction in TLB pressure speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
>
> This patch set deals with the SW side of things only but sets us up nicely for taking advantage of the HW improvements in the near future.
>
> I'm not yet benchmarking a wide variety of use cases, but those that I have looked at are positive; I see kernel compilation time improved by up to 10%, which I expect to improve further once I add in the arm64 "contiguous bit". Memory consumption is somewhere between 1% less and 2% more, depending on how it's measured. More on perf and memory below.
>
> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor conflict resolution). I have a tree at [4].
>
> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@xxxxxxx/
> [2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@xxxxxxx/
> [3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@xxxxxxxxxxxxx/
> [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
>
> Approach
> ========
>
> There are 4 fault paths that have been modified:
>
> - write fault on unallocated address: do_anonymous_page()
> - write fault on zero page: wp_page_copy()
> - write fault on non-exclusive CoW page: wp_page_copy()
> - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
>
> In the first 2 cases, we will determine the preferred order folio to allocate, limited by a max order (currently order-4; see below), VMA and PMD bounds, and state of neighboring PTEs. In the 3rd case, we aim to allocate the same order folio as the source, subject to constraints that may arise if the source has been mremapped or partially munmapped. And in the 4th case, we reuse as much of the folio as we can, subject to the same mremap/munmap constraints.
Just a note (which you may already know) that we have to be a bit careful in the wp_copy path when replacing sub-pages that are marked exclusive.
Currently, we only ever replace a single shared anon (sub)page with a fresh exclusive base page during a write fault/unsharing. As the sub-page is already marked "maybe shared", it cannot get pinned concurrently, and everybody is happy.
If you now decide to replace more sub-pages, you have to be careful that none of them are still exclusive -- because they could get pinned concurrently and replacing them would result in memory corruption.
There are scenarios (most prominently MADV_WIPEONFORK, but also a failed partial fork()) that could result in something like that.
Further, we have to be a bit careful regarding replacing ranges that are backed by different anon pages (for example, due to fork() deciding to copy some sub-pages of a PTE-mapped folio instead of sharing all sub-pages).
So what should be safe is replacing, under the PT lock, all sub-pages of a folio that are marked "maybe shared" with a new folio. However, I wonder whether it's really worth the complexity. For THP we have so far been happy *not* to optimize this, which implies that maybe we shouldn't worry that heavily about optimizing the fork() case for now.
One optimization one could think of instead (which I raised previously in another context) is the detection of exclusivity after fork()+exit in the child (IOW, only the parent continues to exist). Once PG_anon_exclusive has been cleared for all sub-pages of the PTE-mapped folio during fork(), we'd always decide to copy instead of reuse (because page_count() > 1, as the folio is PTE-mapped). Scanning the surrounding page table when it makes sense (e.g., page_count() <= folio_nr_pages()), to test whether all page references come from the current process, would allow reusing the folio (setting PG_anon_exclusive) for the sub-pages. The smaller the folio order, the cheaper this "scan surrounding PTEs" pass is. For THP, which is usually PMD-mapped even after fork()+exit, we didn't add this optimization.
-- Thanks, David / dhildenb