This patch series is about to haunt me in my dreams, and the time spent
on this is ridiculous ... anyhow, here goes a new version that now also
supports 32bit (I wish we would have dynamically allocated "struct
folio" already ...) and always enables the new tracking with
CONFIG_TRANSPARENT_HUGEPAGE.

Let's add an "easy" way to decide -- without false positives, without
page-mapcounts and without page table/rmap scanning -- whether a large
folio is "certainly mapped exclusively" into a single MM, or whether it
"maybe mapped shared" into multiple MMs.

Use that information to implement Copy-on-Write reuse, to convert
folio_likely_mapped_shared() to folio_maybe_mapped_shared(), and to
introduce a kernel config option that lets us not use+maintain
per-page mapcounts in large folios anymore.

The bigger picture was presented at LSF/MM [1].

This series is effectively a follow-up on my early work [2], which
implemented a more precise, but also more complicated, way to identify
whether a large folio is "mapped shared" into multiple MMs or "mapped
exclusively" into a single MM.

1 Patch Organization
====================

Patch #1 -> #6: make more room in order-1 folios, so we have two
"unsigned long" available for our purposes

Patch #7 -> #11: preparations

Patch #12: MM owner tracking for large folios

Patch #13: COW reuse for PTE-mapped anon THP

Patch #14: folio_maybe_mapped_shared()

Patch #15 -> #20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT

2 MM owner tracking
===================

We assign each MM a unique ID ("MM ID"), to be able to squeeze more
information into our folios. On 32bit we use 15-bit IDs, on 64bit we
use 31-bit IDs.

For each large folio, we now store two MM-ID+mapcount ("slot")
combinations:
* mm0_id + mm0_mapcount
* mm1_id + mm1_mapcount

On 32bit we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit
mapcount. This way, we require 2x "unsigned long" on both 32bit and
64bit for the two slots.

Paired with the large mapcount, we can reliably identify whether one of
these MMs is the current owner (-> owns all mappings) or even holds all
folio references (-> owns all mappings, and all references are from
mappings).

As long as only two MMs map folio pages at a time, we can reliably and
precisely identify whether a large folio is "mapped shared" or "mapped
exclusively".

Any additional MM that starts mapping the folio while there are no free
slots becomes an "untracked MM". If one such "untracked MM" is the last
one mapping a folio exclusively, we will *not* detect the folio as
"mapped exclusively" but instead as "maybe mapped shared" (exception:
only a single mapping remains).

So that's where the approach gets imprecise.

For now, we use a bit-spinlock to sync the large mapcount + slots, and
we make sure to keep the machinery fast so as not to degrade (un)map
performance drastically: for example, we only use a single atomic (when
grabbing the bit-spinlock), like we would already perform when updating
the large mapcount.

3 CONFIG_NO_PAGE_MAPCOUNT
=========================

Patches #15 -> #20 spell out and document what exactly is affected when
not maintaining the per-page mapcounts in large folios anymore.

Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore
when (un)mapping pages, we'll account a complete folio as mapped if a
single page is mapped. In addition, we'll not detect partially mapped
anonymous folios as such in all cases yet.
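To sketch that accounting change (the helper below is hypothetical and
heavily simplified; the actual patches are more involved): without
folio->_nr_pages_mapped, we can only detect the large mapcount
transitioning between zero and non-zero, and have to adjust the
"mapped" statistics by folio_nr_pages() at that point:

	#include <linux/mm.h>
	#include <linux/vmstat.h>

	/*
	 * Sketch only, not the actual kernel code; "_large_mapcount" is
	 * simplified here (the real field is maintained differently).
	 * Without per-page mapcounts, the complete folio is accounted
	 * as mapped as soon as a single page of it is mapped.
	 */
	static void folio_account_mapped(struct folio *folio, int nr_mappings)
	{
		/* Did the large mapcount just transition from 0 to non-zero? */
		if (atomic_add_return(nr_mappings, &folio->_large_mapcount) ==
		    nr_mappings)
			__mod_node_page_state(folio_pgdat(folio), NR_ANON_MAPPED,
					      folio_nr_pages(folio));
	}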
Likely less relevant changes include that we might now under-estimate
the USS (Unique Set Size) of a process, but never over-estimate it.

The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point,
to then slowly make it the only option, as we learn about real-life
impacts and possible ways to mitigate them.

4 Performance
=============

Detailed performance numbers were included in v1 [4], and not that much
changed between v1 and v2.

I did plenty of measurements on different systems in the meantime that
all revealed slightly different results.

The pte-mapped-folio micro-benchmarks are fairly sensitive to code
layout changes on some systems. Especially the fork() benchmark started
being more-shaky-than-before on recent kernels for some reason.

In summary, with my micro-benchmarks:

* Small folios are not impacted.

* CoW performance seems to be mostly unchanged across all folio sizes.

* CoW reuse performance of large folios now matches CoW reuse
  performance of small folios, because we now actually implement the
  CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a
  ~65% reduction in runtime, on an arm64 system I measured ~54%
  reduction.

* munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw
  double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R
  and up to ~70% on an AmpereOne A192-32X) with larger folios. The
  larger the folios, the larger the performance improvement.

* munmap() performance very slightly (couple percent) degrades without
  CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there
  seems to be no change at all.

* fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw
  double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R
  and up to ~10% on an AmpereOne A192-32X) with larger folios. The
  larger the folios, the larger the performance improvement.

* While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be
  almost unchanged on some systems, I saw some degradation for smaller
  folios on the AmpereOne A192-32X. I did not investigate the details
  yet, but I suspect code layout changes or suboptimal code placement /
  inlining. I'm not too worried about the fork() micro-benchmarks for
  smaller folios given how shaky the results are lately and by how much
  we improved fork() performance recently.

I also ran case-anon-cow-rand and case-anon-cow-seq, part of
vm-scalability, to assess the scalability and the impact of the
bit-spinlock. My measurements on a 2-socket system with 10-core Intel
Xeon Silver 4210R CPUs revealed no significant changes.

Similarly, running these benchmarks with 2 MiB THPs enabled on the
AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1%
stdev, which is nice.

So far, I did not get my hands on a similarly large system with
multiple sockets. I found no other fitting scalability benchmarks that
seem to really hammer on concurrent mapping/unmapping of large folio
pages like case-anon-cow-seq does.

5 Concerns
==========

5.1 Bit spinlock
----------------

I'm not quite happy about the bit-spinlock, but so far it does not seem
to affect scalability in my measurements.

If it ever becomes a problem, we could either investigate improving the
locking, or simply stop the MM tracking once there are "too many
mappings" and assume that the folio is "mapped shared" until it was
freed. This would be similar (but slightly different) to the
"0,1,2,stopped" counting idea Willy had at some point. Adding that
logic to "stop tracking" adds more code to the hot path, so I avoided
that for now.
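To illustrate the "single atomic" property (the field name "_mm_ids"
and the bit number are assumptions for illustration, not necessarily
what the patches use): one bit of the "unsigned long" that holds the
MM IDs can serve as the lock, so taking the lock is a single atomic
read-modify-write, just like a plain atomic update of the large
mapcount would be. This is also why the series makes the bit_spinlock
helpers __always_inline:

	#include <linux/bit_spinlock.h>

	/* Illustrative assumption: which word/bit is used may differ. */
	#define FOLIO_MM_IDS_LOCK_BITNUM	62

	static __always_inline void folio_lock_large_mapcount(struct folio *folio)
	{
		/* One atomic RMW, comparable to an atomic mapcount update. */
		bit_spin_lock(FOLIO_MM_IDS_LOCK_BITNUM, &folio->_mm_ids);
	}

	static __always_inline void folio_unlock_large_mapcount(struct folio *folio)
	{
		bit_spin_unlock(FOLIO_MM_IDS_LOCK_BITNUM, &folio->_mm_ids);
	}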
5.2 folio_maybe_mapped_shared()
-------------------------------

I documented the change from folio_likely_mapped_shared() to
folio_maybe_mapped_shared() quite extensively. If we run into
surprises, I have some ideas on how to resolve them. For now, I think
we should be fine.

5.3 Added code to map/unmap hot path
------------------------------------

So far, it looks like the added code on the rmap hot path does not
really seem to matter much in the bigger picture. I'd like to further
reduce it (and possibly improve fork() performance further), but I
don't easily see how right now. Well, and I am out of puff :)

Having said that, the alternatives I considered (e.g., per-MM per-folio
mapcount) would add a lot more overhead to these hot paths.

6 Future Work
=============

6.1 Large mapcount
------------------

It would be very handy if the large mapcount would count how often
folio pages are actually mapped into page tables: a PMD on x86-64 would
count 512 times. Calculating the average per-page mapcount will be
easy, and remapping (PMD->PTE) folios would get even faster.

That would also remove the need for the entire mapcount (except for
PMD-sized folios for memory statistics reasons ...), and allow for
mapping folios larger than PMDs (e.g., 4 MiB) easily.

The downside is that we would maybe also have to take the same number
of folio references to make our folio_mapcount() == folio_ref_count()
work. I think it should be possible (user space could trigger many PTE
mappings already), but we'd have to be a bit more careful about
possible mapcount/refcount overflows (e.g., fail mapping a folio if the
mapcount exceeds a certain threshold). Maybe some day we'll have a
64bit refcount+mapcount.
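The average per-page mapcount mentioned above would then boil down to a
single shift (a sketch with a hypothetical helper, under the assumption
of such per-page-table-mapping counting, which this series does not
implement):

	/*
	 * Sketch only: assumes the large mapcount counted individual
	 * page table mappings, e.g., a PMD mapping contributing 512 on
	 * x86-64. That counting is future work, not part of this series.
	 */
	static inline int folio_average_page_mapcount(struct folio *folio)
	{
		/* Divide by folio_nr_pages() == 1 << folio_large_order(). */
		return folio_large_mapcount(folio) >> folio_large_order(folio);
	}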
Howlett" <Liam.Howlett@xxxxxxxxxx> Cc: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> Cc: Vlastimil Babka <vbabka@xxxxxxx> Cc: Jann Horn <jannh@xxxxxxxxxx> David Hildenbrand (20): mm: factor out large folio handling from folio_order() into folio_large_order() mm: factor out large folio handling from folio_nr_pages() into folio_large_nr_pages() mm: let _folio_nr_pages overlay memcg_data in first tail page mm: move hugetlb specific things in folio to page[3] mm: move _pincount in folio to page[2] on 32bit mm: move _entire_mapcount in folio to page[2] on 32bit mm/rmap: pass dst_vma to folio_dup_file_rmap_pte() and friends mm/rmap: pass vma to __folio_add_rmap() mm/rmap: abstract large mapcount operations for large folios (!hugetlb) bit_spinlock: __always_inline (un)lock functions mm/rmap: use folio_large_nr_pages() in add/remove functions mm/rmap: basic MM owner tracking for large folios (!hugetlb) mm: Copy-on-Write (COW) reuse support for PTE-mapped THP mm: convert folio_likely_mapped_shared() to folio_maybe_mapped_shared() mm: CONFIG_NO_PAGE_MAPCOUNT to prepare for not maintain per-page mapcounts in large folios fs/proc/page: remove per-page mapcount dependency for /proc/kpagecount (CONFIG_NO_PAGE_MAPCOUNT) fs/proc/task_mmu: remove per-page mapcount dependency for PM_MMAP_EXCLUSIVE (CONFIG_NO_PAGE_MAPCOUNT) fs/proc/task_mmu: remove per-page mapcount dependency for "mapmax" (CONFIG_NO_PAGE_MAPCOUNT) fs/proc/task_mmu: remove per-page mapcount dependency for smaps/smaps_rollup (CONFIG_NO_PAGE_MAPCOUNT) mm: stop maintaining the per-page mapcount of large folios (CONFIG_NO_PAGE_MAPCOUNT) .../admin-guide/cgroup-v1/memory.rst | 4 + Documentation/admin-guide/cgroup-v2.rst | 10 +- Documentation/admin-guide/mm/pagemap.rst | 16 +- Documentation/filesystems/proc.rst | 28 +- Documentation/mm/transhuge.rst | 39 ++- fs/proc/internal.h | 39 +++ fs/proc/page.c | 19 +- fs/proc/task_mmu.c | 39 ++- include/linux/bit_spinlock.h | 8 +- include/linux/mm.h | 93 +++--- include/linux/mm_types.h | 110 +++++-- include/linux/page-flags.h | 4 + include/linux/rmap.h | 270 ++++++++++++++++-- kernel/fork.c | 36 +++ mm/Kconfig | 22 ++ mm/debug.c | 10 +- mm/gup.c | 8 +- mm/huge_memory.c | 20 +- mm/hugetlb.c | 1 - mm/internal.h | 20 +- mm/khugepaged.c | 8 +- mm/madvise.c | 6 +- mm/memory.c | 96 ++++++- mm/mempolicy.c | 8 +- mm/migrate.c | 7 +- mm/mprotect.c | 2 +- mm/page_alloc.c | 50 +++- mm/page_owner.c | 2 +- mm/rmap.c | 118 ++++++-- 29 files changed, 903 insertions(+), 190 deletions(-) base-commit: f7ed46277aaa8f848f18959ff68469f5186ba87c -- 2.48.1