On 29.08.24 18:56, David Hildenbrand wrote:
RMAP overhaul and optimizations, PTE batching, large mapcount, folio_likely_mapped_shared() introduction and optimizations, page_mapcount cleanups and preparations ... it's been quite some work to get to this point.

Next up is being able to identify -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "mapped exclusively" into a single MM, and using that information to implement Copy-on-Write reuse and to improve folio_likely_mapped_shared() for large folios.

... and based on that, finally introducing a kernel config option that lets us not use+maintain per-page mapcounts in large folios, improving performance of (un)map operations today, taking one step towards supporting large folios > PMD_SIZE, and preparing for the bright future where we might no longer have a mapcount per page at all.

The bigger picture was presented at LSF/MM [1].

This series is effectively a follow-up on my early work from last year [2], which proposed a precise way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. While that advanced approach has been simplified and optimized in the meantime, let's start with something simpler first -- "certainly mapped exclusively" vs. "maybe mapped shared" -- so we can start learning about the effects and TODOs that some of the implied changes of losing per-page mapcounts have.

I have plans to replace the simple approach used in this series with the advanced approach at some point, but one important thing to learn is whether the imprecision in the simple approach is relevant in practice.

64BIT only, and unless enabled in kconfig, this series should for now not have any impact.
1) Patch Organization
=====================

Patch #1 -> #4: make more room on 64BIT in order-1 folios
Patch #5 -> #7: prepare for MM owner tracking of large folios
Patch #8: implement a simple MM owner tracking approach for large folios
Patch #9: simple optimization
Patch #10: COW reuse for PTE-mapped anon THP
Patch #11 -> #17: introduce and implement CONFIG_NO_PAGE_MAPCOUNT

2) MM owner tracking
====================

Similar to my advanced approach [2], we assign each MM a unique 20-bit ID ("MM ID"), to be able to squeeze more information into our folios. Each large folio can store two MM-ID+mapcount combinations:

* mm0_id + mm0_mapcount
* mm1_id + mm1_mapcount

Combined with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings).

Stored MM IDs can only change if the corresponding mapcount is logically 0, and if the folio is currently "mapped exclusively". As long as only two MMs map folio pages at a time, we can reliably identify whether a large folio is "mapped shared" or "mapped exclusively", and the approach is precise.

Any MM that maps the folio while two other MMs are already mapping it will lead to a "mapped shared" detection, even after all other MMs stopped mapping the folio and it is actually "mapped exclusively": we can have false positives but never false negatives when detecting "mapped shared". That's where the approach gets imprecise.

For now, we use a bit-spinlock to sync the large mapcount + MM IDs + MM mapcounts, and make sure to keep the machinery fast so as not to degrade (un)map performance too much: for example, we make sure to only use a single atomic operation (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. In the future, we might be able to use an arch_spin_lock(), but that's future work.
3) CONFIG_NO_PAGE_MAPCOUNT
==========================

Patch #11 -> #17 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. For example, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. As another example, we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it.

With a more elaborate approach for MM-owner tracking like [2], some things could be improved (e.g., USS to some degree), but some things just cannot be handled like we used to without these per-page mapcounts (e.g., folio->_nr_pages_mapped).

4) Performance
==============

The following kernel config combinations are possible:

* Base: CONFIG_PAGE_MAPCOUNT
  -> (existing) page-mapcount tracking
* MM-ID: CONFIG_MM_ID && CONFIG_PAGE_MAPCOUNT
  -> page-mapcount + MM-ID tracking
* No-Mapcount: CONFIG_MM_ID && CONFIG_NO_PAGE_MAPCOUNT
  -> MM-ID tracking only

I ran my PTE-mapped-THP microbenchmarks [3] and vm-scalability on a machine with two NUMA nodes, each with a 10-core Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz and 16 GiB of memory.

4.1) PTE-mapped-THP microbenchmarks
-----------------------------------

All benchmarks allocate 1 GiB of THPs of a given size, to then fork()/munmap()/... PMD-sized THPs are mapped by PTEs first. Numbers are the increase (+) / reduction (-) in runtime relative to "Base"; reduction (-) is good.

munmap: munmap() the allocated memory.

Folio Size | MM-ID | No-Mapcount
--------------------------------
  16 KiB   |  2 %  |    -8 %
  32 KiB   |  3 %  |    -9 %
  64 KiB   |  4 %  |   -16 %
 128 KiB   |  3 %  |   -17 %
 256 KiB   |  1 %  |   -23 %
 512 KiB   |  1 %  |   -26 %
1024 KiB   |  0 %  |   -29 %
2048 KiB   |  0 %  |   -31 %

-> 32-128 KiB with MM-ID are a bit unexpected: we would expect to see the worst case with the smallest size (16 KiB). But for these sizes the STDEV is also between 1 % and 2 %, in contrast to the others (< 1 %).
Maybe some weird interaction with PCP/buddy.

fork: fork().

Folio Size | MM-ID | No-Mapcount
--------------------------------
  16 KiB   |  4 %  |    -9 %
  32 KiB   |  1 %  |   -12 %
  64 KiB   |  0 %  |   -15 %
 128 KiB   |  0 %  |   -15 %
 256 KiB   |  0 %  |   -16 %
 512 KiB   |  0 %  |   -16 %
1024 KiB   |  0 %  |   -17 %
2048 KiB   | -1 %  |   -21 %

-> Slight slowdown with MM-ID for the smallest folio size (more what we expect, in contrast to munmap()).

cow-byte: fork() and keep the child running. Write one byte to each individual page, measuring the duration of all writes.

Folio Size | MM-ID | No-Mapcount
--------------------------------
  16 KiB   |  0 %  |    0 %
  32 KiB   |  0 %  |    0 %
  64 KiB   |  0 %  |    0 %
 128 KiB   |  0 %  |    0 %
 256 KiB   |  0 %  |    0 %
 512 KiB   |  0 %  |    0 %
1024 KiB   |  0 %  |    0 %
2048 KiB   |  0 %  |    0 %

-> All other overhead dominates, even when effectively unmapping single pages of large folios while replacing them by a copy during write faults. No change, which is great!

reuse-byte: fork() and wait until the child quit. Write one byte to each individual page, measuring the duration of all writes.

Folio Size | MM-ID | No-Mapcount
--------------------------------
  16 KiB   | -66 % |   -66 %
  32 KiB   | -65 % |   -65 %
  64 KiB   | -64 % |   -64 %
 128 KiB   | -64 % |   -64 %
 256 KiB   | -64 % |   -64 %
 512 KiB   | -64 % |   -64 %
1024 KiB   | -64 % |   -64 %
2048 KiB   | -64 % |   -64 %

-> No surprise: we reuse all pages instead of copying them.

child-reuse-byte: fork() and unmap the memory in the parent. Write one byte to each individual page in the child, measuring the duration of all writes.

Folio Size | MM-ID | No-Mapcount
--------------------------------
  16 KiB   | -66 % |   -66 %
  32 KiB   | -65 % |   -65 %
  64 KiB   | -64 % |   -64 %
 128 KiB   | -64 % |   -64 %
 256 KiB   | -64 % |   -64 %
 512 KiB   | -64 % |   -64 %
1024 KiB   | -64 % |   -64 %
2048 KiB   | -64 % |   -64 %

-> Same thing: we reuse all pages instead of copying them.

For 4 KiB, there is no change in any benchmark, as expected.

4.2) vm-scalability
-------------------

For now I only ran anon COW tests.
I use 1 GiB per child process and one child per core (-> 20).

case-anon-cow-rand: random writes

There is effectively no change (< 0.6 % throughput difference).

case-anon-cow-seq: sequential writes

MM-ID has up to 2 % *lower* throughput than Base, not really correlating with folio size. The difference is almost as large as the STDEV (1 % - 2 %), though. It looks like there is a very slight effective slowdown.

No-Mapcount has up to 3 % *higher* throughput than Base, not really correlating with folio size. However, also here the difference is almost as large as the STDEV (up to 2 %). It looks like there is a very slight effective speedup.

In summary, no earth-shattering slowdown with MM-ID (and we just recently optimized folio->_nr_pages_mapped to give us some speedup :) ), and another nice improvement with No-Mapcount.

I did a bunch of cross-compiles, and the build bots turned out very helpful over the last months. I did quite some testing with LTP and selftests, but x86-64 only.
Gentle ping. I might soon have capacity to continue working on this. If there is no further feedback I'll rebase and resend.
-- Cheers, David / dhildenb