On 29.08.24 18:56, David Hildenbrand wrote:
RMAP overhaul and optimizations, PTE batching, large mapcount, folio_likely_mapped_shared() introduction and optimizations, page_mapcount cleanups and preparations ... it's been quite some work to get to this point.

Next up is being able to identify -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "mapped exclusively" into a single MM, and using that information to implement Copy-on-Write reuse and to improve folio_likely_mapped_shared() for large folios.

... and based on that, finally introducing a kernel config option that lets us not use+maintain per-page mapcounts in large folios, improving performance of (un)map operations today, taking one step towards supporting large folios > PMD_SIZE, and preparing for the bright future where we might no longer have a mapcount per page at all.

The bigger picture was presented at LSF/MM [1].

This series is effectively a follow-up on my early work from last year [2], which proposed a precise way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. While that advanced approach has been simplified and optimized in the meantime, let's start with something simpler first -- "certainly mapped exclusively" vs. "maybe mapped shared" -- so we can start learning about the effects and TODOs that some of the implied changes of losing per-page mapcounts have.

I have plans to replace the simple approach used in this series with the advanced approach at some point, but one important thing to learn is whether the imprecision in the simple approach is relevant in practice.

64BIT only, and unless enabled in kconfig, this series should for now not have any impact.
1) Patch Organization
=====================

Patch #1 -> #4: make more room on 64BIT in order-1 folios
Patch #5 -> #7: prepare for MM owner tracking of large folios
Patch #8: implement a simple MM owner tracking approach for large folios
Patch #9: simple optimization
Patch #10: COW reuse for PTE-mapped anon THP
Patch #11 -> #17: introduce and implement CONFIG_NO_PAGE_MAPCOUNT

2) MM owner tracking
====================

Similar to my advanced approach [2], we assign each MM a unique 20-bit ID ("MM ID"), to be able to squeeze more information into our folios. Each large folio can store two MM-ID+mapcount combinations:

* mm0_id + mm0_mapcount
* mm1_id + mm1_mapcount

Combined with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings).

Stored MM IDs can only change if the corresponding mapcount is logically 0, and if the folio is currently "mapped exclusively". As long as only two MMs map folio pages at a time, we can reliably identify whether a large folio is "mapped shared" or "mapped exclusively", and the approach is precise.

Any MM that maps the folio while two other MMs are already mapping it will lead to a "mapped shared" detection, even after all other MMs stopped mapping the folio and it is actually "mapped exclusively": we can have false positives but never false negatives when detecting "mapped shared". That's where the approach gets imprecise.

For now, we use a bit-spinlock to sync the large mapcount + MM IDs + MM mapcounts, and make sure to keep the machinery fast so as not to degrade (un)map performance too much: for example, we make sure to only use a single atomic operation (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. In the future, we might be able to use an arch_spin_lock(), but that's future work.
3) CONFIG_NO_PAGE_MAPCOUNT
==========================

Patch #11 -> #17 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. For example, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. As another example, we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it.

With a more elaborate approach for MM-owner tracking like [2], some things could be improved (e.g., USS to some degree), but some things just cannot be handled like we used to without these per-page mapcounts (e.g., folio->_nr_pages_mapped).

4) Performance
==============

The following kernel config combinations are possible:

* Base: CONFIG_PAGE_MAPCOUNT
  -> (existing) page-mapcount tracking
* MM-ID: CONFIG_MM_ID && CONFIG_PAGE_MAPCOUNT
  -> page-mapcount + MM-ID tracking
* No-Mapcount: CONFIG_MM_ID && CONFIG_NO_PAGE_MAPCOUNT
  -> MM-ID tracking only

I ran my PTE-mapped-THP microbenchmarks [3] and vm-scalability on a machine with two NUMA nodes, each with a 10-core Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz and 16 GiB of memory.

4.1) PTE-mapped-THP microbenchmarks
-----------------------------------

All benchmarks allocate 1 GiB of THPs of a given size, to then fork()/munmap()/... PMD-sized THPs are mapped by PTEs first. Numbers are the increase (+) / reduction (-) in runtime relative to "Base"; reduction (-) is good.

munmap: munmap() the allocated memory.

Folio Size | MM-ID | No-Mapcount
--------------------------------
  16 KiB   |  2 %  |    -8 %
  32 KiB   |  3 %  |    -9 %
  64 KiB   |  4 %  |   -16 %
 128 KiB   |  3 %  |   -17 %
 256 KiB   |  1 %  |   -23 %
 512 KiB   |  1 %  |   -26 %
1024 KiB   |  0 %  |   -29 %
2048 KiB   |  0 %  |   -31 %

-> 32-128 KiB with MM-ID are a bit unexpected: we would expect to see the worst case with the smallest size (16 KiB). But for these sizes the STDEV is also between 1 % and 2 %, in contrast to the others (< 1 %).
Maybe some weird interaction with PCP/buddy.

fork: fork().

Folio Size | MM-ID | No-Mapcount
--------------------------------
  16 KiB   |  4 %  |    -9 %
  32 KiB   |  1 %  |   -12 %
  64 KiB   |  0 %  |   -15 %
 128 KiB   |  0 %  |   -15 %
 256 KiB   |  0 %  |   -16 %
 512 KiB   |  0 %  |   -16 %
1024 KiB   |  0 %  |   -17 %
2048 KiB   | -1 %  |   -21 %

-> Slight slowdown with MM-ID for the smallest folio size (more what we expect, in contrast to munmap()).

cow-byte: fork() and keep the child running. Write one byte to each individual page, measuring the duration of all writes.

Folio Size | MM-ID | No-Mapcount
--------------------------------
  16 KiB   |  0 %  |    0 %
  32 KiB   |  0 %  |    0 %
  64 KiB   |  0 %  |    0 %
 128 KiB   |  0 %  |    0 %
 256 KiB   |  0 %  |    0 %
 512 KiB   |  0 %  |    0 %
1024 KiB   |  0 %  |    0 %
2048 KiB   |  0 %  |    0 %

-> All other overhead dominates, even when effectively unmapping single pages of large folios while replacing them by a copy during write faults. No change, which is great!

reuse-byte: fork() and wait until the child quit. Write one byte to each individual page, measuring the duration of all writes.

Folio Size | MM-ID | No-Mapcount
--------------------------------
  16 KiB   | -66 % |   -66 %
  32 KiB   | -65 % |   -65 %
  64 KiB   | -64 % |   -64 %
 128 KiB   | -64 % |   -64 %
 256 KiB   | -64 % |   -64 %
 512 KiB   | -64 % |   -64 %
1024 KiB   | -64 % |   -64 %
2048 KiB   | -64 % |   -64 %

-> No surprise: we reuse all pages instead of copying them.

child-reuse-byte: fork() and unmap the memory in the parent. Write one byte to each individual page in the child, measuring the duration of all writes.

Folio Size | MM-ID | No-Mapcount
--------------------------------
  16 KiB   | -66 % |   -66 %
  32 KiB   | -65 % |   -65 %
  64 KiB   | -64 % |   -64 %
 128 KiB   | -64 % |   -64 %
 256 KiB   | -64 % |   -64 %
 512 KiB   | -64 % |   -64 %
1024 KiB   | -64 % |   -64 %
2048 KiB   | -64 % |   -64 %

-> Same thing: we reuse all pages instead of copying them.

For 4 KiB, there is no change in any benchmark, as expected.

4.2) vm-scalability
-------------------

For now I only ran anon COW tests.
I use 1 GiB per child process and one child per core (-> 20).

case-anon-cow-rand: random writes

There is effectively no change (< 0.6 % throughput difference).

case-anon-cow-seq: sequential writes

MM-ID has up to 2 % *lower* throughput than Base, not really correlating with folio size. The difference is almost as large as the STDEV (1 % - 2 %), though. It looks like there is a very slight effective slowdown.

No-Mapcount has up to 3 % *higher* throughput than Base, not really correlating with folio size. However, also here the difference is almost as large as the STDEV (up to 2 %). It looks like there is a very slight effective speedup.

In summary, no earth-shattering slowdown with MM-ID (and we just recently optimized folio->_nr_pages_mapped to give us some speedup :) ), and another nice improvement with No-Mapcount.

I did a bunch of cross-compiles, and the build bots turned out very helpful over the last months. I did quite some testing with LTP and selftests, but x86-64 only.
Gentle ping. I might soon have capacity to continue working on this. If there is no further feedback I'll rebase and resend.
-- Cheers, David / dhildenb