Re: [PATCH v1] mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared()

David Hildenbrand <david@xxxxxxxxxx> · Wed, 24 Apr 2024 18:36:32 +0200

On 24.04.24 18:28, Yang Shi wrote:
On Wed, Apr 24, 2024 at 5:26 AM David Hildenbrand <david@xxxxxxxxxx> wrote:

We want to limit the use of page_mapcount() to places where absolutely
required, to prepare for kernel configs where we won't keep track of
per-page mapcounts in large folios.

khugepaged is one of the remaining "more challenging" page_mapcount()
users, but we might be able to move away from page_mapcount() without
resulting in a significant behavior change that would warrant
special-casing based on kernel configs.

In 2020, we first added support to khugepaged for collapsing COW-shared
pages via commit 9445689f3b61 ("khugepaged: allow to collapse a page shared
across fork"), followed by support for collapsing PTE-mapped THP in commit
5503fbf2b0b8 ("khugepaged: allow to collapse PTE-mapped compound pages")
and limiting the memory waste via the "page_count() > 1" check in commit
71a2c112a0f6 ("khugepaged: introduce 'max_ptes_shared' tunable").

As a default, khugepaged will allow up to half of the PTEs to map shared
pages: where page_mapcount() > 1. MADV_COLLAPSE ignores the khugepaged
setting.
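
To make the limit concrete, here is a minimal userspace sketch of the
per-PTE check. It assumes 4 KiB base pages and 2 MiB THPs (512 PTEs per PMD
range, so the default limit would be 256). scan_allows_collapse(),
PTES_PER_PMD and the mapcount array are illustrative names only; this is a
model of the logic described above, not the khugepaged code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4 KiB pages and 2 MiB THPs, i.e. 512 PTEs per PMD range. */
#define PTES_PER_PMD 512
/* Default described above: up to half of the PTEs may map shared pages. */
#define DEFAULT_MAX_PTES_SHARED (PTES_PER_PMD / 2)

/*
 * Hypothetical model: mapcount[i] is the number of page table mappings of
 * the page behind PTE i (0 means the PTE is not populated).
 * Returns true if the number of shared PTEs stays within the limit.
 * ignore_limit models MADV_COLLAPSE, which ignores the khugepaged setting.
 */
static bool scan_allows_collapse(const int *mapcount, size_t nr_ptes,
				 int max_ptes_shared, bool ignore_limit)
{
	int shared = 0;

	for (size_t i = 0; i < nr_ptes; i++) {
		if (mapcount[i] <= 1)
			continue;	/* unmapped or exclusive */
		shared++;
		if (!ignore_limit && shared > max_ptes_shared)
			return false;	/* limit on shared PTEs exceeded */
	}
	return true;
}
```

With exactly 256 shared PTEs the model still allows the collapse; the
257th shared PTE makes it bail out unless the limit is ignored.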

khugepaged currently does not care about swapcache page references, and
does not check under folio lock: so in some corner cases the "shared vs.
exclusive" detection might be a bit off, making us detect "exclusive" when
it's actually "shared".

Most of our anonymous folios in the system are usually exclusive. We
frequently see sharing of anonymous folios for a short period of time,
after which our short-lived subprocesses either quit or exec().

There are some famous examples, though, where child processes exist for a
long time, and where memory is COW-shared with a lot of processes
(webservers, webbrowsers, sshd, ...) and COW-sharing is crucial for
reducing the memory footprint. We don't want to suddenly change the
behavior to result in a significant increase in memory waste.

Interestingly, khugepaged will only collapse an anonymous THP if at least
one PTE is writable. After fork(), that means that something (usually a
page fault) populated at least a single exclusive anonymous THP in that PMD
range.

So ... what happens when we switch to "is this folio mapped shared"
instead of "is this page mapped shared" by using
folio_likely_mapped_shared()?

For "not-COW-shared" folios, small folios and for THPs (large
folios) that are completely mapped into at least one process,
switching to folio_likely_mapped_shared() will not result in a change.

We'll only see a change for COW-shared PTE-mapped THPs that are
partially mapped into all involved processes.

There are two cases to consider:

(A) folio_likely_mapped_shared() returns "false" for a PTE-mapped THP

   If the folio is detected as exclusive, and it actually is exclusive,
   there is no change: page_mapcount() == 1. This is the common case
   without fork() or with short-lived child processes.

   folio_likely_mapped_shared() might currently still detect a folio as
   exclusive although it is shared (false negatives): this happens if the
   first page is not mapped multiple times and if the average per-page
   mapcount is smaller than 1. That implies that (1) the folio is partially
   mapped, (2) if we are responsible for many mapcounts by mapping many
   pages, others can't be ("mostly exclusive"), and (3) if we are not
   responsible for many mapcounts because we map only a few pages ("mostly
   shared"), it won't make a big impact on the end result.

   So while we might now detect a page as "exclusive" although it isn't,
   it's not expected to make a big difference in common cases.

(B) folio_likely_mapped_shared() returns "true" for a PTE-mapped THP

   folio_likely_mapped_shared() will never detect a large anonymous folio
   as shared although it is exclusive: there are no false positives.

   If we detect a THP as shared, at least one page of the THP is mapped by
   another process. It could well be that some pages are actually exclusive.
   For example, our child processes could have unmapped/COW'ed some pages
   such that they would now be exclusive to our process, which we now
   would treat as still-shared.
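
Both cases can be reproduced with a small userspace model. The sketch below
assumes the heuristic works roughly as described above (the total mapcount
is compared against the number of pages, followed by a guess based on the
first page). struct model_folio and the model_*() helpers are made up for
illustration and ignore PMD ("entire") mappings and hugetlb.

```c
#include <stdbool.h>

#define MODEL_MAX_PAGES 8

/* Hypothetical stand-in for a large folio that is only PTE-mapped. */
struct model_folio {
	int nr_pages;
	int page_mapcount[MODEL_MAX_PAGES];	/* mappings per page */
};

/* Sum of all per-page mapcounts: models the folio's "large mapcount". */
static int model_folio_mapcount(const struct model_folio *f)
{
	int total = 0;

	for (int i = 0; i < f->nr_pages; i++)
		total += f->page_mapcount[i];
	return total;
}

/* Old per-page check: is this particular page mapped more than once? */
static bool model_page_shared(const struct model_folio *f, int idx)
{
	return f->page_mapcount[idx] > 1;
}

/* Per-folio guess, modeled on the description of the heuristic above. */
static bool model_folio_likely_mapped_shared(const struct model_folio *f)
{
	int mapcount = model_folio_mapcount(f);

	if (mapcount <= 1)
		return false;		/* a single mapping is exclusive */
	if (mapcount > f->nr_pages)
		return true;		/* some page must be mapped twice */
	return f->page_mapcount[0] > 1;	/* guess based on the first page */
}
```

For (A), a 4-page folio with per-page mapcounts {1, 2, 0, 0} is reported as
exclusive although the second page is shared. For (B), {2, 1, 1, 1} is
reported as shared, so all four PTEs would count as shared although only
the first page is mapped twice. A folio where no page is mapped more than
once can never be reported as shared in this model.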

IIUC, case A may under-count shared PTEs, however on the opposite side
case B may over-count shared PTEs, right? So the impact may depend on
what value is used for the max_ptes_shared tunable. It may have a more
noticeable impact on a very conservative setting (i.e. max_ptes_shared
== 1) if it is under-counted or on a very aggressive setting (i.e.
max_ptes_shared == 510) if it is over-counted.
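
The dependence on the tunable can be illustrated with a tiny model. It
assumes the collapse is refused once the number of PTEs counted as shared
exceeds max_ptes_shared; collapse_allowed() and miscount_changes_outcome()
are hypothetical helpers, not kernel functions.

```c
#include <stdbool.h>

/* Assumed rule: refuse the collapse once shared PTEs exceed the limit. */
static bool collapse_allowed(int shared_ptes, int max_ptes_shared)
{
	return shared_ptes <= max_ptes_shared;
}

/*
 * Returns true if counting "counted" shared PTEs instead of the actual
 * number "actual" flips the collapse decision for the given limit.
 */
static bool miscount_changes_outcome(int actual, int counted,
				     int max_ptes_shared)
{
	return collapse_allowed(actual, max_ptes_shared) !=
	       collapse_allowed(counted, max_ptes_shared);
}
```

With a limit of 1, counting 0 instead of 2 shared PTEs flips the decision
towards collapsing; with a limit of 510, counting 512 instead of 500 flips
it towards refusing. Away from the limit, e.g. 116 instead of 100 with a
limit of 256, the miscount does not change the outcome.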

Thanks for reading all of that!

Right, and mostly affecting corner cases. I'm not concerned about (B) 
really. I was more concerned about (A) before I optimized 
folio_likely_mapped_shared() using the large mapcount.

So I agree it should not matter much for common cases. AFAIK, the
use case for an aggressive setting should be very rare, but a conservative
setting may be more common, so improving the under-count for a
conservative setting may be worth it.

Yes, sorting out A completely is what I'm working on, but it will
likely not be available on all kernel configs, at least initially.

And unless there is a good reason, I want to avoid having 
config-dependent stuff all over the kernel -- and just move 
page_mapcount() to task_mmu.c where it can no longer be (ab)used.

Thanks!

--
Cheers,

David / dhildenb