[...]
Just a note (that you maybe already know) that we have to be a bit careful in
the wp_copy path with replacing sub-pages that are marked exclusive.
Ahh, no I wasn't aware of this - thanks for taking the time to explain it. I
think I have a bug.
(I'm guessing the GUP fast path assumes that if it sees an exclusive page then
that page can't go away? And if I then wp_copy it, I'm breaking the assumption?
But surely user space can munmap it at any time and the consequences are
similar? It's probably clear that I don't know much about the GUP implementation
details...)
If GUP finds a read-only PTE pointing at an exclusive subpage, it
assumes that this page cannot randomly be replaced by core MM due to
COW. See gup_must_unshare(). So it can go ahead and pin the page. As
long as user space doesn't do something stupid with the mapping
(MADV_DONTNEED, munmap()) the pinned page must correspond to the mapped
page.
If GUP finds a writeable PTE, it assumes that this page cannot randomly
be replaced by core MM due to COW -- because writable implies exclusive.
See, for example the VM_BUG_ON_PAGE() in follow_page_pte(). So,
similarly, GUP can simply go ahead and pin the page.
GUP-fast runs lockless, not even taking the PT locks. It syncs against
concurrent fork() using a special seqlock, and essentially unpins
whatever it temporarily pinned when it detects that fork() was running
concurrently. But it might result in some pages temporarily being
flagged as "maybe pinned".
In other cases (!fork()), GUP-fast synchronizes against concurrent
sharing (KSM) or unmapping (migration, swapout) that implies clearing of
the PG_anon_exclusive flag of the subpage by first unmapping the PTE and
conditionally remapping it. See mm/ksm.c:write_protect_page() as an
example for the sharing side (especially: if page_try_share_anon_rmap()
fails because the page may be pinned).
Long story short: replacing a r-o "maybe shared" (!exclusive) PTE is
easy. Replacing an exclusive PTE (including writable PTEs) requires some
work to sync with GUP-fast and goes rather in the "maybe just don't
bother" territory.
My current patch always prefers reuse over copy, and expands the size of the
reuse to the biggest set of pages that are all exclusive (determined either by
the presence of the anon_exclusive flag or from the refcount), and covered by
the same folio (and a few other bounds constraints - see
calc_anon_folio_range_reuse()).
If I determine I must copy (because the "anchor" page is not exclusive), then I
determine the size of the copy region based on a few constraints (see
calc_anon_folio_order_copy()). But I think you are saying that no pages in that
region are allowed to have the anon_exclusive flag set? In which case, this
could be fixed by adding that check in the function.
Yes, changing a PTE that points at an anonymous subpage that has the
"exclusive" flag set requires more care.
Currently, we always only replace a single shared anon (sub)page by a fresh
exclusive base-page during a write-fault/unsharing. As the sub-page is already
marked "maybe shared", it cannot get pinned concurrently and everybody is happy.
When you say, "maybe shared" is that determined by the absence of the
"exclusive" flag?
Yes. Semantics of PG_anon_exclusive are "exclusive" vs. "maybe shared".
Once "maybe shared", we must only go back to "exclusive (set the flag)
if we are sure that there are no other references to the page.
If you now decide to replace more subpages, you have to be careful that none of
them are still exclusive -- because they could get pinned concurrently and
replacing them would result in memory corruptions.
There are scenarios (most prominently: MADV_WIPEONFORK, but also a failed
partial fork()) that could result in something like that.
Are there any test cases that stress the kernel in this area that I could use to
validate my fix?
tools/testing/selftests/mm/cow.c has extensive tests (including some
MADV_DONTFORK -- that's what I actually meant -- and partial mremap
tests), but mostly focuses on ordinary base pages (order-0), THP, and
hugetlb.
We don't have any "GUP-fast racing with fork()" tests or similar yet
(tests that rely on races are not a good candidate for selftests).
We might want to extend tools/testing/selftests/mm/cow.c to test for
some of the cases you extend.
We may also want to change the detection of THP (I think, by luck, it would
currently also test your patches to some degree, due to the way it tests for
THP). The current check
if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) {
ksft_test_result_skip("Did not get a THP populated\n");
goto munmap;
}
Would have to be, for example,
if (!pagemap_is_populated(pagemap_fd, mem + thpsize - pagesize)) {
ksft_test_result_skip("Did not get a THP populated\n");
goto munmap;
}
Because we touch the first PTE in a PMD and want to test if core-mm gave
us a full THP (last PTE also populated).
Extending the tests to cover other anon THP sizes could work by aligning
a VMA to THP/2 size (so we're sure we don't get a full THP), and then
testing if we get more PTEs populated -> your code active.
Further, we have to be a bit careful regarding replacing ranges that are backed
by different anon pages (for example, due to fork() deciding to copy some
sub-pages of a PTE-mapped folio instead of sharing all sub-pages).
I don't understand this statement; do you mean "different anon _folios_"? I am
scanning the page table to expand the region that I reuse/copy and as part of
that scan, make sure that I only cover a single folio. So I think I conform here
- the scan would give up once it gets to the hole.
During fork(), what could happen (temporary detection of pinned page
resulting in a copy) is something weird like:
PTE 0: subpage0 of anon page #1 (maybe shared)
PTE 1: subpage1 of anon page #1 (maybe shared)
PTE 2: anon page #2 (exclusive)
PTE 3: subpage2 of anon page #1 (maybe shared)
Of course, any combination of above.
Further, with mremap() we might get completely crazy layouts, randomly
mapping sub-pages of anon pages, mixed with other sub-pages or base-page
folios.
Maybe it's all handled already by your code, just pointing out which
kind of mess we might get :)
So what should be safe is replacing all sub-pages of a folio that are marked
"maybe shared" by a new folio under PT lock. However, I wonder if it's really
worth the complexity. For THP we were happy so far to *not* optimize this,
implying that maybe we shouldn't worry about optimizing the fork() case for now
that heavily.
I don't have the exact numbers to hand, but I'm pretty sure I remember enabling
large copies was contributing a measurable amount to the performance
improvement. (Certainly, the zero-page copy case is definitely a big
contributor.) I don't have access to the HW at the moment but can rerun later
with and without to double check.
In which test exactly? Some micro-benchmark?
One optimization one could think of instead (that I raised previously in another
context) is the detection of exclusivity after fork()+exit in the child (IOW,
only the parent continues to exist). Once PG_anon_exclusive was cleared for all
sub-pages of the THP-mapped folio during fork(), we'd always decide to copy
instead of reuse (because page_count() > 1, as the folio is PTE mapped).
Scanning the surrounding page table, when it makes sense (e.g., page_count() <=
folio_nr_pages()), to test whether all page references come from the current
process would allow reusing the folio (setting PG_anon_exclusive) for the
sub-pages. The smaller the folio order, the cheaper this "scan surrounding PTEs"
scan is.
For THP, which are usually PMD-mapped even after fork()+exit, we didn't add this
optimization.
Yes, I have already implemented this in my series; see patch 10.
Oh, good! That's the most important part.
--
Thanks,
David / dhildenb