[...]
Just a note (that you maybe already know) that we have to be a bit careful in
the wp_copy path with replacing sub-pages that are marked exclusive.
Ahh, no I wasn't aware of this - thanks for taking the time to explain it. I
think I have a bug.
(I'm guessing the GUP fast path assumes that if it sees an exclusive page then
that page can't go away? And if I then wp_copy it, I'm breaking the assumption?
But surely user space can munmap it at any time and the consequences are
similar? It's probably clear that I don't know much about the GUP implementation
details...)
If GUP finds a read-only PTE pointing at an exclusive subpage, it
assumes that this page cannot randomly be replaced by core MM due to
COW. See gup_must_unshare(). So it can go ahead and pin the page. As
long as user space doesn't do something stupid with the mapping
(MADV_DONTNEED, munmap()) the pinned page must correspond to the mapped
page.
If GUP finds a writeable PTE, it assumes that this page cannot randomly
be replaced by core MM due to COW -- because writable implies exclusive.
See, for example the VM_BUG_ON_PAGE() in follow_page_pte(). So,
similarly, GUP can simply go ahead and pin the page.
GUP-fast runs lockless, not even taking the PT locks. It syncs against
concurrent fork() using a special seqlock, and essentially unpins
whatever it temporarily pinned when it detects that fork() was running
concurrently. But it might result in some pages temporarily being
flagged as "maybe pinned".
In other cases (!fork()), GUP-fast synchronizes against concurrent
sharing (KSM) or unmapping (migration, swapout) that implies clearing of
the PG_anon_exclusive flag of the subpage by first unmapping the PTE and
conditionally remapping it. See mm/ksm.c:write_protect_page() as an
example for the sharing side (especially: if page_try_share_anon_rmap()
fails because the page may be pinned).
Long story short: replacing a r-o "maybe shared" (!exclusive) PTE is
easy. Replacing an exclusive PTE (including writable PTEs) requires some
work to sync with GUP-fast and goes rather in the "maybe just don't
bother" territory.
My current patch always prefers reuse over copy, and expands the size of the
reuse to the biggest set of pages that are all exclusive (determined either by
the presence of the anon_exclusive flag or from the refcount), and covered by
the same folio (and a few other bounds constraints - see
calc_anon_folio_range_reuse()).
If I determine I must copy (because the "anchor" page is not exclusive), then I
determine the size of the copy region based on a few constraints (see
calc_anon_folio_order_copy()). But I think you are saying that no pages in that
region are allowed to have the anon_exclusive flag set? In which case, this
could be fixed by adding that check in the function.
Yes, changing a PTE that points at an anonymous subpage that has the
"exclusive" flag set requires more care.
Currently, we always only replace a single shared anon (sub)page by a fresh
exclusive base-page during a write-fault/unsharing. As the sub-page is already
marked "maybe shared", it cannot get pinned concurrently and everybody is happy.
When you say, "maybe shared" is that determined by the absence of the
"exclusive" flag?
Yes. Semantics of PG_anon_exclusive are "exclusive" vs. "maybe shared".
Once "maybe shared", we must only go back to "exclusive (set the flag)
if we are sure that there are no other references to the page.
If you now decide to replace more subpages, you have to be careful that none of
them are still exclusive -- because they could get pinned concurrently and
replacing them would result in memory corruptions.
There are scenarios (most prominently: MADV_WIPEONFORK, but also a failed
partial fork()) that could result in something like that.
Are there any test cases that stress the kernel in this area that I could use to
validate my fix?
tools/testing/selftests/mm/cow.c has extensive tests (including some
MADV_DONTFORK -- that's what I actually meant -- and partial mremap
tests), but mostly focuses on ordinary base pages (order-0), THP, and
hugetlb.
We don't have any "GUP-fast racing with fork()" tests or similar yet
(tests that rely on races are not a good candidate for selftests).
We might want to extend tools/testing/selftests/mm/cow.c to test for
some of the cases you extend.
We may also want to change the detection of THP (I think, by luck, it would
currently also test your patches to some degree, due to the way it tests for
THP). The current check
if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) {
ksft_test_result_skip("Did not get a THP populated\n");
goto munmap;
}
Would have to be, for example,
if (!pagemap_is_populated(pagemap_fd, mem + thpsize - pagesize)) {
ksft_test_result_skip("Did not get a THP populated\n");
goto munmap;
}
Because we touch the first PTE in a PMD and want to test if core-mm gave
us a full THP (last PTE also populated).
Extending the tests to cover other anon THP sizes could work by aligning
a VMA to THP/2 size (so we're sure we don't get a full THP), and then
testing if we get more PTEs populated -> your code active.
Further, we have to be a bit careful regarding replacing ranges that are backed
by different anon pages (for example, due to fork() deciding to copy some
sub-pages of a PTE-mapped folio instead of sharing all sub-pages).
I don't understand this statement; do you mean "different anon _folios_"? I am
scanning the page table to expand the region that I reuse/copy and as part of
that scan, make sure that I only cover a single folio. So I think I conform here
- the scan would give up once it gets to the hole.
During fork(), what could happen (temporary detection of pinned page
resulting in a copy) is something weird like:
PTE 0: subpage0 of anon page #1 (maybe shared)
PTE 1: subpage1 of anon page #1 (maybe shared)
PTE 2: anon page #2 (exclusive)
PTE 3: subpage2 of anon page #1 (maybe shared)
Of course, any combination of above.
Further, with mremap() we might get completely crazy layouts, randomly
mapping sub-pages of anon pages, mixed with other sub-pages or base-page
folios.
Maybe it's all handled already by your code, just pointing out which
kind of mess we might get :)
So what should be safe is replacing all sub-pages of a folio that are marked
"maybe shared" by a new folio under PT lock. However, I wonder if it's really
worth the complexity. For THP we were happy so far to *not* optimize this,
implying that maybe we shouldn't worry about optimizing the fork() case for now
that heavily.
I don't have the exact numbers to hand, but I'm pretty sure I remember enabling
large copies was contributing a measurable amount to the performance
improvement. (Certainly, the zero-page copy case is definitely a big
contributor.) I don't have access to the HW at the moment but can rerun later
with and without to double check.
In which test exactly? Some micro-benchmark?
One optimization one could think of instead (that I raised previously in another
context) is the detection of exclusivity after fork()+exit in the child (IOW,
only the parent continues to exist). Once PG_anon_exclusive was cleared for all
sub-pages of the THP-mapped folio during fork(), we'd always decide to copy
instead of reuse (because page_count() > 1, as the folio is PTE mapped).
Scanning the surrounding page table, when it makes sense (e.g., page_count() <=
folio_nr_pages()), to test whether all page references come from the current
process would allow reusing the folio (setting PG_anon_exclusive) for the
sub-pages. The smaller the folio order, the cheaper this "scan surrounding PTEs"
scan is.
For THP, which are usually PMD-mapped even after fork()+exit, we didn't add this
optimization.
Yes, I have already implemented this in my series; see patch 10.
Oh, good! That's the most important part.
--
Thanks,
David / dhildenb