Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[...]

Just a note (that you maybe already know) that we have to be a bit careful in
the wp_copy path with replacing sub-pages that are marked exclusive.

Ahh, no I wasn't aware of this - thanks for taking the time to explain it. I
think I have a bug.

(I'm guessing the GUP fast path assumes that if it sees an exclusive page then
that page can't go away? And if I then wp_copy it, I'm breaking the assumption?
But surely user space can munmap it at any time and the consequences are
similar? It's probably clear that I don't know much about the GUP implementation
details...)

If GUP finds a read-only PTE pointing at an exclusive subpage, it assumes that this page cannot randomly be replaced by core MM due to COW. See gup_must_unshare(). So it can go ahead and pin the page. As long as user space doesn't do something stupid with the mapping (MADV_DONTNEED, munmap()) the pinned page must correspond to the mapped page.

If GUP finds a writeable PTE, it assumes that this page cannot randomly be replaced by core MM due to COW -- because writable implies exclusive. See, for example the VM_BUG_ON_PAGE() in follow_page_pte(). So, similarly, GUP can simply go ahead and pin the page.

GUP-fast runs lockless, not even taking the PT locks. It syncs against concurrent fork() using a special seqlock, and essentially unpins whatever it temporarily pinned when it detects that fork() was running concurrently. But it might result in some pages temporarily being flagged as "maybe pinned".

In other cases (!fork()), GUP-fast synchronizes against concurrent sharing (KSM) or unmapping (migration, swapout) that implies clearing of the PG_anon_flag of the subpage by first unmapping the PTE and conditionally remapping it. See mm/ksm.c:write_protect_page() as an example for the sharing side (especially: if page_try_share_anon_rmap() fails because the page may be pinned).

Long story short: replacing a r-o "maybe shared" (!exclusive) PTE is easy. Replacing an exclusive PTE (including writable PTEs) requires some work to sync with GUP-fast and goes rather in the "maybe just don't bother" terriroty.


My current patch always prefers reuse over copy, and expands the size of the
reuse to the biggest set of pages that are all exclusive (determined either by
the presence of the anon_exclusive flag or from the refcount), and covered by
the same folio (and a few other bounds constraints - see
calc_anon_folio_range_reuse()).

If I determine I must copy (because the "anchor" page is not exclusive), then I
determine the size of the copy region based on a few constraints (see
calc_anon_folio_order_copy()). But I think you are saying that no pages in that
region are allowed to have the anon_exclusive flag set? In which case, this
could be fixed by adding that check in the function.

Yes, changing a PTE that points at an anonymous subpage that has the "exclusive" flag set requires more care.



Currently, we always only replace a single shared anon (sub)page by a fresh
exclusive base-page during a write-fault/unsharing. As the sub-page is already
marked "maybe shared", it cannot get pinned concurrently and everybody is happy.

When you say, "maybe shared" is that determined by the absence of the
"exclusive" flag?

Yes. Semantics of PG_anon_exclusive are "exclusive" vs. "maybe shared". Once "maybe shared", we must only go back to "exclusive (set the flag) if we are sure that there are no other references to the page.



If you now decide to replace more subpages, you have to be careful that none of
them are still exclusive -- because they could get pinned concurrently and
replacing them would result in memory corruptions.

There are scenarios (most prominently: MADV_WIPEONFORK), but also failed partial
fork() that could result in something like that.

Are there any test cases that stress the kernel in this area that I could use to
validate my fix?

tools/testing/selftests/mm/cow.c does excessive tests (including some MADV_DONTFORK -- that's what I actually meant -- and partial mremap tests), but mostly focuses on ordinary base pages (order-0), THP, and hugetlb.

We don't have any "GUP-fast racing with fork()" tests or similar yet (tests that rely on races are not a good candidate for selftests).

We might want to extend tools/testing/selftests/mm/cow.c to test for some of the cases you extend.

We may also change the detection of THP (I think, by luck, it would currently also test your patches to some degree set the way it tests for THP)

if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) {
	ksft_test_result_skip("Did not get a THP populated\n");
	goto munmap;
}

Would have to be, for example,

if (!pagemap_is_populated(pagemap_fd, mem + thpsize - pagesize)) {
	ksft_test_result_skip("Did not get a THP populated\n");
	goto munmap;
}

Because we touch the first PTE in a PMD and want to test if core-mm gave us a full THP (last PTE also populated).


Extending the tests to cover other anon THP sizes could work by aligning a VMA to THP/2 size (so we're sure we don't get a full THP), and then testing if we get more PTEs populated -> your code active.



Further, we have to be a bit careful regarding replacing ranges that are backed
by different anon pages (for example, due to fork() deciding to copy some
sub-pages of a PTE-mapped folio instead of sharing all sub-pages).

I don't understand this statement; do you mean "different anon _folios_"? I am
scanning the page table to expand the region that I reuse/copy and as part of
that scan, make sure that I only cover a single folio. So I think I conform here
- the scan would give up once it gets to the hole.

During fork(), what could happen (temporary detection of pinned page resulting in a copy) is something weird like:

PTE 0: subpage0 of anon page #1 (maybe shared)
PTE 1: subpage1 of anon page #1 (maybe shared
PTE 2: anon page #2 (exclusive)
PTE 3: subpage2 of anon page #1 (maybe shared

Of course, any combination of above.

Further, with mremap() we might get completely crazy layouts, randomly mapping sub-pages of anon pages, mixed with other sub-pages or base-page folios.

Maybe it's all handled already by your code, just pointing out which kind of mess we might get :)




So what should be safe is replacing all sub-pages of a folio that are marked
"maybe shared" by a new folio under PT lock. However, I wonder if it's really
worth the complexity. For THP we were happy so far to *not* optimize this,
implying that maybe we shouldn't worry about optimizing the fork() case for now
that heavily.

I don't have the exact numbers to hand, but I'm pretty sure I remember enabling
large copies was contributing a measurable amount to the performance
improvement. (Certainly, the zero-page copy case, is definitely a big
contributer). I don't have access to the HW at the moment but can rerun later
with and without to double check.

In which test exactly? Some micro-benchmark?




One optimization once could think of instead (that I raised previously in other
context) is the detection of exclusivity after fork()+exit in the child (IOW,
only the parent continues to exist). Once PG_anon_exclusive was cleared for all
sub-pages of the THP-mapped folio during fork(), we'd always decide to copy
instead of reuse (because page_count() > 1, as the folio is PTE mapped).
Scanning the surrounding page table if it makes sense (e.g., page_count() <=
folio_nr_pages()), to test if all page references are from the current process
would allow for reusing the folio (setting PG_anon_exclusive) for the sub-pages.
The smaller the folio order, the cheaper this "scan surrounding PTEs" scan is.
For THP, which are usually PMD-mapped even after fork()+exit, we didn't add this
optimization.

Yes, I have already implemented this in my series; see patch 10.

Oh, good! That's the most important part.

--
Thanks,

David / dhildenb





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux