With rmap batching from [1] -- rebased+changed on top of that -- we could turn
that into an effective (untested):

 	if (page && folio_test_anon(folio)) {
+		nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr, end,
+						pte, enforce_uffd_wp, &nr_dirty,
+						&nr_writable);
 		/*
 		 * If this page may have been pinned by the parent process,
 		 * copy the page immediately for the child so that we'll always
 		 * guarantee the pinned page won't be randomly replaced in the
 		 * future.
 		 */
-		folio_get(folio);
-		if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, src_vma))) {
+		folio_ref_add(folio, nr);
+		if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page, nr, src_vma))) {
 			/* Page may be pinned, we have to copy. */
-			folio_put(folio);
-			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
-						 addr, rss, prealloc, page);
+			folio_ref_sub(folio, nr);
+			ret = copy_present_page(dst_vma, src_vma, dst_pte,
+						src_pte, addr, rss, prealloc,
+						page);
+			return ret == 0 ? 1 : ret;
 		}
-		rss[MM_ANONPAGES]++;
+		rss[MM_ANONPAGES] += nr;
 	} else if (page) {
-		folio_get(folio);
-		folio_dup_file_rmap_pte(folio, page);
-		rss[mm_counter_file(page)]++;
+		nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr, end,
+						pte, enforce_uffd_wp, &nr_dirty,
+						&nr_writable);
+		folio_ref_add(folio, nr);
+		folio_dup_file_rmap_ptes(folio, page, nr);
+		rss[mm_counter_file(page)] += nr;
 	}

We'll have to test performance, but it could be that we want to specialize
more on !folio_test_large(). That code is very performance-sensitive.

[1] https://lkml.kernel.org/r/20231204142146.91437-1-david@xxxxxxxxxx
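To make the batching idea concrete, here is a rough userspace model of what a
helper like folio_nr_pages_cont_mapped() has to compute: how many consecutive
PTEs map consecutive pages of the same folio, plus how many of them are dirty
and writable. Everything below (struct toy_pte, toy_nr_pages_cont_mapped() and
its parameters) is made up for illustration and is not the kernel interface;
the real helper works on pte_t values and also has to care about things like
uffd-wp, which this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a PTE: only the fields this sketch needs (hypothetical). */
struct toy_pte {
	unsigned long pfn;
	bool present;
	bool writable;
	bool dirty;
};

/*
 * Hypothetical model of the batching helper: count how many consecutive
 * entries, starting at ptes[0], are present and map consecutive PFNs,
 * without exceeding max_nr entries or running past the end of the folio.
 * Also reports how many entries in the batch are dirty / writable.
 */
static size_t toy_nr_pages_cont_mapped(const struct toy_pte *ptes,
				       size_t max_nr,
				       unsigned long folio_start_pfn,
				       size_t folio_nr_pages,
				       size_t *nr_dirty, size_t *nr_writable)
{
	unsigned long folio_end_pfn = folio_start_pfn + folio_nr_pages;
	size_t nr = 0;

	*nr_dirty = 0;
	*nr_writable = 0;

	/* The caller already knows the first PTE maps a page of this folio. */
	assert(max_nr >= 1 && ptes[0].present);
	assert(ptes[0].pfn >= folio_start_pfn && ptes[0].pfn < folio_end_pfn);

	while (nr < max_nr) {
		const struct toy_pte *pte = &ptes[nr];

		/* Stop at the first hole, PFN discontinuity or folio end. */
		if (!pte->present || pte->pfn != ptes[0].pfn + nr ||
		    pte->pfn >= folio_end_pfn)
			break;

		*nr_dirty += pte->dirty;
		*nr_writable += pte->writable;
		nr++;
	}

	return nr;
}
```

The caller would then take nr references and do nr rmap/counter updates in
one go instead of looping one PTE at a time.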
So, on top of [1] without rmap batching, but with a slightly modified version of yours (one that keeps the existing code structure as pointed out and, e.g., updates the counters), running my fork() microbenchmark with 1 GiB of memory:
Compared to [1], with all order-0 pages it gets 13--14% _slower_ and with all PTE-mapped THP (order-9) it gets ~29--30% _faster_.
So it looks like we really want to have a completely separate code path for "!folio_test_large()" to keep that case as fast as possible. And "Likely" we want to use "likely(!folio_test_large())". ;)
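As a rough userspace sketch of that split (not kernel code): the order-0 case
takes a short, branch-predicted path that never touches the batching logic,
and only large folios pay for the batch handling. struct toy_folio,
toy_likely() and toy_copy_present_ptes() are names invented for this sketch,
and the large-folio path simply assumes that all requested PTEs map
consecutive pages of the folio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's likely(); falls back to a no-op. */
#if defined(__GNUC__)
#define toy_likely(x) __builtin_expect(!!(x), 1)
#else
#define toy_likely(x) (x)
#endif

/* Toy folio with just enough state for the sketch (hypothetical). */
struct toy_folio {
	size_t nr_pages;	/* 1 for an order-0 folio */
	long refcount;
	long mapcount;
};

static bool toy_folio_test_large(const struct toy_folio *folio)
{
	return folio->nr_pages > 1;
}

/*
 * Returns the number of PTEs handled. Assumption for the sketch: for a
 * large folio, up to max_nr PTEs map consecutive pages of that folio.
 */
static size_t toy_copy_present_ptes(struct toy_folio *folio, size_t max_nr,
				    long *rss)
{
	size_t nr;

	assert(max_nr >= 1);

	if (toy_likely(!toy_folio_test_large(folio))) {
		/* Fast path: exactly one page, no batch detection at all. */
		folio->refcount++;
		folio->mapcount++;
		(*rss)++;
		return 1;
	}

	/* Batched path: one refcount/mapcount/rss update for nr pages. */
	nr = max_nr < folio->nr_pages ? max_nr : folio->nr_pages;
	folio->refcount += (long)nr;
	folio->mapcount += (long)nr;
	*rss += (long)nr;
	return nr;
}
```

The point is only the shape: the small-folio branch stays as close as possible
to the existing per-PTE code, so it should not pay for the batching.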
Performing rmap batching on top of that code only slightly (another 1% or so) improves performance in the PTE-mapped THP (order-9) case right now, in contrast to other rmap batching. The reason is that all rmap code gets inlined here and we're only doing subpage mapcount updates + PAE handling.
-- 
Cheers,

David / dhildenb