On 04/12/2023 17:27, David Hildenbrand wrote: >> >> With rmap batching from [1] -- rebased+changed on top of that -- we could turn >> that into an effective (untested): >> >> if (page && folio_test_anon(folio)) { >> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr, end, >> + pte, enforce_uffd_wp, &nr_dirty, >> + &nr_writable); >> /* >> * If this page may have been pinned by the parent process, >> * copy the page immediately for the child so that we'll always >> * guarantee the pinned page won't be randomly replaced in the >> * future. >> */ >> - folio_get(folio); >> - if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, >> src_vma))) { >> + folio_ref_add(folio, nr); >> + if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page, nr, >> src_vma))) { >> /* Page may be pinned, we have to copy. */ >> - folio_put(folio); >> - return copy_present_page(dst_vma, src_vma, dst_pte, >> src_pte, >> - addr, rss, prealloc, page); >> + folio_ref_sub(folio, nr); >> + ret = copy_present_page(dst_vma, src_vma, dst_pte, >> + src_pte, addr, rss, prealloc, >> + page); >> + return ret == 0 ? 1 : ret; >> } >> - rss[MM_ANONPAGES]++; >> + rss[MM_ANONPAGES] += nr; >> } else if (page) { >> - folio_get(folio); >> - folio_dup_file_rmap_pte(folio, page); >> - rss[mm_counter_file(page)]++; >> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr, end, >> + pte, enforce_uffd_wp, &nr_dirty, >> + &nr_writable); >> + folio_ref_add(folio, nr); >> + folio_dup_file_rmap_ptes(folio, page, nr); >> + rss[mm_counter_file(page)] += nr; >> } >> >> >> We'll have to test performance, but it could be that we want to specialize >> more on !folio_test_large(). That code is very performance-sensitive. >> >> >> [1] https://lkml.kernel.org/r/20231204142146.91437-1-david@xxxxxxxxxx > > So, on top of [1] without rmap batching but with a slightly modified version of Can you clarify what you mean by "without rmap batching"? I thought [1] implicitly adds rmap batching? (e.g. folio_dup_file_rmap_ptes(), which you've added in the code snippet above). > yours (that keeps the existing code structure as pointed out and e.g., updates > counter updates), running my fork() microbenchmark with a 1 GiB of memory: > > Compared to [1], with all order-0 pages it gets 13--14% _slower_ and with all > PTE-mapped THP (order-9) it gets ~29--30% _faster_. What test are you running - I'd like to reproduce if possible, since it sounds like I've got some work to do to remove the order-0 regression. > > So looks like we really want to have a completely seprate code path for > "!folio_test_large()" to keep that case as fast as possible. And "Likely" we > want to use "likely(!folio_test_large()". ;) Yuk, but fair enough. If I can repro the perf numbers, I'll have a go a reworking this. I think you're also implicitly suggesting that this change needs to depend on [1]? Which is a shame... I guess I should also go through a similar exercise for patch 2 in this series. > > Performing rmap batching on top of that code only slightly (another 1% or so) > improves performance in the PTE-mapped THP (order-9) case right now, in contrast > to other rmap batching. Reason is as all rmap code gets inlined here and we're > only doing subpage mapcount updates + PAE handling. >