On 05/12/2023 12:04, David Hildenbrand wrote:
> On 05.12.23 12:30, Ryan Roberts wrote:
>> On 04/12/2023 17:27, David Hildenbrand wrote:
>>>>
>>>> With rmap batching from [1] -- rebased+changed on top of that -- we could
>>>> turn that into an effective (untested):
>>>>
>>>>          if (page && folio_test_anon(folio)) {
>>>> +               nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr, end,
>>>> +                                               pte, enforce_uffd_wp, &nr_dirty,
>>>> +                                               &nr_writable);
>>>>                  /*
>>>>                   * If this page may have been pinned by the parent process,
>>>>                   * copy the page immediately for the child so that we'll always
>>>>                   * guarantee the pinned page won't be randomly replaced in the
>>>>                   * future.
>>>>                   */
>>>> -               folio_get(folio);
>>>> -               if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, src_vma))) {
>>>> +               folio_ref_add(folio, nr);
>>>> +               if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page, nr, src_vma))) {
>>>>                          /* Page may be pinned, we have to copy. */
>>>> -                       folio_put(folio);
>>>> -                       return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>> -                                                addr, rss, prealloc, page);
>>>> +                       folio_ref_sub(folio, nr);
>>>> +                       ret = copy_present_page(dst_vma, src_vma, dst_pte,
>>>> +                                               src_pte, addr, rss, prealloc,
>>>> +                                               page);
>>>> +                       return ret == 0 ? 1 : ret;
>>>>                  }
>>>> -               rss[MM_ANONPAGES]++;
>>>> +               rss[MM_ANONPAGES] += nr;
>>>>          } else if (page) {
>>>> -               folio_get(folio);
>>>> -               folio_dup_file_rmap_pte(folio, page);
>>>> -               rss[mm_counter_file(page)]++;
>>>> +               nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr, end,
>>>> +                                               pte, enforce_uffd_wp, &nr_dirty,
>>>> +                                               &nr_writable);
>>>> +               folio_ref_add(folio, nr);
>>>> +               folio_dup_file_rmap_ptes(folio, page, nr);
>>>> +               rss[mm_counter_file(page)] += nr;
>>>>          }
>>>>
>>>>
>>>> We'll have to test performance, but it could be that we want to specialize
>>>> more on !folio_test_large(). That code is very performance-sensitive.
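As an aside for anyone following along: the folio_ref_add(folio, nr) / folio_ref_sub(folio, nr) calls in the diff above batch what would otherwise be nr individual folio_get()/folio_put() calls into a single atomic update. A toy userspace analogue of that batching (my own names, not kernel code):

```c
/* Toy analogue (not kernel code) of batched folio refcounting: one atomic
 * read-modify-write covering the whole batch, instead of nr separate
 * increments. */
#include <stdatomic.h>

struct toy_folio {
	atomic_int refcount;
};

static void toy_ref_add(struct toy_folio *f, int nr)
{
	/* single RMW for the whole batch, like folio_ref_add(folio, nr) */
	atomic_fetch_add_explicit(&f->refcount, nr, memory_order_relaxed);
}

static void toy_ref_sub(struct toy_folio *f, int nr)
{
	/* undo the whole batch at once, like folio_ref_sub(folio, nr) */
	atomic_fetch_sub_explicit(&f->refcount, nr, memory_order_relaxed);
}
```

One atomic op per batch instead of one per page is exactly the kind of saving that matters on this path.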
>>>>
>>>>
>>>> [1] https://lkml.kernel.org/r/20231204142146.91437-1-david@xxxxxxxxxx
>>>
>>> So, on top of [1] without rmap batching but with a slightly modified version of
>>
>> Can you clarify what you mean by "without rmap batching"? I thought [1]
>> implicitly adds rmap batching? (e.g. folio_dup_file_rmap_ptes(), which you've
>> added in the code snippet above).
>
> Not calling the batched variants, but essentially doing what your code does
> (with some minor improvements, like updating the rss counters only once).
>
> The snippet above is only linked below. I had the performance numbers for [1]
> ready, so I gave it a test on top of that.
>
> To keep it simple, you might just benchmark with and without your patches.
>
>>
>>> yours (that keeps the existing code structure as pointed out and e.g., updates
>>> counter updates), running my fork() microbenchmark with 1 GiB of memory:
>>>
>>> Compared to [1], with all order-0 pages it gets 13--14% _slower_ and with all
>>> PTE-mapped THP (order-9) it gets ~29--30% _faster_.
>>
>> What test are you running - I'd like to reproduce if possible, since it sounds
>> like I've got some work to do to remove the order-0 regression.
>
> Essentially just allocating 1 GiB of memory and measuring how long it takes to
> call fork().
>
> order-0 benchmarks:
>
> https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/order-0-benchmarks.c?ref_type=heads
>
> e.g.: $ ./order-0-benchmarks fork 100
>
>
> pte-mapped-thp benchmarks:
>
> https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-thp-benchmarks.c?ref_type=heads
>
> e.g.: $ ./pte-mapped-thp-benchmarks fork 100
>
>
> Ideally, pin to one CPU and get stable performance numbers by disabling
> SMT+turbo etc.

This is great - thanks! I'll get to work...

>
>>
>>>
>>> So it looks like we really want to have a completely separate code path for
>>> "!folio_test_large()" to keep that case as fast as possible. And "Likely" we
>>> want to use "likely(!folio_test_large())".
>>> ;)
>>
>> Yuk, but fair enough. If I can repro the perf numbers, I'll have a go at
>> reworking this.
>>
>> I think you're also implicitly suggesting that this change needs to depend on
>> [1]? Which is a shame...
>
> Not necessarily. It certainly cleans up the code, but we can do that in any
> order reasonable.
>
>>
>> I guess I should also go through a similar exercise for patch 2 in this series.
>
>
> Yes. There are "unmap" and "pte-dontneed" benchmarks contained in both files
> above.
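For readers unfamiliar with the idiom being discussed: likely()/unlikely() are the kernel's wrappers around __builtin_expect(), telling the compiler which branch to lay out as the fall-through fast path. A toy userspace illustration of the proposed shape (the flag stands in for folio_test_large(); all names here are mine, not the actual patch):

```c
#include <stdbool.h>

/* Userspace copies of the kernel's branch-prediction hint macros. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

static int small_path_taken, large_path_taken;

/* Toy stand-in for the proposed copy path: keep the order-0 (small folio)
 * case on a minimal, compiler-favoured fast path, and push the batched
 * PTE-mapped THP handling onto the cold branch. */
static void copy_path(bool is_large_folio)
{
	if (likely(!is_large_folio)) {
		small_path_taken++;	/* order-0: common, perf-critical */
	} else {
		large_path_taken++;	/* large folio: batched handling */
	}
}
```

The hint only shifts code layout and prediction bias; both branches remain correct, which is why the separate small-folio path costs little beyond some code duplication.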