Re: [PATCH v3 01/15] mm: Batch-copy PTE ranges during fork()

David Hildenbrand <david@xxxxxxxxxx> · Mon, 4 Dec 2023 18:27:08 +0100

With rmap batching from [1] -- rebased+changed on top of that -- we could turn
that into something like the following (untested):

          if (page && folio_test_anon(folio)) {
+               nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr, end,
+                                               pte, enforce_uffd_wp, &nr_dirty,
+                                               &nr_writable);
                  /*
                   * If this page may have been pinned by the parent process,
                   * copy the page immediately for the child so that we'll always
                   * guarantee the pinned page won't be randomly replaced in the
                   * future.
                   */
-               folio_get(folio);
-               if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, src_vma))) {
+               folio_ref_add(folio, nr);
+               if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page, nr, src_vma))) {
                          /* Page may be pinned, we have to copy. */
-                       folio_put(folio);
-                       return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
-                                                addr, rss, prealloc, page);
+                       folio_ref_sub(folio, nr);
+                       ret = copy_present_page(dst_vma, src_vma, dst_pte,
+                                               src_pte, addr, rss, prealloc,
+                                               page);
+                       return ret == 0 ? 1 : ret;
                  }
-               rss[MM_ANONPAGES]++;
+               rss[MM_ANONPAGES] += nr;
          } else if (page) {
-               folio_get(folio);
-               folio_dup_file_rmap_pte(folio, page);
-               rss[mm_counter_file(page)]++;
+               nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr, end,
+                                               pte, enforce_uffd_wp, &nr_dirty,
+                                               &nr_writable);
+               folio_ref_add(folio, nr);
+               folio_dup_file_rmap_ptes(folio, page, nr);
+               rss[mm_counter_file(page)] += nr;
          }

We'll have to test performance, but it could be that we want to specialize
more on !folio_test_large(). That code is very performance-sensitive.

[1] https://lkml.kernel.org/r/20231204142146.91437-1-david@xxxxxxxxxx

So, on top of [1] without rmap batching but with a slightly modified
version of yours (that keeps the existing code structure as pointed out
and, e.g., adjusts the counter updates), running my fork() microbenchmark
with 1 GiB of memory:

Compared to [1], with all order-0 pages it gets 13--14% _slower_ and 
with all PTE-mapped THP (order-9) it gets ~29--30% _faster_.
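
For anyone who wants to reproduce this kind of measurement, a rough sketch
of a fork() microbenchmark follows. It is not the benchmark used for the
numbers above: the function name time_fork_ns() and the overall structure
are assumptions for illustration only. It populates an anonymous private
mapping and times a single fork(). Controlling whether the memory ends up as
order-0 pages or as PTE-mapped THP (e.g., via madvise() and the THP sysfs
settings) is deliberately left out.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: time one fork() of a process that has `size` bytes
 * of populated anonymous memory. Returns nanoseconds, or -1 on error.
 */
long long time_fork_ns(size_t size)
{
	struct timespec start, end;
	long pagesize = sysconf(_SC_PAGESIZE);
	char *mem;
	pid_t pid;

	mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -1;

	/* Touch every page so the PTEs exist and fork() has to copy them. */
	for (size_t off = 0; off < size; off += (size_t)pagesize)
		mem[off] = 1;

	clock_gettime(CLOCK_MONOTONIC, &start);
	pid = fork();
	if (pid == 0)
		_exit(0);	/* The child does nothing; only fork() cost matters. */
	clock_gettime(CLOCK_MONOTONIC, &end);

	if (pid < 0) {
		munmap(mem, size);
		return -1;
	}

	waitpid(pid, NULL, 0);
	munmap(mem, size);

	return (end.tv_sec - start.tv_sec) * 1000000000LL +
	       (end.tv_nsec - start.tv_nsec);
}
```

Calling time_fork_ns(1UL << 30) would correspond to the 1 GiB case; a real
run would repeat the measurement many times and average.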

So it looks like we really want to have a completely separate code path for
"!folio_test_large()" to keep that case as fast as possible. And
"likely" we want to use "likely(!folio_test_large())". ;)

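To illustrate the shape of such a split, here is a minimal userspace sketch.
Everything in it is a hypothetical stand-in (struct mock_folio,
mock_folio_test_large(), mock_copy_present_ptes()), not kernel code: the
small-folio case stays on a short, branch-hinted fast path, and only large
folios pay for the batching logic.

```c
#include <stdbool.h>

/* Userspace stand-in for the kernel's likely() annotation. */
#define likely(x)	__builtin_expect(!!(x), 1)

/* Hypothetical, heavily simplified stand-in for struct folio. */
struct mock_folio {
	unsigned int order;	/* 0 for a small (single-page) folio */
	int refcount;
};

static bool mock_folio_test_large(const struct mock_folio *folio)
{
	return folio->order > 0;
}

/*
 * Returns the number of PTEs handled. max_nr models how many PTEs remain
 * until the end of the range being copied.
 */
int mock_copy_present_ptes(struct mock_folio *folio, int max_nr)
{
	int nr;

	if (likely(!mock_folio_test_large(folio))) {
		/* Fast path: exactly one PTE, no batching overhead at all. */
		folio->refcount++;
		return 1;
	}

	/* Slow path: batch as many PTEs as the folio and the range allow. */
	nr = 1 << folio->order;
	if (nr > max_nr)
		nr = max_nr;
	folio->refcount += nr;
	return nr;
}
```
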
Performing rmap batching on top of that code only slightly (another 1%
or so) improves performance in the PTE-mapped THP (order-9) case right
now, in contrast to other rmap batching. The reason is that all rmap code
gets inlined here and we're only doing subpage mapcount updates + PAE handling.

--
Cheers,

David / dhildenb