Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

Barry Song <21cnbao@xxxxxxxxx> · Mon, 27 Nov 2023 22:59:57 +1300

On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 27/11/2023 08:42, Barry Song wrote:
> >>> +           for (i = 0; i < nr; i++, page++) {
> >>> +                   if (anon) {
> >>> +                           /*
> >>> +                            * If this page may have been pinned by the
> >>> +                            * parent process, copy the page immediately for
> >>> +                            * the child so that we'll always guarantee the
> >>> +                            * pinned page won't be randomly replaced in the
> >>> +                            * future.
> >>> +                            */
> >>> +                           if (unlikely(page_try_dup_anon_rmap(
> >>> +                                           page, false, src_vma))) {
> >>> +                                   if (i != 0)
> >>> +                                           break;
> >>> +                                   /* Page may be pinned, we have to copy. */
> >>> +                                   return copy_present_page(
> >>> +                                           dst_vma, src_vma, dst_pte,
> >>> +                                           src_pte, addr, rss, prealloc,
> >>> +                                           page);
> >>> +                           }
> >>> +                           rss[MM_ANONPAGES]++;
> >>> +                           VM_BUG_ON(PageAnonExclusive(page));
> >>> +                   } else {
> >>> +                           page_dup_file_rmap(page, false);
> >>> +                           rss[mm_counter_file(page)]++;
> >>> +                   }
> >>>             }
> >>> -           rss[MM_ANONPAGES]++;
> >>> -   } else if (page) {
> >>> -           folio_get(folio);
> >>> -           page_dup_file_rmap(page, false);
> >>> -           rss[mm_counter_file(page)]++;
> >>> +
> >>> +           nr = i;
> >>> +           folio_ref_add(folio, nr);
> >>
> >> You're changing the order of mapcount vs. refcount increment. Don't.
> >> Make sure your refcount >= mapcount.
> >>
> >> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >> then decrementing in case of error accordingly. Errors due to pinned
> >> pages are the corner case.
> >>
> >> I'll note that it will make a lot of sense to have batch variants of
> >> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>
> >
> > i still don't understand why it is not a entire map+1, but an increment
> > in each basepage.
>
> Because we are PTE-mapping the folio, we have to account each individual page.
> If we accounted the entire folio, where would we unaccount it? Each page can be
> unmapped individually (e.g. munmap() part of the folio) so need to account each
> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> atomic, so we can account the entire thing.

Hi Ryan,

There is no problem. for example, a large folio is entirely mapped in
process A with CONPTE,
and only page2 is mapped in process B.
then we will have

entire_map = 0
page0.map = -1
page1.map = -1
page2.map = 0
page3.map = -1
....

>
> >
> > as long as it is a CONTPTE large folio, there is no much difference with
> > PMD-mapped large folio. it has all the chance to be DoubleMap and need
> > split.
> >
> > When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> > similar things on a part of the large folio in process A,
> >
> > this large folio will have partially mapped subpage in A (all CONTPE bits
> > in all subpages need to be removed though we only unmap a part of the
> > large folioas HW requires consistent CONTPTEs); and it has entire map in
> > process B(all PTEs are still CONPTES in process B).
> >
> > isn't it more sensible for this large folios to have entire_map = 0(for
> > process B), and subpages which are still mapped in process A has map_count
> > =0? (start from -1).
> >
> >> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >> check once if the folio maybe pinned, and in that case, you can simply
> >> drop all references again. So you either have all or no ptes to process,
> >> which makes that code easier.
>
> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> fundamentally you can only use entire_mapcount if its only possible to map and
> unmap the whole folio atomically.

My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
it is partially
mapped. if a large folio is mapped in one processes with all CONTPTEs
and meanwhile in another process with partial mapping(w/o CONTPTE), it is
DoubleMapped.

Since we always hold ptl to set or drop CONTPTE bits, set/drop is
still atomic in a
spinlock area.

>
> >>
> >> But that can be added on top, and I'll happily do that.
> >>
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
> >

Thanks
Barry