On Tue, May 7, 2024 at 12:05 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 03.05.24 02:50, Barry Song wrote:
> > From: Chuanhua Han <hanchuanhua@xxxxxxxx>
> >
> > When a large folio is found in the swapcache, the current implementation
> > requires calling do_swap_page() nr_pages times, resulting in nr_pages
> > page faults. This patch opts to map the entire large folio at once to
> > minimize page faults. Additionally, redundant checks and early exits
> > for ARM64 MTE restoring are removed.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@xxxxxxxx>
> > Co-developed-by: Barry Song <v-songbaohua@xxxxxxxx>
> > Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> > ---
> >  mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> >  1 file changed, 48 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 22e7c33cc747..940fdbe69fa1 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >          pte_t pte;
> >          vm_fault_t ret = 0;
> >          void *shadow = NULL;
> > +        int nr_pages = 1;
> > +        unsigned long page_idx = 0;
> > +        unsigned long address = vmf->address;
> > +        pte_t *ptep;
> >
> >          if (!pte_unmap_same(vmf))
> >                  goto out;
> > @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                  goto out_nomap;
> >          }
> >
> > +        ptep = vmf->pte;
> > +        if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> > +                int nr = folio_nr_pages(folio);
> > +                unsigned long idx = folio_page_idx(folio, page);
> > +                unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> > +                unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> > +                pte_t *folio_ptep;
> > +                pte_t folio_pte;
> > +
> > +                if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> > +                        goto check_folio;
> > +                if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> > +                        goto check_folio;
> > +
> > +                folio_ptep = vmf->pte - idx;
> > +                folio_pte = ptep_get(folio_ptep);
> > +                if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> > +                    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> > +                        goto check_folio;
> > +
> > +                page_idx = idx;
> > +                address = folio_start;
> > +                ptep = folio_ptep;
> > +                nr_pages = nr;
> > +                entry = folio->swap;
> > +                page = &folio->page;
> > +        }
> > +
> > +check_folio:
> > +
> >          /*
> >           * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> >           * must never point at an anonymous page in the swapcache that is
> > @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >           * We're already holding a reference on the page but haven't mapped it
> >           * yet.
> >           */
> > -        swap_free_nr(entry, 1);
> > +        swap_free_nr(entry, nr_pages);
> >          if (should_try_to_free_swap(folio, vma, vmf->flags))
> >                  folio_free_swap(folio);
> >
> > -        inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > -        dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > +        folio_ref_add(folio, nr_pages - 1);
> > +        add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > +        add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >          pte = mk_pte(page, vma->vm_page_prot);
> >
> >          /*
> > @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >           * exclusivity.
> >           */
> >          if (!folio_test_ksm(folio) &&
> > -            (exclusive || folio_ref_count(folio) == 1)) {
> > +            (exclusive || (folio_ref_count(folio) == nr_pages &&
> > +                           folio_nr_pages(folio) == nr_pages))) {
> >                  if (vmf->flags & FAULT_FLAG_WRITE) {
> >                          pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >                          vmf->flags &= ~FAULT_FLAG_WRITE;
>
> I fail to convince myself that this change is correct, and if it is
> correct, it's confusing (I think there is a dependency on
> folio_free_swap() having been called and succeeding, such that we don't
> have a folio that is in the swapcache at this point).
>
> Why can't we move the folio_ref_add() after this check and just leave
> the check as it is?
>
> "folio_ref_count(folio) == 1" is as clear as it gets: we hold the single
> reference, so we can do with this thing whatever we want: it's certainly
> exclusive. No swapcache, no other people mapping it.

Right. I believe the code works correctly but is a bit confusing. As you
said, we might move folio_ref_add() to after the folio_ref_count(folio) == 1
check and leave the check as it is (rough sketch below).

>
>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
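P.S. just to make the idea concrete, here is a rough, untested sketch of that
reordering: keep the "== 1" check and only take the extra references once the
exclusivity decision has been made. The surrounding context (in particular the
rmap_flags line and the exact placement) is written from memory rather than
copied from a tree, so please treat it as approximate:

        if (!folio_test_ksm(folio) &&
            (exclusive || folio_ref_count(folio) == 1)) {
                if (vmf->flags & FAULT_FLAG_WRITE) {
                        pte = maybe_mkwrite(pte_mkdirty(pte), vma);
                        vmf->flags &= ~FAULT_FLAG_WRITE;
                }
                rmap_flags |= RMAP_EXCLUSIVE;
        }
        /* take the extra nr_pages - 1 references only after deciding exclusivity */
        folio_ref_add(folio, nr_pages - 1);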