On Tue, May 7, 2024 at 1:16 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 06.05.24 14:58, Barry Song wrote:
> > On Tue, May 7, 2024 at 12:38 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >>
> >> On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>>
> >>> On 04.05.24 01:23, Barry Song wrote:
> >>>> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>>>
> >>>>> On 03/05/2024 01:50, Barry Song wrote:
> >>>>>> From: Chuanhua Han <hanchuanhua@xxxxxxxx>
> >>>>>>
> >>>>>> When a large folio is found in the swapcache, the current implementation
> >>>>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> >>>>>> page faults. This patch opts to map the entire large folio at once to
> >>>>>> minimize page faults. Additionally, redundant checks and early exits
> >>>>>> for ARM64 MTE restoring are removed.
> >>>>>>
> >>>>>> Signed-off-by: Chuanhua Han <hanchuanhua@xxxxxxxx>
> >>>>>> Co-developed-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>>>>> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>>>>
> >>>>> With the suggested changes below:
> >>>>>
> >>>>> Reviewed-by: Ryan Roberts <ryan.roberts@xxxxxxx>
> >>>>>
> >>>>>> ---
> >>>>>>  mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
> >>>>>>  1 file changed, 48 insertions(+), 12 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>>> index 22e7c33cc747..940fdbe69fa1 100644
> >>>>>> --- a/mm/memory.c
> >>>>>> +++ b/mm/memory.c
> >>>>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>  	pte_t pte;
> >>>>>>  	vm_fault_t ret = 0;
> >>>>>>  	void *shadow = NULL;
> >>>>>> +	int nr_pages = 1;
> >>>>>> +	unsigned long page_idx = 0;
> >>>>>> +	unsigned long address = vmf->address;
> >>>>>> +	pte_t *ptep;
> >>>>>
> >>>>> nit: Personally I'd prefer all these to get initialised just before the "if
> >>>>> (folio_test_large()..." block below. That way it is clear they are fresh (in case
> >>>>> any logic between here and there makes an adjustment) and it's clear that they
> >>>>> are only to be used after that block (the compiler will warn if using an
> >>>>> uninitialized value).
> >>>>
> >>>> Right. I agree this will make the code more readable.
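To make that concrete, the next version could keep the declarations at the
top of do_swap_page() and move the initialisation down, along the lines of
the below (untested sketch, not the final patch):

	int nr_pages;
	unsigned long page_idx;
	unsigned long address;
	pte_t *ptep;
	...
	/*
	 * Initialise immediately before the large-folio block so the
	 * values are clearly fresh, and the compiler can warn about
	 * any accidental use before this point.
	 */
	nr_pages = 1;
	page_idx = 0;
	address = vmf->address;
	ptep = vmf->pte;
	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
		...
	}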
> >>>>
> >>>>>
> >>>>>>
> >>>>>>  	if (!pte_unmap_same(vmf))
> >>>>>>  		goto out;
> >>>>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>  			goto out_nomap;
> >>>>>>  	}
> >>>>>>
> >>>>>> +	ptep = vmf->pte;
> >>>>>> +	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> >>>>>> +		int nr = folio_nr_pages(folio);
> >>>>>> +		unsigned long idx = folio_page_idx(folio, page);
> >>>>>> +		unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >>>>>> +		unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >>>>>> +		pte_t *folio_ptep;
> >>>>>> +		pte_t folio_pte;
> >>>>>> +
> >>>>>> +		if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >>>>>> +			goto check_folio;
> >>>>>> +		if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >>>>>> +			goto check_folio;
> >>>>>> +
> >>>>>> +		folio_ptep = vmf->pte - idx;
> >>>>>> +		folio_pte = ptep_get(folio_ptep);
> >>>>>> +		if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> >>>>>> +		    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> >>>>>> +			goto check_folio;
> >>>>>> +
> >>>>>> +		page_idx = idx;
> >>>>>> +		address = folio_start;
> >>>>>> +		ptep = folio_ptep;
> >>>>>> +		nr_pages = nr;
> >>>>>> +		entry = folio->swap;
> >>>>>> +		page = &folio->page;
> >>>>>> +	}
> >>>>>> +
> >>>>>> +check_folio:
> >>>>>
> >>>>> Is this still the correct label name, given the checks are now above the new
> >>>>> block? Perhaps "one_page" or something like that?
> >>>>
> >>>> Not quite sure about this, as the code after "one_page" can handle multiple
> >>>> pages. On the other hand, it seems we really are checking the folio after
> >>>> "check_folio" :-)
> >>>>
> >>>> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
> >>>> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
> >>>>
> >>>> /*
> >>>>  * Check under PT lock (to protect against concurrent fork() sharing
> >>>>  * the swap entry concurrently) for certainly exclusive pages.
> >>>>  */
> >>>> if (!folio_test_ksm(folio)) {
> >>>>
> >>>>>
> >>>>>> +
> >>>>>>  	/*
> >>>>>>  	 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> >>>>>>  	 * must never point at an anonymous page in the swapcache that is
> >>>>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>  	 * We're already holding a reference on the page but haven't mapped it
> >>>>>>  	 * yet.
> >>>>>>  	 */
> >>>>>> -	swap_free_nr(entry, 1);
> >>>>>> +	swap_free_nr(entry, nr_pages);
> >>>>>>  	if (should_try_to_free_swap(folio, vma, vmf->flags))
> >>>>>>  		folio_free_swap(folio);
> >>>>>>
> >>>>>> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >>>>>> -	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> >>>>>> +	folio_ref_add(folio, nr_pages - 1);
> >>>>>> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >>>>>> +	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >>>>>>  	pte = mk_pte(page, vma->vm_page_prot);
> >>>>>>
> >>>>>>  	/*
> >>>>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>>>  	 * exclusivity.
> >>>>>>  	 */
> >>>>>>  	if (!folio_test_ksm(folio) &&
> >>>>>> -	    (exclusive || folio_ref_count(folio) == 1)) {
> >>>>>> +	    (exclusive || (folio_ref_count(folio) == nr_pages &&
> >>>>>> +			   folio_nr_pages(folio) == nr_pages))) {
> >>>>>
> >>>>> I think in practice there is no change here? If nr_pages > 1 then the folio is
> >>>>> in the swapcache, so there is an extra ref on it? I agree with the change for
> >>>>> robustness' sake. Just checking my understanding.
> >>>>
> >>>> This is the code showing we are reusing (mkwrite) a folio only when either:
> >>>> 1. we meet a small folio and we are the only one hitting the small folio
> >>>> 2. we meet a large folio and we are the only one hitting the large folio
> >>>>
> >>>> Any corner cases besides the above two seem difficult. For example,
> >>>> if we hit a large folio in the swapcache but can't map it entirely
> >>>> (nr_pages == 1) due to a partial unmap, we would have
> >>>> folio_ref_count(folio) == nr_pages == 1.
> >>>
> >>> No, there would be other references from the swapcache and
> >>> folio_ref_count(folio) > 1. See my other reply.
> >>
> >> right. can be clearer by:
> >
> > Wait, do we still need folio_nr_pages(folio) == nr_pages even if we use
> > folio_ref_count(folio) == 1 and move folio_ref_add(folio, nr_pages - 1)?
>
> I don't think that we will "need" it.
>
> > One case is that we have a large folio with 16 PTEs and we unmap 15 swap
> > PTE entries, so we have only one swap entry left. Then we hit the large
> > folio in the swapcache, but we have only one PTE, so we map only one PTE.
> > Lacking folio_nr_pages(folio) == nr_pages, we reuse the large folio for
> > one PTE. With it, do_wp_page() will migrate the large folio to a small one?
>
> We will set the PAE bit and do_wp_page() will unconditionally reuse that page.
>
> Note that this is the same as if we had pte_swp_exclusive() set and
> would have run into "exclusive=true" here.
>
> If we'd want a similar "optimization" as we have in
> wp_can_reuse_anon_folio(), you'd want something like
>
> exclusive || (folio_ref_count(folio) == 1 &&
> 	      (!folio_test_large(folio) || nr_pages > 1))

I feel like

A: !folio_test_large(folio) || nr_pages > 1

equals

B: folio_nr_pages(folio) == nr_pages

If the folio is small, folio_test_large(folio) is false, so both A and B
are true.

If the folio is large and we map the whole large folio, A is true because
nr_pages > 1; B is also true.

If the folio is large and we map a single PTE, A is false; B is also
false, because nr_pages == 1 while folio_nr_pages(folio) > 1.

Right? (A small userspace check at the end of this mail walks through the
three cases.) However, I agree that delving into this complexity might not
be necessary at the moment.

> ... but I am not sure if that is really worth the complexity here.
>
> > 1AM, tired and sleepy. Not quite sure I am correct.
> > I look forward to seeing your reply tomorrow morning :-)
>
> Heh, no need to dream about this ;)
>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry
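P.S. Here is the promised check of the A == B claim. folio_nr and nr_pages
are hypothetical stand-ins for folio_nr_pages(folio) and nr_pages; the only
assumption encoded is that do_swap_page() maps either a single PTE or the
whole folio, so nr_pages is either 1 or folio_nr_pages(folio):

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* A: !folio_test_large(folio) || nr_pages > 1 */
static bool cond_a(long folio_nr, long nr_pages)
{
	bool is_large = folio_nr > 1;	/* stand-in for folio_test_large() */

	return !is_large || nr_pages > 1;
}

/* B: folio_nr_pages(folio) == nr_pages */
static bool cond_b(long folio_nr, long nr_pages)
{
	return folio_nr == nr_pages;
}

int main(void)
{
	/* nr_pages is either 1 or folio_nr_pages(folio) */
	long cases[][2] = {
		{  1,  1 },	/* small folio */
		{ 16, 16 },	/* large folio, mapped entirely */
		{ 16,  1 },	/* large folio, single PTE after partial unmap */
	};

	for (int i = 0; i < 3; i++) {
		long folio_nr = cases[i][0], nr = cases[i][1];

		printf("folio_nr=%2ld nr_pages=%2ld A=%d B=%d\n",
		       folio_nr, nr, cond_a(folio_nr, nr), cond_b(folio_nr, nr));
		assert(cond_a(folio_nr, nr) == cond_b(folio_nr, nr));
	}
	return 0;
}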