Re: [PATCH v6 2/3] mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop

Lance Yang <ioworker0@xxxxxxxxx> · Wed, 5 Jun 2024 23:43:06 +0800

On Wed, Jun 5, 2024 at 11:03 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 05.06.24 16:57, Lance Yang wrote:
> > On Wed, Jun 5, 2024 at 10:39 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>
> >> On 05.06.24 16:28, David Hildenbrand wrote:
> >>> On 05.06.24 16:20, Lance Yang wrote:
> >>>> Hi David,
> >>>>
> >>>> On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> On 21.05.24 06:02, Lance Yang wrote:
> >>>>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
> >>>>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
> >>>>>> split the folio.
> >>>>>>
> >>>>>> Since TTU_SPLIT_HUGE_PMD will no longer perform immediately, we might
> >>>>>> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
> >>>>>> the page walk. It’s probably necessary to mlock this THP to prevent it from
> >>>>>> being picked up during page reclaim.
> >>>>>>
> >>>>>> Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> >>>>>> Suggested-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
> >>>>>> Signed-off-by: Lance Yang <ioworker0@xxxxxxxxx>
> >>>>>> ---
> >>>>>
> >>>>> [...] again, sorry for the late review.
> >>>>
> >>>> No worries at all, thanks for taking time to review!
> >>>>
> >>>>>
> >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>>>> index ddffa30c79fb..08a93347f283 100644
> >>>>>> --- a/mm/rmap.c
> >>>>>> +++ b/mm/rmap.c
> >>>>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>>>>          if (flags & TTU_SYNC)
> >>>>>>                  pvmw.flags = PVMW_SYNC;
> >>>>>>
> >>>>>> -     if (flags & TTU_SPLIT_HUGE_PMD)
> >>>>>> -             split_huge_pmd_address(vma, address, false, folio);
> >>>>>> -
> >>>>>>          /*
> >>>>>>           * For THP, we have to assume the worse case ie pmd for invalidation.
> >>>>>>           * For hugetlb, it could be much worse if we need to do pud
> >>>>>> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>>>>          mmu_notifier_invalidate_range_start(&range);
> >>>>>>
> >>>>>>          while (page_vma_mapped_walk(&pvmw)) {
> >>>>>> -             /* Unexpected PMD-mapped THP? */
> >>>>>> -             VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> >>>>>> -
> >>>>>>                  /*
> >>>>>>                   * If the folio is in an mlock()d vma, we must not swap it out.
> >>>>>>                   */
> >>>>>>                  if (!(flags & TTU_IGNORE_MLOCK) &&
> >>>>>>                      (vma->vm_flags & VM_LOCKED)) {
> >>>>>>                          /* Restore the mlock which got missed */
> >>>>>> -                     if (!folio_test_large(folio))
> >>>>>> +                     if (!folio_test_large(folio) ||
> >>>>>> +                         (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >>>>>>                                  mlock_vma_folio(folio, vma);
> >>>>>
> >>>>> Can you elaborate why you think this would be required? If we would have
> >>>>> performed the  split_huge_pmd_address() beforehand, we would still be
> >>>>> left with a large folio, no?
> >>>>
> >>>> Yep, there would still be a large folio, but it wouldn't be PMD-mapped.
> >>>>
> >>>> After Weifeng's series[1], the kernel supports mlock for PTE-mapped large
> >>>> folio, but there are a few scenarios where we don't mlock a large folio, such
> >>>> as when it crosses a VM_LOCKed VMA boundary.
> >>>>
> >>>>     -                     if (!folio_test_large(folio))
> >>>>     +                     if (!folio_test_large(folio) ||
> >>>>     +                         (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >>>>
> >>>> And this check is just future-proofing and likely unnecessary. If encountering a
> >>>> PMD-mapped THP missing the mlock for some reason, we can mlock this
> >>>> THP to prevent it from being picked up during page reclaim, since it is fully
> >>>> mapped and doesn't cross the VMA boundary, IIUC.
> >>>>
> >>>> What do you think?
> >>>> I would appreciate any suggestions regarding this check ;)
> >>>
> >>> Reading this patch only, I wonder if this change makes sense in the
> >>> context here.
> >>>
> >>> Before this patch, we would have PTE-mapped the PMD-mapped THP before
> >>> reaching this call and skipped it due to "!folio_test_large(folio)".
> >>>
> >>> After this patch, we either
> >>>
> >>> a) PTE-remap the THP after this check, but retry and end-up here again,
> >>> whereby we would skip it due to "!folio_test_large(folio)".
> >>>
> >>> b) Discard the PMD-mapped THP due to lazyfree directly. Can that
> >>> co-exist with mlock and what would be the problem here with mlock?
> >>>
> >>>
> >
> > Thanks a lot for clarifying!
> >
> >>> So if the check is required in this patch, we really have to understand
> >>> why. If not, we should better drop it from this patch.
> >>>
> >>> At least my opinion, still struggling to understand why it would be
> >>> required (I have 0 knowledge about mlock interaction with large folios :) ).
> >>>
> >>
> >> Looking at that series, in folio_references_one(), we do
> >>
> >>                          if (!folio_test_large(folio) || !pvmw.pte) {
> >>                                  /* Restore the mlock which got missed */
> >>                                  mlock_vma_folio(folio, vma);
> >>                                  page_vma_mapped_walk_done(&pvmw);
> >>                                  pra->vm_flags |= VM_LOCKED;
> >>                                  return false; /* To break the loop */
> >>                          }
> >>
> >> I wonder if we want that here as well now: in case of lazyfree we
> >> would not back off, right?
> >>
> >> But I'm not sure if lazyfree in mlocked areas are even possible.
> >>
> >> Adding the "!pvmw.pte" would be much clearer to me than the flag check.
> >
> > Hmm... How about we drop it from this patch for now, and add it back if needed
> > in the future?
>
> If we can rule out that MADV_FREE + mlock() keeps working as expected in
> the PMD-mapped case, we're good.
>
> Can we rule that out? (especially for MADV_FREE followed by mlock())

Perhaps we don't need to worry about that.

IIUC, even without that check, MADV_FREE + mlock() still works as expected
in the PMD-mapped case: when we encounter a large folio in a VM_LOCKED VMA
range, we stop the page walk immediately.
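
To make the reasoning concrete, here is a rough user-space model of the
VM_LOCKED branch as it would look without the extra "!pvmw.pte" condition.
It is only a sketch, not kernel code: the MODEL_* values, struct model_step
and model_mlock_check() are made up for illustration, and the "bail out
before split/discard" ordering is my reading of the patch.

```c
#include <stdbool.h>

/* Illustrative flag values only; the real ones live in the kernel headers. */
#define MODEL_TTU_IGNORE_MLOCK   0x1u
#define MODEL_TTU_SPLIT_HUGE_PMD 0x2u
#define MODEL_VM_LOCKED          0x1u

struct model_step {
	bool walk_aborted;   /* page walk stopped, folio left mapped */
	bool mlock_restored; /* mlock_vma_folio() would have been called */
};

/*
 * Models only the VM_LOCKED check at the top of the walk loop in
 * try_to_unmap_one(), i.e. with the plain "!folio_test_large(folio)" test.
 */
static struct model_step model_mlock_check(unsigned int ttu_flags,
					   unsigned int vm_flags,
					   bool folio_is_large)
{
	struct model_step r = { false, false };

	if (!(ttu_flags & MODEL_TTU_IGNORE_MLOCK) &&
	    (vm_flags & MODEL_VM_LOCKED)) {
		/* Restore the mlock which got missed (small folios only). */
		if (!folio_is_large)
			r.mlock_restored = true;
		/* Large or not, bail out before any split or discard. */
		r.walk_aborted = true;
	}
	return r;
}
```

In this model, a large folio in a VM_LOCKED range with
MODEL_TTU_SPLIT_HUGE_PMD set aborts the walk without touching the mlock
state, which is the behaviour I am relying on above.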

Thanks,
Lance

>
> --
> Cheers,
>
> David / dhildenb
>