Re: All MADV_FREE mTHPs are fully subjected to deferred_split_folio()

On Mon, Dec 30, 2024 at 8:52 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 30.12.24 12:54, Barry Song wrote:
> > On Mon, Dec 30, 2024 at 10:48 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>
> >> On 30.12.24 03:14, Lance Yang wrote:
> >>> Hi Barry,
> >>>
> >>> On Mon, Dec 30, 2024 at 5:13 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >>>>
> >>>> Hi Lance,
> >>>>
> >>>> Along with Ryan, David, Baolin, and anyone else who might be interested,
> >>>>
> >>>> We’ve noticed an unexpectedly high number of deferred splits. The root
> >>>> cause appears to be the changes introduced in commit dce7d10be4bbd3
> >>>> ("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Since
> >>>> that commit, split_folio is no longer called in mm/madvise.c.
> >>
> >> Hi,
> >>
> >> I assume you don't actually see "deferred splits" at all. You see that
> >> a folio was added to the deferred split queue, only to be removed again
> >> immediately as it gets freed. Correct?
> >>
> >>>>
> >>>> However, we still perform deferred_split_folio() for all MADV_FREE
> >>>> mTHPs, even those that are fully mapped and mTHP-aligned. This
> >>>> happens because we execute "goto discard" in try_to_unmap_one(),
> >>>> which leads to folio_remove_rmap_pte() adding the folio to the
> >>>> deferred split queue as soon as we unmap the first PTE.
> >>>>
> >>>> discard:
> >>>>                   if (unlikely(folio_test_hugetlb(folio)))
> >>>>                           hugetlb_remove_rmap(folio);
> >>>>                   else
> >>>>                           folio_remove_rmap_pte(folio, subpage, vma);
> >>
> >> Yes, that's kind-of known: we don't do PTE batching during unmap for
> >> reclaim, nor during unmap for migration. We should add that support.
> >>
> >> But note, just like I raised earlier in the similar context of
> >> "improved partial-mapped logic in rmap code when batching", we are
> >> primarily only pleasing counters here.
> >>
> >> See below on concurrent shrinker.
> >>
> >>>>
> >>>> This could lead to a race condition with the shrinker,
> >>>> deferred_split_scan(): the shrinker might call folio_try_get(folio),
> >>>> and while we are scanning the second PTE of this folio in
> >>>> try_to_unmap_one(), the entire mTHP could be transitioned back to
> >>>> swap-backed because the reference count has been incremented.
> >>>>
> >>>>                                   /*
> >>>>                                    * The only page refs must be one from isolation
> >>>>                                    * plus the rmap(s) (dropped by discard:).
> >>>>                                    */
> >>>>                                   if (ref_count == 1 + map_count &&
> >>>>                                       (!folio_test_dirty(folio) ||
> >>>>                                        ...
> >>>>                                        (vma->vm_flags & VM_DROPPABLE))) {
> >>>>                                           dec_mm_counter(mm, MM_ANONPAGES);
> >>>>                                           goto discard;
> >>>>                                   }
> >>
> >>
> >> Reclaim code holds an additional folio reference and has the folio
> >> locked. So I don't think this race can really happen in the way you
> >> think it could? Please feel free to correct me if I am wrong.
> >
> > try_to_unmap_one() will only execute "goto discard" and remove the rmap
> > if ref_count == 1 + map_count. An additional reference taken by the
> > shrinker can invalidate this condition, leading to the PTE being
> > restored and the folio being set swap-backed.
> >
> >                                 /*
> >                                   * The only page refs must be one from isolation
> >                                   * plus the rmap(s) (dropped by discard:).
> >                                   */
> >                                  if (ref_count == 1 + map_count &&
> >                                      (!folio_test_dirty(folio) ||
> >                                       /*
> >                                        * Unlike MADV_FREE mappings, VM_DROPPABLE
> >                                        * ones can be dropped even if they've
> >                                        * been dirtied.
> >                                        */
> >                                       (vma->vm_flags & VM_DROPPABLE))) {
> >                                          dec_mm_counter(mm, MM_ANONPAGES);
> >                                          goto discard;
> >                                  }
> >
> >                                  /*
> >                                   * If the folio was redirtied, it cannot be
> >                                   * discarded. Remap the page to page table.
> >                                   */
> >                                  set_pte_at(mm, address, pvmw.pte, pteval);
> >                                  /*
> >                                   * Unlike MADV_FREE mappings, VM_DROPPABLE ones
> >                                   * never get swap backed on failure to drop.
> >                                   */
> >                                  if (!(vma->vm_flags & VM_DROPPABLE))
> >                                          folio_set_swapbacked(folio);
> >                                  goto walk_abort;
>
> Ah, that's what you mean. Yes, but the shrinker behaves mostly just like
> any other speculative reference.
>
> So we're not actually handling speculative references correctly here,
> which means this issue is not completely shrinker-specific.
>
> Maybe we should be doing something like this?
>
> /*
>   * Unlike MADV_FREE mappings, VM_DROPPABLE ones can be dropped even if
>   * they've been dirtied.
>   */
> if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
>         /*
>          * redirtied either using the page table or a previously
>          * obtained GUP reference.
>          */
>         set_pte_at(mm, address, pvmw.pte, pteval);
>         folio_set_swapbacked(folio);
>         goto walk_abort;
> } else if (ref_count != 1 + map_count) {
>         /*
>          * Additional reference. Could be a GUP reference or any
>          * speculative reference. GUP users must mark the folio dirty if
>          * there was a modification. This folio cannot be reclaimed
>          * right now either way, so act just like nothing happened.
>          * We'll come back here later and detect if the folio was
>          * dirtied when the additional reference is gone.
>          */
>         set_pte_at(mm, address, pvmw.pte, pteval);
>         goto walk_abort;
> }
> goto discard;
>
>
> We'd probably want to clean up the goto labels.

Nice! It looks simple and easy to understand ;)

It seems reasonable to temporarily ignore the folio and check later
whether it was redirtied once the extra reference is released, since
the additional reference could be a GUP reference or another
speculative one. Personally, I also prefer this approach.
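
By the way, regarding the PTE batching you mentioned for the unmap
paths: would something roughly like the sketch below work in
try_to_unmap_one()? This is just a rough, untested idea, assuming
folio_pte_batch() and clear_full_ptes() can be reused here the same
way the lazyfree path in madvise_free does it; the pvmw cursor
advance, TLB flushing and the pteval bookkeeping are all hand-waved:

                /*
                 * Rough sketch only: batch all PTEs mapping a
                 * fully-mapped lazyfree mTHP, so the rmap of the whole
                 * folio is dropped in a single call instead of one PTE
                 * at a time. folio_remove_rmap_ptes() would then see
                 * the folio become fully unmapped at once and never
                 * queue it for deferred split.
                 */
                const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
                                        FPB_IGNORE_SOFT_DIRTY;
                unsigned int max_nr = folio_nr_pages(folio);
                int nr;

                nr = folio_pte_batch(folio, address, pvmw.pte, pteval,
                                     max_nr, fpb_flags, NULL, NULL, NULL);
                /* Clear the whole batch of PTEs ... */
                clear_full_ptes(mm, address, pvmw.pte, nr, /* full */ 0);
                /* ... and drop all of their rmaps together. */
                folio_remove_rmap_ptes(folio, subpage, nr, vma);

If that works out, I guess it would also narrow the window for the
shrinker race above, since we would no longer walk the remaining PTEs
of the folio with its rmap already partially dropped.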

Thanks,
Lance


>
>
> --
> Cheers,
>
> David / dhildenb
>