Re: [PATCH 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free

Barry Song <21cnbao@xxxxxxxxx> · Tue, 27 Feb 2024 22:01:00 +1300

On Tue, Feb 27, 2024 at 9:33 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>
>
>
> On 2/27/24 15:54, Barry Song wrote:
> > On Tue, Feb 27, 2024 at 8:42 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 2/27/24 15:21, Barry Song wrote:
> >>> On Tue, Feb 27, 2024 at 8:11 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >>>>
> >>>> On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2/27/24 14:40, Barry Song wrote:
> >>>>>> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2/27/24 10:17, Barry Song wrote:
> >>>>>>>>> Like if we hit folio which is partially mapped to the range, don't split it but
> >>>>>>>>> just unmap the mapping part from the range. Let page reclaim decide whether
> >>>>>>>>> split the large folio or not (If it's not mapped to any other range,it will be
> >>>>>>>>> freed as whole large folio. If part of it still mapped to other range,page reclaim
> >>>>>>>>> can decide whether to split it or ignore it for current reclaim cycle).
> >>>>>>>> Yes, we can. but we still have to play the ptes check game to avoid adding
> >>>>>>>> folios multiple times to reclaim the list.
> >>>>>>>>
> >>>>>>>> I don't see too much difference between splitting in madvise and splitting
> >>>>>>>> in vmscan.  as our real purpose is avoiding splitting entirely mapped
> >>>>>>>> large folios. for partial mapped large folios, if we split in madvise, then
> >>>>>>>> we don't need to play the game of skipping folios while iterating PTEs.
> >>>>>>>> if we don't split in madvise, we have to make sure the large folio is only
> >>>>>>>> added in reclaimed list one time by checking if PTEs belong to the
> >>>>>>>> previous added folio.
> >>>>>>>
> >>>>>>> If the partial mapped large folio is unmapped from the range, the related PTE
> >>>>>>> become none. How could the folio be added to reclaimed list multiple times?
> >>>>>>
> >>>>>> in case we have 16 PTEs in a large folio.
> >>>>>> PTE0 present
> >>>>>> PTE1 present
> >>>>>> PTE2 present
> >>>>>> PTE3  none
> >>>>>> PTE4 present
> >>>>>> PTE5 none
> >>>>>> PTE6 present
> >>>>>> ....
> >>>>>> the current code is scanning PTE one by one.
> >>>>>> while scanning PTE0, we have added the folio. then PTE1, PTE2, PTE4, PTE6...
> >>>>> No. Before detect the folio is fully mapped to the range, we can't add folio
> >>>>> to reclaim list because the partial mapped folio shouldn't be added. We can
> >>>>> only scan PTE15 and know it's fully mapped.
> >>>>
> >>>> you never know PTE15 is the last one mapping to the large folio, PTE15 can
> >>>> be mapping to a completely different folio with PTE0.
> >>>>
> >>>>>
> >>>>> So, when scanning PTE0, we will not add folio. Then when hit PTE3, we know
> >>>>> this is a partial mapped large folio. We will unmap it. Then all 16 PTEs
> >>>>> become none.
> >>>>
> >>>> I don't understand why all 16PTEs become none as we set PTEs to none.
> >>>> we set PTEs to swap entries till try_to_unmap_one called by vmscan.
> >>>>
> >>>>>
> >>>>> If the large folio is fully mapped, the folio will be added to reclaim list
> >>>>> after scan PTE15 and know it's fully mapped.
> >>>>
> >>>> our approach is calling pte_batch_pte while meeting the first pte, if
> >>>> pte_batch_pte = 16,
> >>>> then we add this folio to reclaim_list and skip the left 15 PTEs.
> >>>
> >>> Let's compare two different implementation, for partial mapped large folio
> >>> with 8 PTEs as below,
> >>>
> >>> PTE0 present for large folio1
> >>> PTE1 present for large folio1
> >>> PTE2 present for another folio2
> >>> PTE3 present for another folio3
> >>> PTE4 present for large folio1
> >>> PTE5 present for large folio1
> >>> PTE6 present for another folio4
> >>> PTE7 present for another folio5
> >>>
> >>> If we don't split in madvise(depend on vmscan to split after adding
> >>> folio1), we will have
> >> Let me clarify something here:
> >>
> >> I prefer that we don't split large folio here. Instead, we unmap the
> >> large folio from this VMA range (I think you missed the unmap operation
> >> I mentioned).
> >
> > I don't understand why we unmap as this is a MADV_PAGEOUT not
> > an unmap. unmapping totally changes the semantics. Would you like
> > to show pseudo code?
> Oh. Yes. MADV_PAGEOUT is not suitable.
>
> What about MADV_FREE?

we can't unmap either. as MADV_FREE applies to anon vma.  while a
folio is marked lazyfree, we move anon folio to file LRU. if somebody
writes the folio afterwards, we take the folio back; if nobody writes it
before vmscan gets it in the file LRU, we can reclaim it by setting PTEs
to none. we can't immediately unmap a large folio at the time
MADV_FREE is called.  immediate unmap is the behavior of MADV_DONTNEED
but not MADV_FREE.

>
> >
> > for MADV_PAGEOUT on swap-out, the last step is writing swap entries
> > to replace PTEs which are present. I don't understand how an unmap
> > can be involved in this process.
> >
> >>
> >> The intention is trying best to avoid splitting the large folio. If
> >> the folio is only partially mapped to this VMA range, it's likely it
> >> will be reclaimed as whole large folio. Which brings benefit for lru
> >> and zone lock contention comparing to splitting large folio.
> >
> > which also brings negative side effects such as redundant I/O.
> > For example, if you have only one subpage left in a large folio,
> > pageout will still write nr_pages subpages into swap, then immediately
> > free them in swap.
> >
> >>
> >> The thing I am not sure is unmapping from specific VMA range is not
> >> available and whether it's worthy to add it.
> >
> > I think we might have the possibility to have some complex code to
> > add folio1, folio2, folio3, folio4 and folio5 in the above example into
> > reclaim_list while avoiding splitting folio1. but i really don't understand
> > how unmap will work.
> >
> >>
> >>> to make sure folio1, folio2, folio3, folio4, folio5 are added to
> >>> reclaim_list by doing a complex
> >>> game while scanning these 8 PTEs.
> >>>
> >>> if we split in madvise, they become:
> >>>
> >>> PTE0 present for large folioA  - splitted from folio 1
> >>> PTE1 present for large folioB - splitted from folio 1
> >>> PTE2 present for another folio2
> >>> PTE3 present for another folio3
> >>> PTE4 present for large folioC - splitted from folio 1
> >>> PTE5 present for large folioD - splitted from folio 1
> >>> PTE6 present for another folio4
> >>> PTE7 present for another folio5
> >>>
> >>> we simply add the above 8 folios into reclaim_list one by one.
> >>>
> >>> I would vote for splitting for partial mapped large folio in madvise.
> >>>
> >

Thanks
Barry