On Tue, Feb 27, 2024 at 9:33 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>
>
>
> On 2/27/24 15:54, Barry Song wrote:
> > On Tue, Feb 27, 2024 at 8:42 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 2/27/24 15:21, Barry Song wrote:
> >>> On Tue, Feb 27, 2024 at 8:11 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >>>>
> >>>> On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2/27/24 14:40, Barry Song wrote:
> >>>>>> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2/27/24 10:17, Barry Song wrote:
> >>>>>>>>> Like if we hit folio which is partially mapped to the range, don't split it but
> >>>>>>>>> just unmap the mapping part from the range. Let page reclaim decide whether
> >>>>>>>>> split the large folio or not (If it's not mapped to any other range,it will be
> >>>>>>>>> freed as whole large folio. If part of it still mapped to other range,page reclaim
> >>>>>>>>> can decide whether to split it or ignore it for current reclaim cycle).
> >>>>>>>> Yes, we can. but we still have to play the ptes check game to avoid adding
> >>>>>>>> folios multiple times to reclaim the list.
> >>>>>>>>
> >>>>>>>> I don't see too much difference between splitting in madvise and splitting
> >>>>>>>> in vmscan. as our real purpose is avoiding splitting entirely mapped
> >>>>>>>> large folios. for partial mapped large folios, if we split in madvise, then
> >>>>>>>> we don't need to play the game of skipping folios while iterating PTEs.
> >>>>>>>> if we don't split in madvise, we have to make sure the large folio is only
> >>>>>>>> added in reclaimed list one time by checking if PTEs belong to the
> >>>>>>>> previous added folio.
> >>>>>>>
> >>>>>>> If the partial mapped large folio is unmapped from the range, the related PTE
> >>>>>>> become none. How could the folio be added to reclaimed list multiple times?
> >>>>>>
> >>>>>> in case we have 16 PTEs in a large folio.
> >>>>>> PTE0 present
> >>>>>> PTE1 present
> >>>>>> PTE2 present
> >>>>>> PTE3 none
> >>>>>> PTE4 present
> >>>>>> PTE5 none
> >>>>>> PTE6 present
> >>>>>> ....
> >>>>>> the current code is scanning PTE one by one.
> >>>>>> while scanning PTE0, we have added the folio. then PTE1, PTE2, PTE4, PTE6...
> >>>>> No. Before detect the folio is fully mapped to the range, we can't add folio
> >>>>> to reclaim list because the partial mapped folio shouldn't be added. We can
> >>>>> only scan PTE15 and know it's fully mapped.
> >>>>
> >>>> you never know PTE15 is the last one mapping to the large folio, PTE15 can
> >>>> be mapping to a completely different folio with PTE0.
> >>>>
> >>>>>
> >>>>> So, when scanning PTE0, we will not add folio. Then when hit PTE3, we know
> >>>>> this is a partial mapped large folio. We will unmap it. Then all 16 PTEs
> >>>>> become none.
> >>>>
> >>>> I don't understand why all 16PTEs become none as we set PTEs to none.
> >>>> we set PTEs to swap entries till try_to_unmap_one called by vmscan.
> >>>>
> >>>>>
> >>>>> If the large folio is fully mapped, the folio will be added to reclaim list
> >>>>> after scan PTE15 and know it's fully mapped.
> >>>>
> >>>> our approach is calling pte_batch_pte while meeting the first pte, if
> >>>> pte_batch_pte = 16,
> >>>> then we add this folio to reclaim_list and skip the left 15 PTEs.
> >>>
> >>> Let's compare two different implementation, for partial mapped large folio
> >>> with 8 PTEs as below,
> >>>
> >>> PTE0 present for large folio1
> >>> PTE1 present for large folio1
> >>> PTE2 present for another folio2
> >>> PTE3 present for another folio3
> >>> PTE4 present for large folio1
> >>> PTE5 present for large folio1
> >>> PTE6 present for another folio4
> >>> PTE7 present for another folio5
> >>>
> >>> If we don't split in madvise(depend on vmscan to split after adding
> >>> folio1), we will have
> >> Let me clarify something here:
> >>
> >> I prefer that we don't split large folio here. Instead, we unmap the
> >> large folio from this VMA range (I think you missed the unmap operation
> >> I mentioned).
> >
> > I don't understand why we unmap as this is a MADV_PAGEOUT not
> > an unmap. unmapping totally changes the semantics. Would you like
> > to show pseudo code?
> Oh. Yes. MADV_PAGEOUT is not suitable.
>
> What about MADV_FREE?

We can't unmap there either. MADV_FREE applies to anonymous VMAs: when a
folio is marked lazyfree, we move the anon folio to the file LRU. If
somebody writes to the folio afterwards, we take it back; if nobody
writes to it before vmscan reaches it on the file LRU, we can reclaim it
by setting the PTEs to none. We can't immediately unmap a large folio at
the time MADV_FREE is called. Immediate unmap is the behavior of
MADV_DONTNEED, not MADV_FREE.

>
> >
> > for MADV_PAGEOUT on swap-out, the last step is writing swap entries
> > to replace PTEs which are present. I don't understand how an unmap
> > can be involved in this process.
> >
> >>
> >> The intention is trying best to avoid splitting the large folio. If
> >> the folio is only partially mapped to this VMA range, it's likely it
> >> will be reclaimed as whole large folio. Which brings benefit for lru
> >> and zone lock contention comparing to splitting large folio.
> >
> > which also brings negative side effects such as redundant I/O.
> > For example, if you have only one subpage left in a large folio,
> > pageout will still write nr_pages subpages into swap, then immediately
> > free them in swap.
> >
> >>
> >> The thing I am not sure is unmapping from specific VMA range is not
> >> available and whether it's worthy to add it.
> >
> > I think we might have the possibility to have some complex code to
> > add folio1, folio2, folio3, folio4 and folio5 in the above example into
> > reclaim_list while avoiding splitting folio1. but i really don't understand
> > how unmap will work.
> >
> >>
> >>> to make sure folio1, folio2, folio3, folio4, folio5 are added to
> >>> reclaim_list by doing a complex
> >>> game while scanning these 8 PTEs.
> >>>
> >>> if we split in madvise, they become:
> >>>
> >>> PTE0 present for large folioA - splitted from folio 1
> >>> PTE1 present for large folioB - splitted from folio 1
> >>> PTE2 present for another folio2
> >>> PTE3 present for another folio3
> >>> PTE4 present for large folioC - splitted from folio 1
> >>> PTE5 present for large folioD - splitted from folio 1
> >>> PTE6 present for another folio4
> >>> PTE7 present for another folio5
> >>>
> >>> we simply add the above 8 folios into reclaim_list one by one.
> >>>
> >>> I would vote for splitting for partial mapped large folio in madvise.
> >>>
> >

Thanks
Barry
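
To make the walk described in this thread concrete, here is a rough userspace sketch in Python. It is only a toy model and not kernel code: pte_batch() and collect_for_reclaim() are hypothetical names, a PTE is modeled as None or a (folio, page index) pair, and the page indices chosen for folio1 (0, 1, 4, 5 of an assumed 8-page folio) are an assumption, since the example in the thread does not give them. A fully mapped large folio is added once and its remaining PTEs are skipped; a partially mapped one is modeled as split, so each PTE contributes its own small folio.

```python
# Toy userspace model of the PTE walk discussed in this thread; NOT kernel code.
# A PTE is None (pte_none) or a (folio_name, page_index) tuple.
# pte_batch() and collect_for_reclaim() are hypothetical names for illustration.

def pte_batch(ptes, start):
    """Count consecutive PTEs from `start` mapping consecutive pages of one folio."""
    folio, page = ptes[start]
    n = 1
    while start + n < len(ptes) and ptes[start + n] == (folio, page + n):
        n += 1
    return n


def collect_for_reclaim(ptes, folio_nr_pages):
    """Walk the PTEs once and return the list of folios queued for reclaim."""
    reclaim_list = []
    i = 0
    while i < len(ptes):
        pte = ptes[i]
        if pte is None:                      # nothing mapped here
            i += 1
            continue
        folio, page = pte
        nr = folio_nr_pages[folio]
        if nr == 1:                          # small folio: add it directly
            reclaim_list.append(folio)
            i += 1
            continue
        if page == 0 and pte_batch(ptes, i) == nr:
            # Fully mapped large folio: add once, skip the other nr - 1 PTEs.
            reclaim_list.append(folio)
            i += nr
            continue
        # Partially mapped large folio: model the split in madvise, so this
        # PTE now maps its own small folio.
        reclaim_list.append(f"{folio}/page{page}")
        i += 1
    return reclaim_list


# The 8-PTE layout from the thread (page indices of folio1 are assumed).
sizes = {"folio1": 8, "folio2": 1, "folio3": 1, "folio4": 1, "folio5": 1, "big": 16}
partial = [("folio1", 0), ("folio1", 1), ("folio2", 0), ("folio3", 0),
           ("folio1", 4), ("folio1", 5), ("folio4", 0), ("folio5", 0)]
# A 16-PTE range that fully maps one large folio.
full = [("big", k) for k in range(16)]

print(collect_for_reclaim(partial, sizes))
print(collect_for_reclaim(full, sizes))
```

With the split, no folio can be queued twice and the walk needs no "does this PTE belong to the previously added folio" check, which is the point argued above. Without the split, the same loop would have to remember folio1 across PTE2/PTE3 and recognize it again at PTE4.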