On Tue, Feb 27, 2024 at 8:42 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>
> On 2/27/24 15:21, Barry Song wrote:
> > On Tue, Feb 27, 2024 at 8:11 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >>
> >> On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
> >>>
> >>> On 2/27/24 14:40, Barry Song wrote:
> >>>> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
> >>>>>
> >>>>> On 2/27/24 10:17, Barry Song wrote:
> >>>>>>> Like if we hit folio which is partially mapped to the range, don't split it but
> >>>>>>> just unmap the mapping part from the range. Let page reclaim decide whether
> >>>>>>> split the large folio or not (If it's not mapped to any other range,it will be
> >>>>>>> freed as whole large folio. If part of it still mapped to other range,page reclaim
> >>>>>>> can decide whether to split it or ignore it for current reclaim cycle).
> >>>>>> Yes, we can. but we still have to play the ptes check game to avoid adding
> >>>>>> folios multiple times to reclaim the list.
> >>>>>>
> >>>>>> I don't see too much difference between splitting in madvise and splitting
> >>>>>> in vmscan. as our real purpose is avoiding splitting entirely mapped
> >>>>>> large folios. for partial mapped large folios, if we split in madvise, then
> >>>>>> we don't need to play the game of skipping folios while iterating PTEs.
> >>>>>> if we don't split in madvise, we have to make sure the large folio is only
> >>>>>> added in reclaimed list one time by checking if PTEs belong to the
> >>>>>> previous added folio.
> >>>>>
> >>>>> If the partial mapped large folio is unmapped from the range, the related PTE
> >>>>> become none. How could the folio be added to reclaimed list multiple times?
> >>>>
> >>>> in case we have 16 PTEs in a large folio.
> >>>> PTE0 present
> >>>> PTE1 present
> >>>> PTE2 present
> >>>> PTE3 none
> >>>> PTE4 present
> >>>> PTE5 none
> >>>> PTE6 present
> >>>> ....
> >>>> the current code is scanning PTE one by one.
> >>>> while scanning PTE0, we have added the folio. then PTE1, PTE2, PTE4, PTE6...
> >>> No. Before detect the folio is fully mapped to the range, we can't add folio
> >>> to reclaim list because the partial mapped folio shouldn't be added. We can
> >>> only scan PTE15 and know it's fully mapped.
> >>
> >> you never know PTE15 is the last one mapping to the large folio, PTE15 can
> >> be mapping to a completely different folio with PTE0.
> >>
> >>>
> >>> So, when scanning PTE0, we will not add folio. Then when hit PTE3, we know
> >>> this is a partial mapped large folio. We will unmap it. Then all 16 PTEs
> >>> become none.
> >>
> >> I don't understand why all 16PTEs become none as we set PTEs to none.
> >> we set PTEs to swap entries till try_to_unmap_one called by vmscan.
> >>
> >>>
> >>> If the large folio is fully mapped, the folio will be added to reclaim list
> >>> after scan PTE15 and know it's fully mapped.
> >>
> >> our approach is calling pte_batch_pte while meeting the first pte, if
> >> pte_batch_pte = 16,
> >> then we add this folio to reclaim_list and skip the left 15 PTEs.
> >
> > Let's compare two different implementation, for partial mapped large folio
> > with 8 PTEs as below,
> >
> > PTE0 present for large folio1
> > PTE1 present for large folio1
> > PTE2 present for another folio2
> > PTE3 present for another folio3
> > PTE4 present for large folio1
> > PTE5 present for large folio1
> > PTE6 present for another folio4
> > PTE7 present for another folio5
> >
> > If we don't split in madvise(depend on vmscan to split after adding
> > folio1), we will have
> Let me clarify something here:
>
> I prefer that we don't split large folio here. Instead, we unmap the
> large folio from this VMA range (I think you missed the unmap operation
> I mentioned).

I don't understand why we would unmap, as this is MADV_PAGEOUT, not an
unmap. Unmapping totally changes the semantics. Would you like to show
pseudo code?
For MADV_PAGEOUT on swap-out, the last step is writing swap entries to
replace the PTEs that are present. I don't understand how an unmap can
be involved in this process.

>
> The intention is trying best to avoid splitting the large folio. If
> the folio is only partially mapped to this VMA range, it's likely it
> will be reclaimed as whole large folio. Which brings benefit for lru
> and zone lock contention comparing to splitting large folio.

Which also brings negative side effects such as redundant I/O. For
example, if you have only one subpage left in a large folio, pageout
will still write nr_pages subpages into swap, then immediately free
them in swap.

>
> The thing I am not sure is unmapping from specific VMA range is not
> available and whether it's worthy to add it.

I think it might be possible to write some complex code that adds
folio1, folio2, folio3, folio4 and folio5 in the above example to
reclaim_list while avoiding splitting folio1, but I really don't
understand how unmap would work.

>
> > to make sure folio1, folio2, folio3, folio4, folio5 are added to
> > reclaim_list by doing a complex
> > game while scanning these 8 PTEs.
> >
> > if we split in madvise, they become:
> >
> > PTE0 present for large folioA - splitted from folio 1
> > PTE1 present for large folioB - splitted from folio 1
> > PTE2 present for another folio2
> > PTE3 present for another folio3
> > PTE4 present for large folioC - splitted from folio 1
> > PTE5 present for large folioD - splitted from folio 1
> > PTE6 present for another folio4
> > PTE7 present for another folio5
> >
> > we simply add the above 8 folios into reclaim_list one by one.
> >
> > I would vote for splitting for partial mapped large folio in madvise.
> >

Thanks
Barry
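
For readers following the thread, the two approaches compared in the
8-PTE example can be sketched as a toy userspace model. This is only an
illustration under loud assumptions: it is not kernel code, each "PTE"
is reduced to the integer id of the folio it maps (0 meaning none), and
every name below (struct reclaim_list, collect_without_split,
collect_with_split, redundant_swap_writes) is hypothetical and does not
exist in mm/madvise.c or mm/vmscan.c. It only shows why the no-split
path needs a "was this folio already added?" check on every PTE, why
the split path can add folios one by one, and how the redundant I/O
count in the one-subpage-left case comes about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RECLAIM 32

/* Hypothetical stand-in for the reclaim list: just an array of folio ids. */
struct reclaim_list {
	int folio[MAX_RECLAIM];
	size_t len;
};

bool list_contains(const struct reclaim_list *l, int id)
{
	for (size_t i = 0; i < l->len; i++)
		if (l->folio[i] == id)
			return true;
	return false;
}

void list_add(struct reclaim_list *l, int id)
{
	assert(l->len < MAX_RECLAIM);
	l->folio[l->len++] = id;
}

/*
 * No split in madvise: several non-adjacent PTEs may map the same large
 * folio, so every PTE needs a check to keep that folio from being added
 * to the list more than once.
 */
size_t collect_without_split(const int *pte, size_t n, struct reclaim_list *out)
{
	out->len = 0;
	for (size_t i = 0; i < n; i++) {
		if (pte[i] == 0)	/* none PTE, nothing to reclaim */
			continue;
		if (!list_contains(out, pte[i]))
			list_add(out, pte[i]);
	}
	return out->len;
}

/*
 * Split in madvise: each PTE that mapped the partially mapped large folio
 * (large_id) now maps its own small folio (ids handed out from next_id),
 * so every present PTE is simply added, with no lookup.
 */
size_t collect_with_split(int *pte, size_t n, int large_id, int next_id,
			  struct reclaim_list *out)
{
	out->len = 0;
	for (size_t i = 0; i < n; i++) {
		if (pte[i] == 0)
			continue;
		if (pte[i] == large_id)
			pte[i] = next_id++;	/* model of the split result */
		list_add(out, pte[i]);
	}
	return out->len;
}

/*
 * Subpages written to swap without need when a large folio of nr_pages is
 * paged out whole although only `mapped` subpages are still in use.
 */
size_t redundant_swap_writes(size_t nr_pages, size_t mapped)
{
	assert(mapped <= nr_pages);
	return nr_pages - mapped;
}
```

With the layout from the example (folio1 at PTE0, PTE1, PTE4, PTE5 and
folio2..folio5 at the remaining slots), the first function ends up with
five list entries and the second with eight, matching the counts given
above. The linear list_contains() lookup is only the simplest possible
model of the "ptes check game"; a real implementation would presumably
compare against the previously added folio or use batching instead.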