On 2/27/24 15:54, Barry Song wrote:
> On Tue, Feb 27, 2024 at 8:42 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>>
>>
>>
>> On 2/27/24 15:21, Barry Song wrote:
>>> On Tue, Feb 27, 2024 at 8:11 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
>>>>
>>>> On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2/27/24 14:40, Barry Song wrote:
>>>>>> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2/27/24 10:17, Barry Song wrote:
>>>>>>>>> Like if we hit folio which is partially mapped to the range, don't split it but
>>>>>>>>> just unmap the mapping part from the range. Let page reclaim decide whether
>>>>>>>>> split the large folio or not (If it's not mapped to any other range,it will be
>>>>>>>>> freed as whole large folio. If part of it still mapped to other range,page reclaim
>>>>>>>>> can decide whether to split it or ignore it for current reclaim cycle).
>>>>>>>> Yes, we can. but we still have to play the ptes check game to avoid adding
>>>>>>>> folios multiple times to reclaim the list.
>>>>>>>>
>>>>>>>> I don't see too much difference between splitting in madvise and splitting
>>>>>>>> in vmscan. as our real purpose is avoiding splitting entirely mapped
>>>>>>>> large folios. for partial mapped large folios, if we split in madvise, then
>>>>>>>> we don't need to play the game of skipping folios while iterating PTEs.
>>>>>>>> if we don't split in madvise, we have to make sure the large folio is only
>>>>>>>> added in reclaimed list one time by checking if PTEs belong to the
>>>>>>>> previous added folio.
>>>>>>>
>>>>>>> If the partial mapped large folio is unmapped from the range, the related PTE
>>>>>>> become none. How could the folio be added to reclaimed list multiple times?
>>>>>>
>>>>>> in case we have 16 PTEs in a large folio.
>>>>>> PTE0 present
>>>>>> PTE1 present
>>>>>> PTE2 present
>>>>>> PTE3 none
>>>>>> PTE4 present
>>>>>> PTE5 none
>>>>>> PTE6 present
>>>>>> ....
>>>>>> the current code is scanning PTE one by one.
>>>>>> while scanning PTE0, we have added the folio. then PTE1, PTE2, PTE4, PTE6...
>>>>> No. Before detect the folio is fully mapped to the range, we can't add folio
>>>>> to reclaim list because the partial mapped folio shouldn't be added. We can
>>>>> only scan PTE15 and know it's fully mapped.
>>>>
>>>> you never know PTE15 is the last one mapping to the large folio, PTE15 can
>>>> be mapping to a completely different folio with PTE0.
>>>>
>>>>>
>>>>> So, when scanning PTE0, we will not add folio. Then when hit PTE3, we know
>>>>> this is a partial mapped large folio. We will unmap it. Then all 16 PTEs
>>>>> become none.
>>>>
>>>> I don't understand why all 16PTEs become none as we set PTEs to none.
>>>> we set PTEs to swap entries till try_to_unmap_one called by vmscan.
>>>>
>>>>>
>>>>> If the large folio is fully mapped, the folio will be added to reclaim list
>>>>> after scan PTE15 and know it's fully mapped.
>>>>
>>>> our approach is calling pte_batch_pte while meeting the first pte, if
>>>> pte_batch_pte = 16,
>>>> then we add this folio to reclaim_list and skip the left 15 PTEs.
>>>
>>> Let's compare two different implementation, for partial mapped large folio
>>> with 8 PTEs as below,
>>>
>>> PTE0 present for large folio1
>>> PTE1 present for large folio1
>>> PTE2 present for another folio2
>>> PTE3 present for another folio3
>>> PTE4 present for large folio1
>>> PTE5 present for large folio1
>>> PTE6 present for another folio4
>>> PTE7 present for another folio5
>>>
>>> If we don't split in madvise(depend on vmscan to split after adding
>>> folio1), we will have
>> Let me clarify something here:
>>
>> I prefer that we don't split large folio here. Instead, we unmap the
>> large folio from this VMA range (I think you missed the unmap operation
>> I mentioned).
>
> I don't understand why we unmap as this is a MADV_PAGEOUT not
> an unmap. unmapping totally changes the semantics. Would you like
> to show pseudo code?
Oh. Yes. MADV_PAGEOUT is not suitable. What about MADV_FREE?

>
> for MADV_PAGEOUT on swap-out, the last step is writing swap entries
> to replace PTEs which are present. I don't understand how an unmap
> can be involved in this process.
>
>>
>> The intention is trying best to avoid splitting the large folio. If
>> the folio is only partially mapped to this VMA range, it's likely it
>> will be reclaimed as whole large folio. Which brings benefit for lru
>> and zone lock contention comparing to splitting large folio.
>
> which also brings negative side effects such as redundant I/O.
> For example, if you have only one subpage left in a large folio,
> pageout will still write nr_pages subpages into swap, then immediately
> free them in swap.
>
>>
>> The thing I am not sure is unmapping from specific VMA range is not
>> available and whether it's worthy to add it.
>
> I think we might have the possibility to have some complex code to
> add folio1, folio2, folio3, folio4 and folio5 in the above example into
> reclaim_list while avoiding splitting folio1. but i really don't understand
> how unmap will work.
>
>>
>>> to make sure folio1, folio2, folio3, folio4, folio5 are added to
>>> reclaim_list by doing a complex
>>> game while scanning these 8 PTEs.
>>>
>>> if we split in madvise, they become:
>>>
>>> PTE0 present for large folioA - splitted from folio 1
>>> PTE1 present for large folioB - splitted from folio 1
>>> PTE2 present for another folio2
>>> PTE3 present for another folio3
>>> PTE4 present for large folioC - splitted from folio 1
>>> PTE5 present for large folioD - splitted from folio 1
>>> PTE6 present for another folio4
>>> PTE7 present for another folio5
>>>
>>> we simply add the above 8 folios into reclaim_list one by one.
>>>
>>> I would vote for splitting for partial mapped large folio in madvise.
>>>
>
> Thanks
> Barry
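
For reference, the two scanning strategies compared in the quoted 8-PTE example can be modelled roughly in userspace. This is only a sketch, not kernel code: `batch_len`, `scan_without_split` and `scan_with_split` are hypothetical names, a PTE is modelled as a `(folio, page_index)` tuple (or `None` for a none PTE), and folio1 is assumed to be a 16-page folio of which only 4 subpages are mapped in the range. `batch_len` only stands in for the batching check mentioned in the thread; it is not the real helper.

```python
# Userspace model only; all names below are illustrative assumptions.

def batch_len(ptes, start):
    """Count consecutive PTEs from `start` mapping consecutive pages of one folio."""
    folio, first_page = ptes[start]
    n = 1
    while start + n < len(ptes) and ptes[start + n] == (folio, first_page + n):
        n += 1
    return n


def scan_without_split(ptes):
    """No split in madvise: each folio must be remembered so it is added only once."""
    reclaim_list, seen = [], set()
    for pte in ptes:
        if pte is None:
            continue
        folio = pte[0]
        if folio not in seen:  # the "PTE check game" from the thread
            seen.add(folio)
            reclaim_list.append(folio)
    return reclaim_list


def scan_with_split(ptes, nr_pages):
    """Split a partially mapped large folio on first contact, then add folios one by one."""
    ptes, nr_pages = list(ptes), dict(nr_pages)
    reclaim_list = []
    i = 0
    while i < len(ptes):
        if ptes[i] is None:
            i += 1
            continue
        folio = ptes[i][0]
        n = batch_len(ptes, i)
        if nr_pages[folio] > 1 and n < nr_pages[folio]:
            # Partially mapped large folio: model the split as one small folio per subpage.
            for j, pte in enumerate(ptes):
                if pte is not None and pte[0] == folio:
                    small = f"{folio}.{pte[1]}"
                    ptes[j] = (small, 0)
                    nr_pages[small] = 1
            continue  # rescan index i, which now maps a small folio
        # Small folio, or fully mapped large folio: add once and skip its batch.
        reclaim_list.append(folio)
        i += n
    return reclaim_list


# The 8-PTE example from the thread (folio1 assumed to have 16 pages).
EXAMPLE_PTES = [
    ("folio1", 0), ("folio1", 1), ("folio2", 0), ("folio3", 0),
    ("folio1", 2), ("folio1", 3), ("folio4", 0), ("folio5", 0),
]
EXAMPLE_NR_PAGES = {"folio1": 16, "folio2": 1, "folio3": 1, "folio4": 1, "folio5": 1}
```

In this model the no-split scan needs the `seen` bookkeeping to keep folio1 from being added at PTE0, PTE1, PTE4 and PTE5, while the split scan ends up with 8 independent entries and no bookkeeping. A fully mapped 16-PTE folio is added once and never split in `scan_with_split`.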