On Wed, Mar 13, 2024 at 10:36 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 13/03/2024 09:16, Barry Song wrote:
> > On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>
> >> On 13/03/2024 07:19, Barry Song wrote:
> >>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>>
> >>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >>>> folio that is fully and contiguously mapped in the pageout/cold vm
> >>>> range. This change means that large folios will be maintained all the
> >>>> way to swap storage. This both improves performance during swap-out, by
> >>>> eliding the cost of splitting the folio, and sets us up nicely for
> >>>> maintaining the large folio when it is swapped back in (to be covered in
> >>>> a separate series).
> >>>>
> >>>> Folios that are not fully mapped in the target range are still split,
> >>>> but note that behavior is changed so that if the split fails for any
> >>>> reason (folio locked, shared, etc) we now leave it as is and move to the
> >>>> next pte in the range and continue work on the subsequent folios.
> >>>> Previously any failure of this sort would cause the entire operation to
> >>>> give up and no folios mapped at higher addresses were paged out or made
> >>>> cold. Given large folios are becoming more common, this old behavior
> >>>> would have likely led to wasted opportunities.
> >>>>
> >>>> While we are at it, change the code that clears young from the ptes to
> >>>> use ptep_test_and_clear_young(), which is more efficient than
> >>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >>>> where the old approach would require unfolding/refolding and the new
> >>>> approach can be done in place.
> >>>>
> >>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
> >>>
> >>> This looks so much better than our initial RFC.
> >>> Thank you for your excellent work!
> >>
> >> Thanks - it's a team effort - I had your PoC and David's previous batching work
> >> to use as a template.
> >>
> >>>
> >>>> ---
> >>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >>>>  1 file changed, 51 insertions(+), 38 deletions(-)
> >>>>
> >>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>> index 547dcd1f7a39..56c7ba7bd558 100644
> >>>> --- a/mm/madvise.c
> >>>> +++ b/mm/madvise.c
> >>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>         LIST_HEAD(folio_list);
> >>>>         bool pageout_anon_only_filter;
> >>>>         unsigned int batch_count = 0;
> >>>> +       int nr;
> >>>>
> >>>>         if (fatal_signal_pending(current))
> >>>>                 return -EINTR;
> >>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>                 return 0;
> >>>>         flush_tlb_batched_pending(mm);
> >>>>         arch_enter_lazy_mmu_mode();
> >>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
> >>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >>>> +               nr = 1;
> >>>>                 ptent = ptep_get(pte);
> >>>>
> >>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> >>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>                         continue;
> >>>>
> >>>>                 /*
> >>>> -                * Creating a THP page is expensive so split it only if we
> >>>> -                * are sure it's worth. Split it if we are only owner.
> >>>> +                * If we encounter a large folio, only split it if it is not
> >>>> +                * fully mapped within the range we are operating on. Otherwise
> >>>> +                * leave it as is so that it can be swapped out whole. If we
> >>>> +                * fail to split a folio, leave it in place and advance to the
> >>>> +                * next pte in the range.
> >>>>                  */
> >>>>                 if (folio_test_large(folio)) {
> >>>> -                       int err;
> >>>> -
> >>>> -                       if (folio_estimated_sharers(folio) > 1)
> >>>> -                               break;
> >>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> -                               break;
> >>>> -                       if (!folio_trylock(folio))
> >>>> -                               break;
> >>>> -                       folio_get(folio);
> >>>> -                       arch_leave_lazy_mmu_mode();
> >>>> -                       pte_unmap_unlock(start_pte, ptl);
> >>>> -                       start_pte = NULL;
> >>>> -                       err = split_folio(folio);
> >>>> -                       folio_unlock(folio);
> >>>> -                       folio_put(folio);
> >>>> -                       if (err)
> >>>> -                               break;
> >>>> -                       start_pte = pte =
> >>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>> -                       if (!start_pte)
> >>>> -                               break;
> >>>> -                       arch_enter_lazy_mmu_mode();
> >>>> -                       pte--;
> >>>> -                       addr -= PAGE_SIZE;
> >>>> -                       continue;
> >>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >>>> +                                               FPB_IGNORE_SOFT_DIRTY;
> >>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
> >>>> +
> >>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>> +                                            fpb_flags, NULL);
> >>>
> >>> I wonder if we have a quick way to avoid folio_pte_batch() if users
> >>> are doing madvise() on a portion of a large folio.
> >>
> >> Good idea. Something like this?:
> >>
> >> if (pte_pfn(pte) == folio_pfn(folio)
> >
> > what about
> >
> > "If (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"
> >
> > just to account for cases where the user's end address falls within
> > the middle of a large folio?
>
> yes, even better. I'll add this for the next version.
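
Just to make sure we mean the same thing, a rough sketch of the check as I
understand it (untested; note it compares ptent rather than pte, since pte
is a pointer at this point in the loop):

		if (folio_test_large(folio)) {
			const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
						FPB_IGNORE_SOFT_DIRTY;
			int max_nr = (end - addr) / PAGE_SIZE;

			/*
			 * Only scan the ptes when the first pte maps the
			 * first page of the folio and the range is long
			 * enough to cover the whole folio; otherwise this
			 * cannot be a full mapping, so leave nr == 1 and
			 * fall through to the split path below.
			 */
			if (pte_pfn(ptent) == folio_pfn(folio) &&
			    max_nr >= folio_nr_pages(folio))
				nr = folio_pte_batch(folio, addr, pte, ptent,
						     max_nr, fpb_flags, NULL);

That way a pte pointing into the middle of a large folio, or a range that
ends before the folio does, goes straight to the split path without walking
up to max_nr ptes.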
>
> >
> > BTW, another minor issue is here:
> >
> >         if (++batch_count == SWAP_CLUSTER_MAX) {
> >                 batch_count = 0;
> >                 if (need_resched()) {
> >                         arch_leave_lazy_mmu_mode();
> >                         pte_unmap_unlock(start_pte, ptl);
> >                         cond_resched();
> >                         goto restart;
> >                 }
> >         }
> >
> > We are increasing batch_count by 1 for nr ptes; thus, are we holding the
> > PTL longer than in the small-folio case? We used to increase it by 1 for
> > each PTE.
>
> Does it matter? I thought about that, but the vast majority of the work is
> per-folio, not per-pte. So I concluded it would be best to continue to
> increment per-folio.

Okay. The original patch, commit b2f557a21bc8 ("mm/madvise: add
cond_resched() in madvise_cold_or_pageout_pte_range()"), primarily addressed
real-time wake-up latency. MADV_PAGEOUT and MADV_COLD are much less critical
than scenarios where operations like do_anon_page or do_swap_page genuinely
need the PTL to make progress. So an approach that releases the PTL
relatively aggressively seems to neither harm MADV_PAGEOUT/COLD nor
disadvantage others.

We do hold the PTL slightly longer, because iterating in folio_pte_batch()
can take longer than the small-folio case, which doesn't need it. However,
compared to operations like folio_isolate_lru() and folio_deactivate(), that
increase seems negligible. We have also recently removed
ptep_test_and_clear_young() for MADV_PAGEOUT, which should benefit real-time
scenarios as well. Nonetheless, there is a small risk with large folios, such
as 1 MiB mTHP, where we may need to loop 256 times in folio_pte_batch().

I would vote for increasing batch_count by 'nr', or maybe by
max(log2(nr), 1), rather than by 1 (see the sketch below), for two reasons:

1. We are not making MADV_PAGEOUT/COLD worse; in fact, we are improving
   them by reducing the time taken to put the same number of pages into
   the reclaim list.

2. MADV_PAGEOUT/COLD scenarios are not urgent compared to others that
   genuinely require the PTL to progress.

Moreover, the majority of the time spent on MADV_PAGEOUT is actually in
reclaim_pages().
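
Not trying to be prescriptive - just to make the proposal concrete, something
along these lines (untested) is what I have in mind, with nr initialised to 0
before the loop so that the first pass adds nothing:

	for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
		/*
		 * Credit all the ptes covered by the previous iteration,
		 * not just one per folio, so that we release the PTL and
		 * reschedule roughly once per SWAP_CLUSTER_MAX ptes
		 * regardless of folio size.
		 */
		batch_count += nr;
		nr = 0;
		if (batch_count >= SWAP_CLUSTER_MAX) {
			batch_count = 0;
			if (need_resched()) {
				arch_leave_lazy_mmu_mode();
				pte_unmap_unlock(start_pte, ptl);
				cond_resched();
				goto restart;
			}
		}

		nr = 1;
		ptent = ptep_get(pte);
		...

Clearing nr before the resched block also avoids double-counting the same
ptes after a goto restart. The max(log2(nr), 1) variant would just replace
the "batch_count += nr" line if we want large folios to count for less than
their full pte count.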
>
> >
> >> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>                      fpb_flags, NULL);
> >>
> >> If we are not mapping the first page of the folio, then it can't be a full
> >> mapping, so no need to call folio_pte_batch(). Just split it.
> >>
> >>>
> >>>> +
> >>>> +                       if (nr < folio_nr_pages(folio)) {
> >>>> +                               int err;
> >>>> +
> >>>> +                               if (folio_estimated_sharers(folio) > 1)
> >>>> +                                       continue;
> >>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> +                                       continue;
> >>>> +                               if (!folio_trylock(folio))
> >>>> +                                       continue;
> >>>> +                               folio_get(folio);
> >>>> +                               arch_leave_lazy_mmu_mode();
> >>>> +                               pte_unmap_unlock(start_pte, ptl);
> >>>> +                               start_pte = NULL;
> >>>> +                               err = split_folio(folio);
> >>>> +                               folio_unlock(folio);
> >>>> +                               folio_put(folio);
> >>>> +                               if (err)
> >>>> +                                       continue;
> >>>> +                               start_pte = pte =
> >>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>> +                               if (!start_pte)
> >>>> +                                       break;
> >>>> +                               arch_enter_lazy_mmu_mode();
> >>>> +                               nr = 0;
> >>>> +                               continue;
> >>>> +                       }
> >>>>                 }
> >>>>
> >>>>                 /*
> >>>>                  * Do not interfere with other mappings of this folio and
> >>>> -                * non-LRU folio.
> >>>> +                * non-LRU folio. If we have a large folio at this point, we
> >>>> +                * know it is fully mapped so if its mapcount is the same as its
> >>>> +                * number of pages, it must be exclusive.
> >>>>                  */
> >>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >>>> +               if (!folio_test_lru(folio) ||
> >>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
> >>>>                         continue;
> >>>
> >>> This looks so perfect and is exactly what I wanted to achieve.
> >>>
> >>>>
> >>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>                         continue;
> >>>>
> >>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>> -
> >>>> -               if (!pageout && pte_young(ptent)) {
> >>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>> -                                                       tlb->fullmm);
> >>>> -                       ptent = pte_mkold(ptent);
> >>>> -                       set_pte_at(mm, addr, pte, ptent);
> >>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>> +               if (!pageout) {
> >>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
> >>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>> +                       }
> >>>
> >>> This looks so smart. If it is not pageout, we have increased pte
> >>> and addr here; so nr is 0 and we don't need to increase again in
> >>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>
> >>> otherwise, nr won't be 0. so we will increase addr and
> >>> pte by nr.
> >>
> >> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >> madvise_free_pte_range().
> >>
> >>>
> >>>>                 }
> >>>>
> >>>>                 /*
> >>>> --
> >>>> 2.25.1
> >>>>
> >>>
> >>> Overall, LGTM,
> >>>
> >>> Reviewed-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>
> >> Thanks!
> >>

Thanks
Barry