On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> folio that is fully and contiguously mapped in the pageout/cold vm
> range. This change means that large folios will be maintained all the
> way to swap storage. This both improves performance during swap-out, by
> eliding the cost of splitting the folio, and sets us up nicely for
> maintaining the large folio when it is swapped back in (to be covered in
> a separate series).
>
> Folios that are not fully mapped in the target range are still split,
> but note that behavior is changed so that if the split fails for any
> reason (folio locked, shared, etc) we now leave it as is and move to the
> next pte in the range and continue work on the subsequent folios.
> Previously any failure of this sort would cause the entire operation to
> give up and no folios mapped at higher addresses were paged out or made
> cold. Given large folios are becoming more common, this old behavior
> would have likely led to wasted opportunities.
>
> While we are at it, change the code that clears young from the ptes to
> use ptep_test_and_clear_young(), which is more efficient than
> get_and_clear/modify/set, especially for contpte mappings on arm64,
> where the old approach would require unfolding/refolding and the new
> approach can be done in place.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>

This looks so much better than our initial RFC. Thank you for your
excellent work!

> ---
>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>  1 file changed, 51 insertions(+), 38 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 547dcd1f7a39..56c7ba7bd558 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  	LIST_HEAD(folio_list);
>  	bool pageout_anon_only_filter;
>  	unsigned int batch_count = 0;
> +	int nr;
>
>  	if (fatal_signal_pending(current))
>  		return -EINTR;
> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		return 0;
>  	flush_tlb_batched_pending(mm);
>  	arch_enter_lazy_mmu_mode();
> -	for (; addr < end; pte++, addr += PAGE_SIZE) {
> +	for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> +		nr = 1;
>  		ptent = ptep_get(pte);
>
>  		if (++batch_count == SWAP_CLUSTER_MAX) {
> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  			continue;
>
>  		/*
> -		 * Creating a THP page is expensive so split it only if we
> -		 * are sure it's worth. Split it if we are only owner.
> +		 * If we encounter a large folio, only split it if it is not
> +		 * fully mapped within the range we are operating on. Otherwise
> +		 * leave it as is so that it can be swapped out whole. If we
> +		 * fail to split a folio, leave it in place and advance to the
> +		 * next pte in the range.
>  		 */
>  		if (folio_test_large(folio)) {
> -			int err;
> -
> -			if (folio_estimated_sharers(folio) > 1)
> -				break;
> -			if (pageout_anon_only_filter && !folio_test_anon(folio))
> -				break;
> -			if (!folio_trylock(folio))
> -				break;
> -			folio_get(folio);
> -			arch_leave_lazy_mmu_mode();
> -			pte_unmap_unlock(start_pte, ptl);
> -			start_pte = NULL;
> -			err = split_folio(folio);
> -			folio_unlock(folio);
> -			folio_put(folio);
> -			if (err)
> -				break;
> -			start_pte = pte =
> -				pte_offset_map_lock(mm, pmd, addr, &ptl);
> -			if (!start_pte)
> -				break;
> -			arch_enter_lazy_mmu_mode();
> -			pte--;
> -			addr -= PAGE_SIZE;
> -			continue;
> +			const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> +						FPB_IGNORE_SOFT_DIRTY;
> +			int max_nr = (end - addr) / PAGE_SIZE;
> +
> +			nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> +					     fpb_flags, NULL);

I wonder if we have a quick way to avoid folio_pte_batch() if users are
doing madvise() on a portion of a large folio.

> +
> +			if (nr < folio_nr_pages(folio)) {
> +				int err;
> +
> +				if (folio_estimated_sharers(folio) > 1)
> +					continue;
> +				if (pageout_anon_only_filter && !folio_test_anon(folio))
> +					continue;
> +				if (!folio_trylock(folio))
> +					continue;
> +				folio_get(folio);
> +				arch_leave_lazy_mmu_mode();
> +				pte_unmap_unlock(start_pte, ptl);
> +				start_pte = NULL;
> +				err = split_folio(folio);
> +				folio_unlock(folio);
> +				folio_put(folio);
> +				if (err)
> +					continue;
> +				start_pte = pte =
> +					pte_offset_map_lock(mm, pmd, addr, &ptl);
> +				if (!start_pte)
> +					break;
> +				arch_enter_lazy_mmu_mode();
> +				nr = 0;
> +				continue;
> +			}
>  		}
>
>  		/*
>  		 * Do not interfere with other mappings of this folio and
> -		 * non-LRU folio.
> +		 * non-LRU folio. If we have a large folio at this point, we
> +		 * know it is fully mapped so if its mapcount is the same as its
> +		 * number of pages, it must be exclusive.
>  		 */
> -		if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> +		if (!folio_test_lru(folio) ||
> +		    folio_mapcount(folio) != folio_nr_pages(folio))
>  			continue;

This looks so perfect and is exactly what I wanted to achieve.

>
>  		if (pageout_anon_only_filter && !folio_test_anon(folio))
>  			continue;
>
> -		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -
> -		if (!pageout && pte_young(ptent)) {
> -			ptent = ptep_get_and_clear_full(mm, addr, pte,
> -							tlb->fullmm);
> -			ptent = pte_mkold(ptent);
> -			set_pte_at(mm, addr, pte, ptent);
> -			tlb_remove_tlb_entry(tlb, pte, addr);
> +		if (!pageout) {
> +			for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> +				if (ptep_test_and_clear_young(vma, addr, pte))
> +					tlb_remove_tlb_entry(tlb, pte, addr);
> +			}

This looks so smart. If it is not pageout, we have already advanced pte and
addr in this inner loop, so nr ends up as 0 and we don't advance them again
in

	for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)

Otherwise, nr won't be 0, so we advance addr and pte by nr there.

>  		}
>
>  		/*
> --
> 2.25.1
>

Overall, LGTM,

Reviewed-by: Barry Song <v-songbaohua@xxxxxxxx>
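
For readers following the nr accounting discussed above, here is a minimal
userspace sketch of the same advance-by-nr loop pattern. It is illustrative
only: the names, the batch size of 3, and the halfway split are made up and
are not from mm/madvise.c; it just shows the two cases where either the loop
header advances by nr, or the body advances the cursor itself and leaves nr
at 0.

/* Illustrative only: not kernel code; all names and numbers are made up. */
#include <stdio.h>

#define N 8

int main(void)
{
	int i, nr;

	for (i = 0; i < N; i += nr) {
		nr = 1;		/* default: handle one entry per iteration */

		if (i < N / 2) {
			/*
			 * Batched case: decide up front how many entries this
			 * iteration covers (capped at what is left) and let
			 * the loop header advance by nr, as the pageout path
			 * does after folio_pte_batch().
			 */
			nr = (N - i) < 3 ? (N - i) : 3;
			printf("batched: i=%d nr=%d\n", i, nr);
		} else {
			/*
			 * Inline case: walk the entries here, advancing i
			 * ourselves; nr reaches 0, so the loop header adds
			 * nothing, like the !pageout branch in the patch.
			 */
			int max = (N - i) < 2 ? (N - i) : 2;

			for (nr = max; nr != 0; nr--, i++)
				printf("inline:  i=%d\n", i);
		}
	}
	return 0;
}

Running it prints two batched iterations followed by two inline steps,
mirroring how the outer loop only advances by nr when the body has not
already moved the cursor.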