On 13/03/2024 09:16, Barry Song wrote:
> On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>
>> On 13/03/2024 07:19, Barry Song wrote:
>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>>>
>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>>>> folio that is fully and contiguously mapped in the pageout/cold vm
>>>> range. This change means that large folios will be maintained all the
>>>> way to swap storage. This both improves performance during swap-out, by
>>>> eliding the cost of splitting the folio, and sets us up nicely for
>>>> maintaining the large folio when it is swapped back in (to be covered in
>>>> a separate series).
>>>>
>>>> Folios that are not fully mapped in the target range are still split,
>>>> but note that behavior is changed so that if the split fails for any
>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
>>>> next pte in the range and continue work on the subsequent folios.
>>>> Previously any failure of this sort would cause the entire operation to
>>>> give up and no folios mapped at higher addresses were paged out or made
>>>> cold. Given large folios are becoming more common, this old behavior
>>>> would likely have led to wasted opportunities.
>>>>
>>>> While we are at it, change the code that clears young from the ptes to
>>>> use ptep_test_and_clear_young(), which is more efficient than
>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
>>>> where the old approach would require unfolding/refolding and the new
>>>> approach can be done in place.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>
>>> This looks so much better than our initial RFC.
>>> Thank you for your excellent work!
>>
>> Thanks - it's a team effort - I had your PoC and David's previous batching
>> work to use as a template.
>>
>>>
>>>> ---
>>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>>>>  1 file changed, 51 insertions(+), 38 deletions(-)
>>>>
>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>> index 547dcd1f7a39..56c7ba7bd558 100644
>>>> --- a/mm/madvise.c
>>>> +++ b/mm/madvise.c
>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>         LIST_HEAD(folio_list);
>>>>         bool pageout_anon_only_filter;
>>>>         unsigned int batch_count = 0;
>>>> +       int nr;
>>>>
>>>>         if (fatal_signal_pending(current))
>>>>                 return -EINTR;
>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>                 return 0;
>>>>         flush_tlb_batched_pending(mm);
>>>>         arch_enter_lazy_mmu_mode();
>>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
>>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>>>> +               nr = 1;
>>>>                 ptent = ptep_get(pte);
>>>>
>>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>                         continue;
>>>>
>>>>                 /*
>>>> -                * Creating a THP page is expensive so split it only if we
>>>> -                * are sure it's worth. Split it if we are only owner.
>>>> +                * If we encounter a large folio, only split it if it is not
>>>> +                * fully mapped within the range we are operating on. Otherwise
>>>> +                * leave it as is so that it can be swapped out whole. If we
>>>> +                * fail to split a folio, leave it in place and advance to the
>>>> +                * next pte in the range.
>>>>                  */
>>>>                 if (folio_test_large(folio)) {
>>>> -                       int err;
>>>> -
>>>> -                       if (folio_estimated_sharers(folio) > 1)
>>>> -                               break;
>>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> -                               break;
>>>> -                       if (!folio_trylock(folio))
>>>> -                               break;
>>>> -                       folio_get(folio);
>>>> -                       arch_leave_lazy_mmu_mode();
>>>> -                       pte_unmap_unlock(start_pte, ptl);
>>>> -                       start_pte = NULL;
>>>> -                       err = split_folio(folio);
>>>> -                       folio_unlock(folio);
>>>> -                       folio_put(folio);
>>>> -                       if (err)
>>>> -                               break;
>>>> -                       start_pte = pte =
>>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>> -                       if (!start_pte)
>>>> -                               break;
>>>> -                       arch_enter_lazy_mmu_mode();
>>>> -                       pte--;
>>>> -                       addr -= PAGE_SIZE;
>>>> -                       continue;
>>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>> +                                               FPB_IGNORE_SOFT_DIRTY;
>>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
>>>> +
>>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>> +                                            fpb_flags, NULL);
>>>
>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
>>> are doing madvise() on a portion of a large folio.
>>
>> Good idea. Something like this?:
>>
>>         if (pte_pfn(ptent) == folio_pfn(folio))
>
> what about
>
> "if (pte_pfn(ptent) == folio_pfn(folio) && max_nr >= nr_pages)"
>
> just to account for cases where the user's end address falls within
> the middle of a large folio?

yes, even better. I'll add this for the next version.
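Something along these lines is what I have in mind (just a rough sketch;
nr_pages would be a new local set from folio_nr_pages(folio), and the exact
shape may change in the next version):

        int nr_pages = folio_nr_pages(folio);

        /*
         * Only scan the ptes when the first pte of the batch maps the
         * first page of the folio and the remaining range is large
         * enough to cover the whole folio. Otherwise nr stays 1 and
         * the folio falls through to the split path below.
         */
        if (pte_pfn(ptent) == folio_pfn(folio) && max_nr >= nr_pages)
                nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
                                     fpb_flags, NULL);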
>
>
> BTW, another minor issue is here:
>
>         if (++batch_count == SWAP_CLUSTER_MAX) {
>                 batch_count = 0;
>                 if (need_resched()) {
>                         arch_leave_lazy_mmu_mode();
>                         pte_unmap_unlock(start_pte, ptl);
>                         cond_resched();
>                         goto restart;
>                 }
>         }
>
> We are increasing batch_count by 1 for nr ptes; thus, we are holding the
> PTL longer than in the small-folio case? We used to increase it by 1 for
> each pte. Does it matter?

I thought about that, but the vast majority of the work is per-folio, not
per-pte. So I concluded it would be best to continue to increment per-folio.
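For the record, per-pte accounting would be a small tweak along these lines
(illustrative only - I still think per-folio is the right trade-off):

        /* '>=' instead of '==' so an overshooting batch still resets. */
        if (++batch_count >= SWAP_CLUSTER_MAX) {
                batch_count = 0;
                if (need_resched()) {
                        arch_leave_lazy_mmu_mode();
                        pte_unmap_unlock(start_pte, ptl);
                        cond_resched();
                        goto restart;
                }
        }

and then, right after folio_pte_batch() has computed nr:

        /*
         * Charge the rest of the batch so the resched check fires as
         * often as it would if each pte were visited individually.
         */
        batch_count += nr - 1;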
>
>>         nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>                              fpb_flags, NULL);
>>
>> If we are not mapping the first page of the folio, then it can't be a full
>> mapping, so no need to call folio_pte_batch(). Just split it.
>>
>>>
>>>> +
>>>> +                       if (nr < folio_nr_pages(folio)) {
>>>> +                               int err;
>>>> +
>>>> +                               if (folio_estimated_sharers(folio) > 1)
>>>> +                                       continue;
>>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> +                                       continue;
>>>> +                               if (!folio_trylock(folio))
>>>> +                                       continue;
>>>> +                               folio_get(folio);
>>>> +                               arch_leave_lazy_mmu_mode();
>>>> +                               pte_unmap_unlock(start_pte, ptl);
>>>> +                               start_pte = NULL;
>>>> +                               err = split_folio(folio);
>>>> +                               folio_unlock(folio);
>>>> +                               folio_put(folio);
>>>> +                               if (err)
>>>> +                                       continue;
>>>> +                               start_pte = pte =
>>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>> +                               if (!start_pte)
>>>> +                                       break;
>>>> +                               arch_enter_lazy_mmu_mode();
>>>> +                               nr = 0;
>>>> +                               continue;
>>>> +                       }
>>>>                 }
>>>>
>>>>                 /*
>>>>                  * Do not interfere with other mappings of this folio and
>>>> -                * non-LRU folio.
>>>> +                * non-LRU folio. If we have a large folio at this point, we
>>>> +                * know it is fully mapped so if its mapcount is the same as its
>>>> +                * number of pages, it must be exclusive.
>>>>                  */
>>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>>>> +               if (!folio_test_lru(folio) ||
>>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
>>>>                         continue;
>>>
>>> This looks so perfect and is exactly what I wanted to achieve.
>>>
>>>>
>>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>                         continue;
>>>>
>>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>> -
>>>> -               if (!pageout && pte_young(ptent)) {
>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>> -                                                       tlb->fullmm);
>>>> -                       ptent = pte_mkold(ptent);
>>>> -                       set_pte_at(mm, addr, pte, ptent);
>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>> +               if (!pageout) {
>>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>> +                       }
>>>
>>> This looks so smart. If it is not pageout, we have already increased pte
>>> and addr here, so nr is 0 and we don't need to increase them again in
>>>
>>>     for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>
>>> Otherwise, nr won't be 0, so we will increase addr and pte by nr.
>>
>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>> madvise_free_pte_range().
>>
>>
>>>
>>>
>>>>                 }
>>>>
>>>>                 /*
>>>> --
>>>> 2.25.1
>>>>
>>>
>>> Overall, LGTM,
>>>
>>> Reviewed-by: Barry Song <v-songbaohua@xxxxxxxx>
>>
>> Thanks!
>>
>>