On Wed, Mar 20, 2024 at 9:49 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> Hi Lance, Barry,
>
> Sorry - I totally missed this when you originally sent it!

No worries at all :)

> On 13/03/2024 14:02, Lance Yang wrote:
> > On Wed, Mar 13, 2024 at 5:03 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>
> >> On 13/03/2024 07:19, Barry Song wrote:
> >>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>>
> >>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >>>> folio that is fully and contiguously mapped in the pageout/cold vm
> >>>> range. This change means that large folios will be maintained all the
> >>>> way to swap storage. This both improves performance during swap-out, by
> >>>> eliding the cost of splitting the folio, and sets us up nicely for
> >>>> maintaining the large folio when it is swapped back in (to be covered in
> >>>> a separate series).
> >>>>
> >>>> Folios that are not fully mapped in the target range are still split,
> >>>> but note that behavior is changed so that if the split fails for any
> >>>> reason (folio locked, shared, etc) we now leave it as is and move to the
> >>>> next pte in the range and continue work on the subsequent folios.
> >>>> Previously any failure of this sort would cause the entire operation to
> >>>> give up and no folios mapped at higher addresses were paged out or made
> >>>> cold. Given large folios are becoming more common, this old behavior
> >>>> would likely have led to wasted opportunities.
> >>>>
> >>>> While we are at it, change the code that clears young from the ptes to
> >>>> use ptep_test_and_clear_young(), which is more efficient than
> >>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >>>> where the old approach would require unfolding/refolding and the new
> >>>> approach can be done in place.
> >>>>
> >>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
> >>>
> >>> This looks so much better than our initial RFC.
> >>> Thank you for your excellent work!
> >>
> >> Thanks - it's a team effort - I had your PoC and David's previous batching work
> >> to use as a template.
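As a concrete illustration of the case the patch optimizes - a large folio
fully covered by the madvise() range - here is a minimal userspace sketch.
It assumes Linux 5.4+ (for MADV_PAGEOUT), a 2M PMD size, and THP enabled;
whether a large folio actually forms is best-effort, hence the manual
alignment:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define PMD_SZ (2UL * 1024 * 1024)

int main(void)
{
        /* Map extra so a 2M-aligned, 2M-sized window fits inside. */
        size_t maplen = 2 * PMD_SZ;
        char *raw = mmap(NULL, maplen, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *buf;

        if (raw == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        buf = (char *)(((uintptr_t)raw + PMD_SZ - 1) & ~(PMD_SZ - 1));

        madvise(buf, PMD_SZ, MADV_HUGEPAGE);    /* ask for THP backing */
        memset(buf, 1, PMD_SZ);                 /* fault the folio in */

        /*
         * The (hopefully large) folio is fully covered by the range, so
         * with this patch it can be swapped out whole instead of being
         * split first.
         */
        if (madvise(buf, PMD_SZ, MADV_PAGEOUT))
                perror("madvise(MADV_PAGEOUT)");

        munmap(raw, maplen);
        return 0;
}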
> >>>
> >>>> ---
> >>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >>>>  1 file changed, 51 insertions(+), 38 deletions(-)
> >>>>
> >>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>> index 547dcd1f7a39..56c7ba7bd558 100644
> >>>> --- a/mm/madvise.c
> >>>> +++ b/mm/madvise.c
> >>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>          LIST_HEAD(folio_list);
> >>>>          bool pageout_anon_only_filter;
> >>>>          unsigned int batch_count = 0;
> >>>> +        int nr;
> >>>>
> >>>>          if (fatal_signal_pending(current))
> >>>>                  return -EINTR;
> >>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>                  return 0;
> >>>>          flush_tlb_batched_pending(mm);
> >>>>          arch_enter_lazy_mmu_mode();
> >>>> -        for (; addr < end; pte++, addr += PAGE_SIZE) {
> >>>> +        for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >>>> +                nr = 1;
> >>>>                  ptent = ptep_get(pte);
> >>>>
> >>>>                  if (++batch_count == SWAP_CLUSTER_MAX) {
> >>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>                          continue;
> >>>>
> >>>>                  /*
> >>>> -                 * Creating a THP page is expensive so split it only if we
> >>>> -                 * are sure it's worth. Split it if we are only owner.
> >>>> +                 * If we encounter a large folio, only split it if it is not
> >>>> +                 * fully mapped within the range we are operating on. Otherwise
> >>>> +                 * leave it as is so that it can be swapped out whole. If we
> >>>> +                 * fail to split a folio, leave it in place and advance to the
> >>>> +                 * next pte in the range.
> >>>>                   */
> >>>>                  if (folio_test_large(folio)) {
> >>>> -                        int err;
> >>>> -
> >>>> -                        if (folio_estimated_sharers(folio) > 1)
> >>>> -                                break;
> >>>> -                        if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> -                                break;
> >>>> -                        if (!folio_trylock(folio))
> >>>> -                                break;
> >>>> -                        folio_get(folio);
> >>>> -                        arch_leave_lazy_mmu_mode();
> >>>> -                        pte_unmap_unlock(start_pte, ptl);
> >>>> -                        start_pte = NULL;
> >>>> -                        err = split_folio(folio);
> >>>> -                        folio_unlock(folio);
> >>>> -                        folio_put(folio);
> >>>> -                        if (err)
> >>>> -                                break;
> >>>> -                        start_pte = pte =
> >>>> -                                pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>> -                        if (!start_pte)
> >>>> -                                break;
> >>>> -                        arch_enter_lazy_mmu_mode();
> >>>> -                        pte--;
> >>>> -                        addr -= PAGE_SIZE;
> >>>> -                        continue;
> >>>> +                        const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >>>> +                                FPB_IGNORE_SOFT_DIRTY;
> >>>> +                        int max_nr = (end - addr) / PAGE_SIZE;
> >>>> +
> >>>> +                        nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>> +                                             fpb_flags, NULL);
> >>>
> >>> I wonder if we have a quick way to avoid folio_pte_batch() if users
> >>> are doing madvise() on a portion of a large folio.
> >>
> >> Good idea. Something like this?:
> >>
> >>     if (pte_pfn(ptent) == folio_pfn(folio))
> >>         nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>                              fpb_flags, NULL);
> >>
> >> If we are not mapping the first page of the folio, then it can't be a full
> >> mapping, so no need to call folio_pte_batch(). Just split it.
> >
> > if (folio_test_large(folio)) {
> > [...]
> >         nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >                              fpb_flags, NULL);
> > +       if (folio_estimated_sharers(folio) > 1)
> > +               continue;
> >
> > Could we use folio_estimated_sharers as an early exit point here?
>
> I'm not sure what this is saving where you have it? Did you mean to put it
> before folio_pte_batch()? Currently it is just saving a single conditional.

Apologies for the confusion. I made a diff to provide clarity.

diff --git a/mm/madvise.c b/mm/madvise.c
index 56c7ba7bd558..c3458fdea82a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -462,12 +462,11 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
                         nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
                                              fpb_flags, NULL);
-                        // Could we use folio_estimated_sharers as an early exit point here?
+                        if (folio_estimated_sharers(folio) > 1)
+                                continue;
                         if (nr < folio_nr_pages(folio)) {
                                 int err;

-                                if (folio_estimated_sharers(folio) > 1)
-                                        continue;
                                 if (pageout_anon_only_filter && !folio_test_anon(folio))
                                         continue;
                                 if (!folio_trylock(folio))

> But now that I think about it a bit more, I remember why I was originally
> unconditionally calling folio_pte_batch(). Given it's a large folio, if the split
> fails, we can move the cursor to the pte where the next folio begins, so we don't
> have to iterate through one pte at a time, which would cause us to keep calling
> folio_estimated_sharers(), folio_test_anon(), etc. on the same folio until we get
> to the next boundary.
>
> Of course the common case at this point will be for the split to succeed, but
> then we are going to iterate over every single PTE anyway - one way or another
> they are all fetched into cache. So I feel like it's neater not to add the
> conditionals for calling folio_pte_batch(), and just leave this as I have it here.
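To see why the unconditional folio_pte_batch() pays off on split failure,
here is a toy userspace model (pte_batch() below is a hypothetical stand-in,
not the kernel API): the cursor jumps straight to the next folio boundary
instead of revisiting each pte of the same folio.

#include <stdio.h>

/* Each toy "pte" records which folio its page belongs to. */
struct toy_pte { int folio; };

/*
 * Hypothetical stand-in for folio_pte_batch(): count how many
 * consecutive ptes starting at i map pages of the same folio.
 */
static int pte_batch(const struct toy_pte *ptes, int i, int max_nr)
{
        int nr = 1;

        while (nr < max_nr && ptes[i + nr].folio == ptes[i].folio)
                nr++;
        return nr;
}

int main(void)
{
        /* Two 4-page folios followed by two order-0 folios. */
        struct toy_pte ptes[] = { {1},{1},{1},{1}, {2},{2},{2},{2}, {3}, {4} };
        int n = (int)(sizeof(ptes) / sizeof(ptes[0]));
        int i, nr;

        for (i = 0; i < n; i += nr) {
                nr = pte_batch(ptes, i, n - i);
                /*
                 * If a split fails, the kernel keeps this nr, so the next
                 * outer iteration lands on the next folio boundary instead
                 * of re-checking the same folio once per pte.
                 */
                printf("pte %2d: folio %d, batch %d -> next pte %d\n",
                       i, ptes[i].folio, nr, i + nr);
        }
        return 0;
}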
> >
> > if (nr < folio_nr_pages(folio)) {
> >         int err;
> >
> > -       if (folio_estimated_sharers(folio) > 1)
> > -               continue;
> > [...]
>
> >>
> >>>
> >>>> +
> >>>> +                        if (nr < folio_nr_pages(folio)) {
> >>>> +                                int err;
> >>>> +
> >>>> +                                if (folio_estimated_sharers(folio) > 1)
> >>>> +                                        continue;
> >>>> +                                if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> +                                        continue;
> >>>> +                                if (!folio_trylock(folio))
> >>>> +                                        continue;
> >>>> +                                folio_get(folio);
> >>>> +                                arch_leave_lazy_mmu_mode();
> >>>> +                                pte_unmap_unlock(start_pte, ptl);
> >>>> +                                start_pte = NULL;
> >>>> +                                err = split_folio(folio);
> >>>> +                                folio_unlock(folio);
> >>>> +                                folio_put(folio);
> >>>> +                                if (err)
> >>>> +                                        continue;
> >>>> +                                start_pte = pte =
> >>>> +                                        pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>> +                                if (!start_pte)
> >>>> +                                        break;
> >>>> +                                arch_enter_lazy_mmu_mode();
> >>>> +                                nr = 0;
> >>>> +                                continue;
> >>>> +                        }
> >>>>                  }
> >>>>
> >>>>                  /*
> >>>>                   * Do not interfere with other mappings of this folio and
> >>>> -                 * non-LRU folio.
> >>>> +                 * non-LRU folio. If we have a large folio at this point, we
> >>>> +                 * know it is fully mapped so if its mapcount is the same as its
> >>>> +                 * number of pages, it must be exclusive.
> >>>>                   */
> >>>> -                if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >>>> +                if (!folio_test_lru(folio) ||
> >>>> +                    folio_mapcount(folio) != folio_nr_pages(folio))
> >>>>                          continue;
> >>>
> >>> This looks so perfect and is exactly what I wanted to achieve.
> >>>
> >>>>
> >>>>                  if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>                          continue;
> >>>>
> >>>> -                VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>> -
> >>>> -                if (!pageout && pte_young(ptent)) {
> >>>> -                        ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>> -                                                        tlb->fullmm);
> >>>> -                        ptent = pte_mkold(ptent);
> >>>> -                        set_pte_at(mm, addr, pte, ptent);
> >>>> -                        tlb_remove_tlb_entry(tlb, pte, addr);
> >>>> +                if (!pageout) {
> >>>> +                        for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>> +                                if (ptep_test_and_clear_young(vma, addr, pte))
> >>>> +                                        tlb_remove_tlb_entry(tlb, pte, addr);
> >
> > IIRC, some architectures (e.g., PPC) don't update the TLB with set_pte_at()
> > and tlb_remove_tlb_entry(). So, shouldn't we consider remapping the PTE as
> > old after clearing it?
>
> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
> there is a good reason to do the original clear-mkold-set for some arches?

IIRC, some architectures (e.g., PPC) don't update the TLB with
ptep_test_and_clear_young() and tlb_remove_tlb_entry(). In my new patch[1],
I use refresh_full_ptes() and tlb_remove_tlb_entries() to batch-update the
access and dirty bits.

[1] https://lore.kernel.org/linux-mm/20240316102952.39233-1-ioworker0@xxxxxxxxx

Thanks,
Lance

> >
> > Thanks,
> > Lance
> >
> >>>> +                        }
> >>>
> >>> This looks so smart. If it is not pageout, we have increased pte
> >>> and addr here, so nr is 0 and we don't need to increase again in
> >>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE).
> >>>
> >>> Otherwise, nr won't be 0, so we will increase addr and
> >>> pte by nr.
> >>
> >> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >> madvise_free_pte_range().
> >>
> >>>
> >>>>                  }
> >>>>
> >>>>                  /*
> >>>> --
> >>>> 2.25.1
> >>>>
> >>>
> >>> Overall, LGTM,
> >>>
> >>> Reviewed-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>
> >> Thanks!
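For anyone following Barry's point about the loop increments, a toy model of
the same pattern (clear_young() is a stand-in for ptep_test_and_clear_young();
nothing here is kernel code, and the fixed batch size is made up): the inner
loop consumes the batch and leaves nr == 0, so the outer increment is a no-op
and no pte is visited twice.

#include <stdio.h>

#define NPTES 8

/*
 * Toy stand-in for ptep_test_and_clear_young(): returns the old
 * "young" state and clears it.
 */
static int clear_young(int *young, int i)
{
        int was_young = young[i];

        young[i] = 0;
        return was_young;
}

int main(void)
{
        int young[NPTES] = { 1, 1, 0, 1, 1, 1, 1, 0 };
        int i, nr;

        /* Outer loop advances by nr, like the patched for (;;) above. */
        for (i = 0; i < NPTES; i += nr) {
                nr = 4;                 /* pretend a 4-pte batch starts here */
                if (nr > NPTES - i)
                        nr = NPTES - i;

                /*
                 * Like the !pageout path: the inner loop walks the batch
                 * and advances the cursor itself, leaving nr == 0, so the
                 * outer "i += nr" adds nothing.
                 */
                for (; nr != 0; nr--, i++) {
                        if (clear_young(young, i))
                                printf("cleared young at pte %d\n", i);
                }
        }
        return 0;
}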