On Wed, Mar 13, 2024 at 7:08 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 13/03/2024 10:37, Barry Song wrote:
> > On Wed, Mar 13, 2024 at 10:36 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>
> >> On 13/03/2024 09:16, Barry Song wrote:
> >>> On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>>
> >>>> On 13/03/2024 07:19, Barry Song wrote:
> >>>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>>>>
> >>>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >>>>>> folio that is fully and contiguously mapped in the pageout/cold vm
> >>>>>> range. This change means that large folios will be maintained all the
> >>>>>> way to swap storage. This both improves performance during swap-out, by
> >>>>>> eliding the cost of splitting the folio, and sets us up nicely for
> >>>>>> maintaining the large folio when it is swapped back in (to be covered in
> >>>>>> a separate series).
> >>>>>>
> >>>>>> Folios that are not fully mapped in the target range are still split,
> >>>>>> but note that the behavior is changed so that if the split fails for any
> >>>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
> >>>>>> next pte in the range and continue work on the subsequent folios.
> >>>>>> Previously any failure of this sort would cause the entire operation to
> >>>>>> give up and no folios mapped at higher addresses were paged out or made
> >>>>>> cold. Given large folios are becoming more common, this old behavior
> >>>>>> would likely have led to wasted opportunities.
> >>>>>>
> >>>>>> While we are at it, change the code that clears young from the ptes to
> >>>>>> use ptep_test_and_clear_young(), which is more efficient than
> >>>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >>>>>> where the old approach would require unfolding/refolding and the new
> >>>>>> approach can be done in place.
> >>>>>>
> >>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
> >>>>>
> >>>>> This looks so much better than our initial RFC.
> >>>>> Thank you for your excellent work!
> >>>>
> >>>> Thanks - it's a team effort - I had your PoC and David's previous batching work
> >>>> to use as a template.
> >>>>
> >>>>>
> >>>>>> ---
> >>>>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >>>>>>  1 file changed, 51 insertions(+), 38 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>>>> index 547dcd1f7a39..56c7ba7bd558 100644
> >>>>>> --- a/mm/madvise.c
> >>>>>> +++ b/mm/madvise.c
> >>>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>>         LIST_HEAD(folio_list);
> >>>>>>         bool pageout_anon_only_filter;
> >>>>>>         unsigned int batch_count = 0;
> >>>>>> +       int nr;
> >>>>>>
> >>>>>>         if (fatal_signal_pending(current))
> >>>>>>                 return -EINTR;
> >>>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>>                 return 0;
> >>>>>>         flush_tlb_batched_pending(mm);
> >>>>>>         arch_enter_lazy_mmu_mode();
> >>>>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
> >>>>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >>>>>> +               nr = 1;
> >>>>>>                 ptent = ptep_get(pte);
> >>>>>>
> >>>>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> >>>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>>                         continue;
> >>>>>>
> >>>>>>                 /*
> >>>>>> -                * Creating a THP page is expensive so split it only if we
> >>>>>> -                * are sure it's worth. Split it if we are only owner.
> >>>>>> +                * If we encounter a large folio, only split it if it is not
> >>>>>> +                * fully mapped within the range we are operating on. Otherwise
> >>>>>> +                * leave it as is so that it can be swapped out whole. If we
> >>>>>> +                * fail to split a folio, leave it in place and advance to the
> >>>>>> +                * next pte in the range.
> >>>>>>                  */
> >>>>>>                 if (folio_test_large(folio)) {
> >>>>>> -                       int err;
> >>>>>> -
> >>>>>> -                       if (folio_estimated_sharers(folio) > 1)
> >>>>>> -                               break;
> >>>>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>> -                               break;
> >>>>>> -                       if (!folio_trylock(folio))
> >>>>>> -                               break;
> >>>>>> -                       folio_get(folio);
> >>>>>> -                       arch_leave_lazy_mmu_mode();
> >>>>>> -                       pte_unmap_unlock(start_pte, ptl);
> >>>>>> -                       start_pte = NULL;
> >>>>>> -                       err = split_folio(folio);
> >>>>>> -                       folio_unlock(folio);
> >>>>>> -                       folio_put(folio);
> >>>>>> -                       if (err)
> >>>>>> -                               break;
> >>>>>> -                       start_pte = pte =
> >>>>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>>>> -                       if (!start_pte)
> >>>>>> -                               break;
> >>>>>> -                       arch_enter_lazy_mmu_mode();
> >>>>>> -                       pte--;
> >>>>>> -                       addr -= PAGE_SIZE;
> >>>>>> -                       continue;
> >>>>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >>>>>> +                                               FPB_IGNORE_SOFT_DIRTY;
> >>>>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
> >>>>>> +
> >>>>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>>>> +                                            fpb_flags, NULL);
> >>>>>
> >>>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
> >>>>> are doing madvise() on a portion of a large folio.
> >>>>
> >>>> Good idea. Something like this?:
> >>>>
> >>>> if (pte_pfn(pte) == folio_pfn(folio)
> >>>
> >>> what about
> >>>
> >>> "If (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"
> >>>
> >>> just to account for cases where the user's end address falls within
> >>> the middle of a large folio?
> >>
> >> yes, even better. I'll add this for the next version.
> >>
> >>>
> >>>
> >>> BTW, another minor issue is here:
> >>>
> >>>         if (++batch_count == SWAP_CLUSTER_MAX) {
> >>>                 batch_count = 0;
> >>>                 if (need_resched()) {
> >>>                         arch_leave_lazy_mmu_mode();
> >>>                         pte_unmap_unlock(start_pte, ptl);
> >>>                         cond_resched();
> >>>                         goto restart;
> >>>                 }
> >>>         }
> >>>
> >>> We are only increasing batch_count by 1 for nr ptes, thus we are holding
> >>> the PTL longer than in the small folios case? We used to increase it by 1
> >>> for each PTE. Does it matter?
> >>
> >> I thought about that, but the vast majority of the work is per-folio, not
> >> per-pte. So I concluded it would be best to continue to increment per-folio.
> >
> > Okay. The original patch commit b2f557a21bc8 ("mm/madvise: add
> > cond_resched() in madvise_cold_or_pageout_pte_range()")
> > primarily addressed the real-time wake-up latency issue. MADV_PAGEOUT
> > and MADV_COLD are much less critical compared
> > to other scenarios where operations like do_anon_page or do_swap_page
> > genuinely need the PTL to make progress. Therefore, adopting
> > an approach that relatively aggressively releases the PTL seems to
> > neither harm MADV_PAGEOUT/COLD nor disadvantage
> > others.
> >
> > We are slightly increasing the duration of holding the PTL due to the
> > iteration of folio_pte_batch() potentially taking longer than in
> > the case of small folios, which do not require it.
>
> If we can't scan all the PTEs in a page table without dropping the PTL
> intermittently we have bigger problems. This all works perfectly fine in all the
> other PTE iterators; see zap_pte_range() for example.

I've no doubt about folio_pte_batch(); I was just talking about the original
RT issue it might affect.
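To make the early-check idea above concrete, it might end up looking roughly
like the sketch below. This is only illustrative, not the next version of the
patch: nr_pages is assumed to be folio_nr_pages(folio) cached in a local, and
the pfn is taken from ptent (the value read by ptep_get()) since pte_pfn()
takes a pte value rather than the pte pointer.

		if (folio_test_large(folio)) {
			const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
						FPB_IGNORE_SOFT_DIRTY;
			int max_nr = (end - addr) / PAGE_SIZE;
			int nr_pages = folio_nr_pages(folio);

			/*
			 * Only batch-scan the ptes if the first pte maps the
			 * first page of the folio and the whole folio can fit
			 * in the remainder of the range; otherwise it cannot
			 * be a full mapping, so leave nr == 1 and fall
			 * through to the split path.
			 */
			if (pte_pfn(ptent) == folio_pfn(folio) &&
			    max_nr >= nr_pages)
				nr = folio_pte_batch(folio, addr, pte, ptent,
						     max_nr, fpb_flags, NULL);

			if (nr < nr_pages) {
				/* split attempt, unchanged from the patch */
			}
		}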
>
> > However, compared
> > to operations like folio_isolate_lru() and folio_deactivate(),
> > this increase seems negligible. Recently, we have actually removed
> > ptep_test_and_clear_young() for MADV_PAGEOUT,
> > which should also benefit real-time scenarios. Nonetheless, there is a
> > small risk with large folios, such as 1 MiB mTHP, where
> > we may need to loop 256 times in folio_pte_batch().
>
> As I understand it, RT and THP are mutually exclusive. RT can't handle the extra
> latencies THPs can cause in allocation path, etc. So I don't think you will see
> a problem here.

I was actually taking a different approach on phones, as we obviously have
some UX (user experience)/UI/audio-related tasks which cannot tolerate
allocation latency. With a TAO-like optimization (we did it with an
ext_migratetype for some pageblocks), we don't push the buddy allocator to do
compaction or reclamation to form a 64KiB folio. We immediately fall back to
small folios if a zero-latency 64KiB allocation can't be obtained from those
pools - the ext_migratetype pageblocks.

>
> >
> > I would vote for increasing 'nr' or maybe max(log2(nr), 1) rather than
> > 1 for two reasons:
> >
> > 1. We are not making MADV_PAGEOUT/COLD worse; in fact, we are
> > improving them by reducing the time taken to put the same
> > number of pages into the reclaim list.
> >
> > 2. MADV_PAGEOUT/COLD scenarios are not urgent compared to others that
> > genuinely require the PTL to progress. Moreover,
> > the majority of time spent on PAGEOUT is actually reclaim_pages().
>
> I understand your logic. But I'd rather optimize for fewer lock acquisitions for
> the !RT+THP case, since RT+THP is not supported.

Fair enough. I agree we can postpone this until RT and THP become a supported
combination. For now, keeping this patch simpler seems better.

>
> >
> >>>
> >>>> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>>                      fpb_flags, NULL);
> >>>>
> >>>> If we are not mapping the first page of the folio, then it can't be a full
> >>>> mapping, so no need to call folio_pte_batch(). Just split it.
> >>>>
> >>>>>
> >>>>>> +
> >>>>>> +                       if (nr < folio_nr_pages(folio)) {
> >>>>>> +                               int err;
> >>>>>> +
> >>>>>> +                               if (folio_estimated_sharers(folio) > 1)
> >>>>>> +                                       continue;
> >>>>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>> +                                       continue;
> >>>>>> +                               if (!folio_trylock(folio))
> >>>>>> +                                       continue;
> >>>>>> +                               folio_get(folio);
> >>>>>> +                               arch_leave_lazy_mmu_mode();
> >>>>>> +                               pte_unmap_unlock(start_pte, ptl);
> >>>>>> +                               start_pte = NULL;
> >>>>>> +                               err = split_folio(folio);
> >>>>>> +                               folio_unlock(folio);
> >>>>>> +                               folio_put(folio);
> >>>>>> +                               if (err)
> >>>>>> +                                       continue;
> >>>>>> +                               start_pte = pte =
> >>>>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>>>> +                               if (!start_pte)
> >>>>>> +                                       break;
> >>>>>> +                               arch_enter_lazy_mmu_mode();
> >>>>>> +                               nr = 0;
> >>>>>> +                               continue;
> >>>>>> +                       }
> >>>>>>                 }
> >>>>>>
> >>>>>>                 /*
> >>>>>>                  * Do not interfere with other mappings of this folio and
> >>>>>> -                * non-LRU folio.
> >>>>>> +                * non-LRU folio. If we have a large folio at this point, we
> >>>>>> +                * know it is fully mapped so if its mapcount is the same as its
> >>>>>> +                * number of pages, it must be exclusive.
> >>>>>>                  */
> >>>>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >>>>>> +               if (!folio_test_lru(folio) ||
> >>>>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
> >>>>>>                         continue;
> >>>>>
> >>>>> This looks so perfect and is exactly what I wanted to achieve.
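And just for the record, the postponed batch_count idea from earlier in the
thread might look roughly like the sketch below. This is only an illustration
of what was discussed, not something proposed for this patch: nr would need an
initial value, the crediting has to happen before nr is reset at the top of
the loop, and the comparison would become >= because nr can step past
SWAP_CLUSTER_MAX in one go.

	int nr = 1;	/* needs an initial value with this scheme */
	...
	for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
		/*
		 * Credit the resched batch with every pte the previous
		 * iteration consumed, rather than 1 per folio.
		 */
		batch_count += nr;
		nr = 1;
		ptent = ptep_get(pte);

		if (batch_count >= SWAP_CLUSTER_MAX) {
			batch_count = 0;
			if (need_resched()) {
				arch_leave_lazy_mmu_mode();
				pte_unmap_unlock(start_pte, ptl);
				cond_resched();
				goto restart;
			}
		}
		...
	}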
> >>>>>
> >>>>>>
> >>>>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>>                         continue;
> >>>>>>
> >>>>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>>>> -
> >>>>>> -               if (!pageout && pte_young(ptent)) {
> >>>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>>>> -                                                       tlb->fullmm);
> >>>>>> -                       ptent = pte_mkold(ptent);
> >>>>>> -                       set_pte_at(mm, addr, pte, ptent);
> >>>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>> +               if (!pageout) {
> >>>>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
> >>>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>> +                       }
> >>>>>
> >>>>> This looks so smart. If it is not pageout, we have increased pte
> >>>>> and addr here; so nr is 0 and we don't need to increase again in
> >>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>>>
> >>>>> Otherwise, nr won't be 0, so we will increase addr and
> >>>>> pte by nr.
> >>>>
> >>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >>>> madvise_free_pte_range().
> >>>>
> >>>>
> >>>>>
> >>>>>
> >>>>>>                 }
> >>>>>>
> >>>>>>                 /*
> >>>>>> --
> >>>>>> 2.25.1
> >>>>>>
> >>>>>
> >>>>> Overall, LGTM,
> >>>>>
> >>>>> Reviewed-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>>>
> >>>> Thanks!
> >>>>
> >
> > Thanks
> > Barry
>