On Thu, Mar 21, 2024 at 1:38 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 20/03/2024 14:35, Lance Yang wrote:
> > On Wed, Mar 20, 2024 at 9:49 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>
> >> Hi Lance, Barry,
> >>
> >> Sorry - I totally missed this when you originally sent it!
> >
> > No worries at all :)
> >
> >>
> >>
> >> On 13/03/2024 14:02, Lance Yang wrote:
> >>> On Wed, Mar 13, 2024 at 5:03 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>>
> >>>> On 13/03/2024 07:19, Barry Song wrote:
> >>>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>>>>
> >>>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >>>>>> folio that is fully and contiguously mapped in the pageout/cold vm
> >>>>>> range. This change means that large folios will be maintained all the
> >>>>>> way to swap storage. This both improves performance during swap-out, by
> >>>>>> eliding the cost of splitting the folio, and sets us up nicely for
> >>>>>> maintaining the large folio when it is swapped back in (to be covered in
> >>>>>> a separate series).
> >>>>>>
> >>>>>> Folios that are not fully mapped in the target range are still split,
> >>>>>> but note that behavior is changed so that if the split fails for any
> >>>>>> reason (folio locked, shared, etc.) we now leave it as is and move to the
> >>>>>> next pte in the range and continue work on the subsequent folios.
> >>>>>> Previously any failure of this sort would cause the entire operation to
> >>>>>> give up and no folios mapped at higher addresses were paged out or made
> >>>>>> cold. Given large folios are becoming more common, this old behavior
> >>>>>> would likely have led to wasted opportunities.
> >>>>>>
> >>>>>> While we are at it, change the code that clears young from the ptes to
> >>>>>> use ptep_test_and_clear_young(), which is more efficient than
> >>>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >>>>>> where the old approach would require unfolding/refolding and the new
> >>>>>> approach can be done in place.
> >>>>>>
> >>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
> >>>>>
> >>>>> This looks so much better than our initial RFC.
> >>>>> Thank you for your excellent work!
> >>>>
> >>>> Thanks - it's a team effort - I had your PoC and David's previous batching work
> >>>> to use as a template.
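
As a quick aside before the diff below: here is a minimal, illustrative sketch
of the two ways of clearing the young bit that the commit message above
compares. It is not the actual patch hunk; the sketch_* helper names are made
up, and only the pte accessors are the standard kernel ones.

#include <linux/mm.h>
#include <asm/tlb.h>

/*
 * Old pattern: full read-modify-write of the pte. On arm64 a contpte mapping
 * is unfolded by the clear and refolded by the set.
 */
static inline void sketch_clear_young_rmw(struct mmu_gather *tlb,
                                          struct mm_struct *mm,
                                          unsigned long addr, pte_t *pte)
{
        pte_t ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);

        ptent = pte_mkold(ptent);
        set_pte_at(mm, addr, pte, ptent);
        tlb_remove_tlb_entry(tlb, pte, addr);
}

/*
 * New pattern: clear only the young bit in place; no unfold/refold is needed
 * and the TLB entry is still queued for batched invalidation.
 */
static inline void sketch_clear_young_in_place(struct mmu_gather *tlb,
                                               struct vm_area_struct *vma,
                                               unsigned long addr, pte_t *pte)
{
        if (ptep_test_and_clear_young(vma, addr, pte))
                tlb_remove_tlb_entry(tlb, pte, addr);
}
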
> >>>>
> >>>>>
> >>>>>> ---
> >>>>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >>>>>>  1 file changed, 51 insertions(+), 38 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>>>> index 547dcd1f7a39..56c7ba7bd558 100644
> >>>>>> --- a/mm/madvise.c
> >>>>>> +++ b/mm/madvise.c
> >>>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>>         LIST_HEAD(folio_list);
> >>>>>>         bool pageout_anon_only_filter;
> >>>>>>         unsigned int batch_count = 0;
> >>>>>> +       int nr;
> >>>>>>
> >>>>>>         if (fatal_signal_pending(current))
> >>>>>>                 return -EINTR;
> >>>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>>                 return 0;
> >>>>>>         flush_tlb_batched_pending(mm);
> >>>>>>         arch_enter_lazy_mmu_mode();
> >>>>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
> >>>>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >>>>>> +               nr = 1;
> >>>>>>                 ptent = ptep_get(pte);
> >>>>>>
> >>>>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> >>>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>>                         continue;
> >>>>>>
> >>>>>>                 /*
> >>>>>> -                * Creating a THP page is expensive so split it only if we
> >>>>>> -                * are sure it's worth. Split it if we are only owner.
> >>>>>> +                * If we encounter a large folio, only split it if it is not
> >>>>>> +                * fully mapped within the range we are operating on. Otherwise
> >>>>>> +                * leave it as is so that it can be swapped out whole. If we
> >>>>>> +                * fail to split a folio, leave it in place and advance to the
> >>>>>> +                * next pte in the range.
> >>>>>>                  */
> >>>>>>                 if (folio_test_large(folio)) {
> >>>>>> -                       int err;
> >>>>>> -
> >>>>>> -                       if (folio_estimated_sharers(folio) > 1)
> >>>>>> -                               break;
> >>>>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>> -                               break;
> >>>>>> -                       if (!folio_trylock(folio))
> >>>>>> -                               break;
> >>>>>> -                       folio_get(folio);
> >>>>>> -                       arch_leave_lazy_mmu_mode();
> >>>>>> -                       pte_unmap_unlock(start_pte, ptl);
> >>>>>> -                       start_pte = NULL;
> >>>>>> -                       err = split_folio(folio);
> >>>>>> -                       folio_unlock(folio);
> >>>>>> -                       folio_put(folio);
> >>>>>> -                       if (err)
> >>>>>> -                               break;
> >>>>>> -                       start_pte = pte =
> >>>>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>>>> -                       if (!start_pte)
> >>>>>> -                               break;
> >>>>>> -                       arch_enter_lazy_mmu_mode();
> >>>>>> -                       pte--;
> >>>>>> -                       addr -= PAGE_SIZE;
> >>>>>> -                       continue;
> >>>>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >>>>>> +                                               FPB_IGNORE_SOFT_DIRTY;
> >>>>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
> >>>>>> +
> >>>>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>>>> +                                            fpb_flags, NULL);
> >>>>>
> >>>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
> >>>>> are doing madvise() on a portion of a large folio.
> >>>>
> >>>> Good idea. Something like this?:
> >>>>
> >>>> if (pte_pfn(pte) == folio_pfn(folio)
> >>>>         nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>>                              fpb_flags, NULL);
> >>>>
> >>>> If we are not mapping the first page of the folio, then it can't be a full
> >>>> mapping, so no need to call folio_pte_batch(). Just split it.
> >>>
> >>> if (folio_test_large(folio)) {
> >>> [...]
> >>>         nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>                              fpb_flags, NULL);
> >>> +       if (folio_estimated_sharers(folio) > 1)
> >>> +               continue;
> >>>
> >>> Could we use folio_estimated_sharers as an early exit point here?
> >>
> >> I'm not sure what this is saving where you have it? Did you mean to put it
> >> before folio_pte_batch()? Currently it is just saving a single conditional.
> >
> > Apologies for the confusion. I made a diff to provide clarity.
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 56c7ba7bd558..c3458fdea82a 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -462,12 +462,11 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >
> >                         nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >                                              fpb_flags, NULL);
> > -
> >                         // Could we use folio_estimated_sharers as an early exit point here?
> > +                       if (folio_estimated_sharers(folio) > 1)
> > +                               continue;
> >                         if (nr < folio_nr_pages(folio)) {
> >                                 int err;
> >
> > -                               if (folio_estimated_sharers(folio) > 1)
> > -                                       continue;
> >                                 if (pageout_anon_only_filter &&
> >                                     !folio_test_anon(folio))
> >                                         continue;
> >                                 if (!folio_trylock(folio))
> >
> I'm still not really getting it; with my code, if nr < the folio size, we will
> try to split, and if we estimate that the folio is not exclusive we will avoid
> locking the folio, etc. If nr == folio size, we will proceed to the precise
> exclusivity check (which is cheap once we know the folio is fully mapped by this
> process).
>
> With your change, we will always do the estimated exclusive check then proceed
> to the precise check; seems like duplication to me?

Agreed. The estimated exclusive check is indeed redundant with my change.

>
> >
> >>
> >> But now that I think about it a bit more, I remember why I was originally
> >> unconditionally calling folio_pte_batch(). Given it's a large folio, if the split
> >> fails, we can move the cursor to the pte where the next folio begins so we don't
> >> have to iterate through one pte at a time, which would cause us to keep calling
> >> folio_estimated_sharers(), folio_test_anon(), etc. on the same folio until we get
> >> to the next boundary.
> >>
> >> Of course the common case at this point will be for the split to succeed, but
> >> then we are going to iterate over every single PTE anyway - one way or another
> >> they are all fetched into cache. So I feel like it's neater not to add the
> >> conditionals for calling folio_pte_batch(), and just leave this as I have it here.
> >>
> >>>
> >>> if (nr < folio_nr_pages(folio)) {
> >>>         int err;
> >>>
> >>> -       if (folio_estimated_sharers(folio) > 1)
> >>> -               continue;
> >>> [...]
> >>>
> >>>>
> >>>>>
> >>>>>> +
> >>>>>> +                       if (nr < folio_nr_pages(folio)) {
> >>>>>> +                               int err;
> >>>>>> +
> >>>>>> +                               if (folio_estimated_sharers(folio) > 1)
> >>>>>> +                                       continue;
> >>>>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>> +                                       continue;
> >>>>>> +                               if (!folio_trylock(folio))
> >>>>>> +                                       continue;
> >>>>>> +                               folio_get(folio);
> >>>>>> +                               arch_leave_lazy_mmu_mode();
> >>>>>> +                               pte_unmap_unlock(start_pte, ptl);
> >>>>>> +                               start_pte = NULL;
> >>>>>> +                               err = split_folio(folio);
> >>>>>> +                               folio_unlock(folio);
> >>>>>> +                               folio_put(folio);
> >>>>>> +                               if (err)
> >>>>>> +                                       continue;
> >>>>>> +                               start_pte = pte =
> >>>>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>>>> +                               if (!start_pte)
> >>>>>> +                                       break;
> >>>>>> +                               arch_enter_lazy_mmu_mode();
> >>>>>> +                               nr = 0;
> >>>>>> +                               continue;
> >>>>>> +                       }
> >>>>>>                 }
> >>>>>>
> >>>>>>                 /*
> >>>>>>                  * Do not interfere with other mappings of this folio and
> >>>>>> -                * non-LRU folio.
> >>>>>> +                * non-LRU folio. If we have a large folio at this point, we
> >>>>>> +                * know it is fully mapped so if its mapcount is the same as its
> >>>>>> +                * number of pages, it must be exclusive.
> >>>>>>                  */
> >>>>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >>>>>> +               if (!folio_test_lru(folio) ||
> >>>>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
> >>>>>>                         continue;
> >>>>>
> >>>>> This looks so perfect and is exactly what I wanted to achieve.
> >>>>>
> >>>>>>
> >>>>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>>                         continue;
> >>>>>>
> >>>>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>>>> -
> >>>>>> -               if (!pageout && pte_young(ptent)) {
> >>>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>>>> -                                                       tlb->fullmm);
> >>>>>> -                       ptent = pte_mkold(ptent);
> >>>>>> -                       set_pte_at(mm, addr, pte, ptent);
> >>>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>> +               if (!pageout) {
> >>>>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
> >>>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>
> >>> IIRC, some architectures (e.g., PPC) don't update the TLB with set_pte_at and
> >>> tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
> >>> pte clearing?
> >>
> >> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
> >> there is a good reason to do the original clear-mkold-set for some arches?
> >
> > IIRC, some architectures (e.g., PPC) don't update the TLB with
> > ptep_test_and_clear_young()
> > and tlb_remove_tlb_entry().
>
> Err, I assumed tlb_remove_tlb_entry() meant "invalidate the TLB entry for this
> address please" - albeit it's deferred and batched. I'll look into this.
>
> >
> > In my new patch[1], I use refresh_full_ptes() and
> > tlb_remove_tlb_entries() to batch-update the
> > access and dirty bits.
>
> I want to avoid the per-pte clear-modify-set approach, because this doesn't
> perform well on arm64 when using contpte mappings; it will cause the contpte
> mapping to be unfolded by the first clear that touches the contpte block, then
> refolded by the last set to touch the block. That's expensive.
> ptep_test_and_clear_young() doesn't suffer that problem.

Thanks for explaining. I got it.

I think that other architectures will benefit from the per-pte
clear-modify-set approach. IMO, refresh_full_ptes() can be overridden by
arm64.

Thanks,
Lance

>
> >
> > [1] https://lore.kernel.org/linux-mm/20240316102952.39233-1-ioworker0@xxxxxxxxx
> >
> > Thanks,
> > Lance
> >
> >>
> >>>
> >>> Thanks,
> >>> Lance
> >>>
> >>>
> >>>
> >>>>>> +                       }
> >>>>>
> >>>>> This looks so smart. If it is not pageout, we have increased pte
> >>>>> and addr here; so nr is 0 and we don't need to increase again in
> >>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>>>
> >>>>> Otherwise, nr won't be 0, so we will increase addr and
> >>>>> pte by nr.
> >>>>
> >>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >>>> madvise_free_pte_range().
> >>>>
> >>>>
> >>>>>
> >>>>>
> >>>>>>                 }
> >>>>>>
> >>>>>>                 /*
> >>>>>> --
> >>>>>> 2.25.1
> >>>>>>
> >>>>>
> >>>>> Overall, LGTM,
> >>>>>
> >>>>> Reviewed-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>>>
> >>>> Thanks!
> >>>>
> >>>>
> >>
>
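
P.S. For completeness, a rough, hypothetical sketch of what a generic batched
young-bit clear over nr ptes could look like. The name and signature below are
made up (this is not refresh_full_ptes() from [1], whose interface may
differ); it only shows the shape of a helper that an architecture such as
arm64 could override with a contpte-aware implementation.

#include <linux/mm.h>

/*
 * Hypothetical sketch only: clear the young bit across nr consecutive ptes
 * that map the same folio and report whether any of them was young. A generic
 * implementation just loops over ptep_test_and_clear_young(); an architecture
 * could batch this to avoid per-pte unfold/refold of contiguous mappings.
 */
static inline bool sketch_mkold_ptes(struct vm_area_struct *vma,
                                     unsigned long addr, pte_t *ptep,
                                     unsigned int nr)
{
        bool any_young = false;

        for (; nr != 0; nr--, ptep++, addr += PAGE_SIZE) {
                if (ptep_test_and_clear_young(vma, addr, ptep))
                        any_young = true;
        }

        return any_young;
}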