On 18 Jul 2023, at 6:19, Ryan Roberts wrote:

> On 17/07/2023 17:15, Zi Yan wrote:
>> On 17 Jul 2023, at 11:55, Ryan Roberts wrote:
>>
>>> On 17/07/2023 16:25, Zi Yan wrote:
>>>> On 17 Jul 2023, at 10:31, Ryan Roberts wrote:
>>>>
>>>>> This allows batching the rmap removal with folio_remove_rmap_range(),
>>>>> which means we avoid spuriously adding a partially unmapped folio to the
>>>>> deferred split queue in the common case, which reduces split queue lock
>>>>> contention.
>>>>>
>>>>> Previously each page was removed from the rmap individually with
>>>>> page_remove_rmap(). If the first page belonged to a large folio, this
>>>>> would cause page_remove_rmap() to conclude that the folio was now
>>>>> partially mapped and add the folio to the deferred split queue. But
>>>>> subsequent calls would cause the folio to become fully unmapped, meaning
>>>>> there is no value in adding it to the split queue.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>>> ---
>>>>>  mm/memory.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 119 insertions(+)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 01f39e8144ef..6facb8c8807a 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -1391,6 +1391,95 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>>>>>  	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
>>>>>  }
>>>>>
>>>>> +static inline unsigned long page_addr(struct page *page,
>>>>> +				struct page *anchor, unsigned long anchor_addr)
>>>>> +{
>>>>> +	unsigned long offset;
>>>>> +	unsigned long addr;
>>>>> +
>>>>> +	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>> +	addr = anchor_addr + offset;
>>>>> +
>>>>> +	if (anchor > page) {
>>>>> +		if (addr > anchor_addr)
>>>>> +			return 0;
>>>>> +	} else {
>>>>> +		if (addr < anchor_addr)
>>>>> +			return ULONG_MAX;
>>>>> +	}
>>>>> +
>>>>> +	return addr;
>>>>> +}
>>>>> +
>>>>> +static int calc_anon_folio_map_pgcount(struct folio *folio,
>>>>> +				struct page *page, pte_t *pte,
>>>>> +				unsigned long addr, unsigned long end)
>>>>> +{
>>>>> +	pte_t ptent;
>>>>> +	int floops;
>>>>> +	int i;
>>>>> +	unsigned long pfn;
>>>>> +
>>>>> +	end = min(page_addr(&folio->page + folio_nr_pages(folio), page, addr),
>>>>> +		  end);
>>>>> +	floops = (end - addr) >> PAGE_SHIFT;
>>>>> +	pfn = page_to_pfn(page);
>>>>> +	pfn++;
>>>>> +	pte++;
>>>>> +
>>>>> +	for (i = 1; i < floops; i++) {
>>>>> +		ptent = ptep_get(pte);
>>>>> +
>>>>> +		if (!pte_present(ptent) ||
>>>>> +		    pte_pfn(ptent) != pfn) {
>>>>> +			return i;
>>>>> +		}
>>>>> +
>>>>> +		pfn++;
>>>>> +		pte++;
>>>>> +	}
>>>>> +
>>>>> +	return floops;
>>>>> +}
>>>>> +
>>>>> +static unsigned long zap_anon_pte_range(struct mmu_gather *tlb,
>>>>> +				struct vm_area_struct *vma,
>>>>> +				struct page *page, pte_t *pte,
>>>>> +				unsigned long addr, unsigned long end,
>>>>> +				bool *full_out)
>>>>> +{
>>>>> +	struct folio *folio = page_folio(page);
>>>>> +	struct mm_struct *mm = tlb->mm;
>>>>> +	pte_t ptent;
>>>>> +	int pgcount;
>>>>> +	int i;
>>>>> +	bool full;
>>>>> +
>>>>> +	pgcount = calc_anon_folio_map_pgcount(folio, page, pte, addr, end);
>>>>> +
>>>>> +	for (i = 0; i < pgcount;) {
>>>>> +		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>>>>> +		tlb_remove_tlb_entry(tlb, pte, addr);
>>>>> +		full = __tlb_remove_page(tlb, page, 0);
>>>>> +
>>>>> +		if (unlikely(page_mapcount(page) < 1))
>>>>> +			print_bad_pte(vma, addr, ptent, page);
>>>>> +
>>>>> +		i++;
>>>>> +		page++;
>>>>> +		pte++;
>>>>> +		addr += PAGE_SIZE;
>>>>> +
>>>>> +		if (unlikely(full))
>>>>> +			break;
>>>>> +	}
>>>>> +
>>>>> +	folio_remove_rmap_range(folio, page - i, i, vma);
>>>>> +
>>>>> +	*full_out = full;
>>>>> +	return i;
>>>>> +}
>>>>> +
>>>>>  static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>>>  				struct vm_area_struct *vma, pmd_t *pmd,
>>>>>  				unsigned long addr, unsigned long end,
>>>>> @@ -1428,6 +1517,36 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>>>  			page = vm_normal_page(vma, addr, ptent);
>>>>>  			if (unlikely(!should_zap_page(details, page)))
>>>>>  				continue;
>>>>> +
>>>>> +			/*
>>>>> +			 * Batch zap large anonymous folio mappings. This allows
>>>>> +			 * batching the rmap removal, which means we avoid
>>>>> +			 * spuriously adding a partially unmapped folio to the
>>>>> +			 * deferred split queue in the common case, which
>>>>> +			 * reduces split queue lock contention. Require the VMA
>>>>> +			 * to be anonymous to ensure that none of the PTEs in
>>>>> +			 * the range require zap_install_uffd_wp_if_needed().
>>>>> +			 */
>>>>> +			if (page && PageAnon(page) && vma_is_anonymous(vma)) {
>>>>> +				bool full;
>>>>> +				int pgcount;
>>>>> +
>>>>> +				pgcount = zap_anon_pte_range(tlb, vma,
>>>>> +						page, pte, addr, end, &full);
>>>>
>>>> Are you trying to zap as many ptes as possible if all these ptes are
>>>> within a folio?
>>>
>>> Yes.
>>>
>>>> If so, why not calculate end before calling zap_anon_pte_range()?
>>>> That would make zap_anon_pte_range() simpler.
>>>
>>> I'm not sure I follow. That's currently done in calc_anon_folio_map_pgcount().
>>> I could move it to here, but I'm not sure that makes things simpler, just puts
>>> more code in here and less in there?
>>
>> Otherwise your zap_anon_pte_range() is really zap_anon_pte_in_folio_range() or
>> some other more descriptive name. When I first looked at the name, I thought
>> PTEs would be zapped until the end. But that is not the case when I look at
>> the code. And future users could easily be confused too and use it in the
>> wrong way.
>
> OK I see your point. Let me pull the page count calculation into here and
> pass it to zap_anon_pte_range(). Then I think we can keep the name as is?

Yes. Thanks.

>
>> BTW, page_addr() needs a better name and is easily confused with the existing
>> page_address().
>
> Yeah... I'll try to think of something for v2.
>
>>>> Also check if page is part of
>>>> a large folio first to make sure you can batch.
>>>
>>> Yeah that's fair. I'd be inclined to put that in zap_anon_pte_range() to short
>>> circuit calc_anon_folio_map_pgcount(). But ultimately zap_anon_pte_range()
>>> would still zap the single pte.
>>>
>>>>> +
>>>>> +				rss[mm_counter(page)] -= pgcount;
>>>>> +				pgcount--;
>>>>> +				pte += pgcount;
>>>>> +				addr += pgcount << PAGE_SHIFT;
>>>>> +
>>>>> +				if (unlikely(full)) {
>>>>> +					force_flush = 1;
>>>>> +					addr += PAGE_SIZE;
>>>>> +					break;
>>>>> +				}
>>>>> +				continue;
>>>>> +			}
>>>>> +
>>>>>  			ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>>  							tlb->fullmm);
>>>>>  			tlb_remove_tlb_entry(tlb, pte, addr);
>>>>> --
>>>>> 2.25.1
>>>>
>>>> --
>>>> Best Regards,
>>>> Yan, Zi

--
Best Regards,
Yan, Zi