On 17/07/2023 17:15, Zi Yan wrote:
> On 17 Jul 2023, at 11:55, Ryan Roberts wrote:
>
>> On 17/07/2023 16:25, Zi Yan wrote:
>>> On 17 Jul 2023, at 10:31, Ryan Roberts wrote:
>>>
>>>> This allows batching the rmap removal with folio_remove_rmap_range(),
>>>> which means we avoid spuriously adding a partially unmapped folio to the
>>>> deferred split queue in the common case, which reduces split queue lock
>>>> contention.
>>>>
>>>> Previously each page was removed from the rmap individually with
>>>> page_remove_rmap(). If the first page belonged to a large folio, this
>>>> would cause page_remove_rmap() to conclude that the folio was now
>>>> partially mapped and add the folio to the deferred split queue. But
>>>> subsequent calls would cause the folio to become fully unmapped, meaning
>>>> there is no value in adding it to the split queue.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>> ---
>>>>  mm/memory.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 119 insertions(+)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 01f39e8144ef..6facb8c8807a 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -1391,6 +1391,95 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>>>>  	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
>>>>  }
>>>>
>>>> +static inline unsigned long page_addr(struct page *page,
>>>> +				struct page *anchor, unsigned long anchor_addr)
>>>> +{
>>>> +	unsigned long offset;
>>>> +	unsigned long addr;
>>>> +
>>>> +	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>> +	addr = anchor_addr + offset;
>>>> +
>>>> +	if (anchor > page) {
>>>> +		if (addr > anchor_addr)
>>>> +			return 0;
>>>> +	} else {
>>>> +		if (addr < anchor_addr)
>>>> +			return ULONG_MAX;
>>>> +	}
>>>> +
>>>> +	return addr;
>>>> +}
>>>> +
>>>> +static int calc_anon_folio_map_pgcount(struct folio *folio,
>>>> +				struct page *page, pte_t *pte,
>>>> +				unsigned long addr, unsigned long end)
>>>> +{
>>>> +	pte_t ptent;
>>>> +	int floops;
>>>> +	int i;
>>>> +	unsigned long pfn;
>>>> +
>>>> +	end = min(page_addr(&folio->page + folio_nr_pages(folio), page, addr),
>>>> +		  end);
>>>> +	floops = (end - addr) >> PAGE_SHIFT;
>>>> +	pfn = page_to_pfn(page);
>>>> +	pfn++;
>>>> +	pte++;
>>>> +
>>>> +	for (i = 1; i < floops; i++) {
>>>> +		ptent = ptep_get(pte);
>>>> +
>>>> +		if (!pte_present(ptent) ||
>>>> +		    pte_pfn(ptent) != pfn) {
>>>> +			return i;
>>>> +		}
>>>> +
>>>> +		pfn++;
>>>> +		pte++;
>>>> +	}
>>>> +
>>>> +	return floops;
>>>> +}
>>>> +
>>>> +static unsigned long zap_anon_pte_range(struct mmu_gather *tlb,
>>>> +				struct vm_area_struct *vma,
>>>> +				struct page *page, pte_t *pte,
>>>> +				unsigned long addr, unsigned long end,
>>>> +				bool *full_out)
>>>> +{
>>>> +	struct folio *folio = page_folio(page);
>>>> +	struct mm_struct *mm = tlb->mm;
>>>> +	pte_t ptent;
>>>> +	int pgcount;
>>>> +	int i;
>>>> +	bool full;
>>>> +
>>>> +	pgcount = calc_anon_folio_map_pgcount(folio, page, pte, addr, end);
>>>> +
>>>> +	for (i = 0; i < pgcount;) {
>>>> +		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>>>> +		tlb_remove_tlb_entry(tlb, pte, addr);
>>>> +		full = __tlb_remove_page(tlb, page, 0);
>>>> +
>>>> +		if (unlikely(page_mapcount(page) < 1))
>>>> +			print_bad_pte(vma, addr, ptent, page);
>>>> +
>>>> +		i++;
>>>> +		page++;
>>>> +		pte++;
>>>> +		addr += PAGE_SIZE;
>>>> +
>>>> +		if (unlikely(full))
>>>> +			break;
>>>> +	}
>>>> +
>>>> +	folio_remove_rmap_range(folio, page - i, i, vma);
>>>> +
>>>> +	*full_out = full;
>>>> +	return i;
>>>> +}
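
[ Aside, to make the problem statement in the commit message concrete: the
  existing per-page path does roughly the following (simplified illustration,
  not code from this patch):

	for (i = 0; i < nr; i++) {
		/* ... pte cleared and rss updated for each page ... */
		page_remove_rmap(page + i, vma, false);
	}

  On the first call the folio still has the other nr - 1 mappings, so
  page_remove_rmap() concludes it is partially mapped and adds it to the
  deferred split queue, even though the remaining iterations are about to
  unmap the rest of it. Batching the removal via folio_remove_rmap_range()
  at the end of zap_anon_pte_range() avoids that. ]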
>>>> +
>>>>  static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>>  				struct vm_area_struct *vma, pmd_t *pmd,
>>>>  				unsigned long addr, unsigned long end,
>>>> @@ -1428,6 +1517,36 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>>  			page = vm_normal_page(vma, addr, ptent);
>>>>  			if (unlikely(!should_zap_page(details, page)))
>>>>  				continue;
>>>> +
>>>> +			/*
>>>> +			 * Batch zap large anonymous folio mappings. This allows
>>>> +			 * batching the rmap removal, which means we avoid
>>>> +			 * spuriously adding a partially unmapped folio to the
>>>> +			 * deferred split queue in the common case, which
>>>> +			 * reduces split queue lock contention. Require the VMA
>>>> +			 * to be anonymous to ensure that none of the PTEs in
>>>> +			 * the range require zap_install_uffd_wp_if_needed().
>>>> +			 */
>>>> +			if (page && PageAnon(page) && vma_is_anonymous(vma)) {
>>>> +				bool full;
>>>> +				int pgcount;
>>>> +
>>>> +				pgcount = zap_anon_pte_range(tlb, vma,
>>>> +						page, pte, addr, end, &full);
>>>
>>> Are you trying to zap as many ptes as possible if all these ptes are
>>> within a folio?
>>
>> Yes.
>>
>>> If so, why not calculate end before calling zap_anon_pte_range()?
>>> That would make zap_anon_pte_range() simpler.
>>
>> I'm not sure I follow. That's currently done in
>> calc_anon_folio_map_pgcount(). I could move it to here, but I'm not sure
>> that makes things simpler, just puts more code in here and less in there?
>
> Otherwise your zap_anon_pte_range() is really zap_anon_pte_in_folio_range() or
> some other more descriptive name. When I first looked at the name, I thought
> the PTEs would be zapped all the way to end. But that is not the case when I
> look at the code. And future users could easily be confused too and use it in
> the wrong way.

OK, I see your point. Let me pull the page count calculation out to the caller
and pass it to zap_anon_pte_range(); there is a rough sketch of what I mean at
the end of this mail. Then I think we can keep the name as is?

>
> BTW, page_addr() needs a better name and is easily confused with the existing
> page_address().

Yeah... I'll try to think of something for v2.

>
>>
>>> Also check if page is part of a large folio first to make sure you can
>>> batch.
>>
>> Yeah, that's fair. I'd be inclined to put that in zap_anon_pte_range() to
>> short-circuit calc_anon_folio_map_pgcount(). But ultimately
>> zap_anon_pte_range() would still zap the single pte.
>>
>>>
>>>> +
>>>> +				rss[mm_counter(page)] -= pgcount;
>>>> +				pgcount--;
>>>> +				pte += pgcount;
>>>> +				addr += pgcount << PAGE_SHIFT;
>>>> +
>>>> +				if (unlikely(full)) {
>>>> +					force_flush = 1;
>>>> +					addr += PAGE_SIZE;
>>>> +					break;
>>>> +				}
>>>> +				continue;
>>>> +			}
>>>> +
>>>>  			ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>  							tlb->fullmm);
>>>>  			tlb_remove_tlb_entry(tlb, pte, addr);
>>>> --
>>>> 2.25.1
>>>
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>
>
> --
> Best Regards,
> Yan, Zi
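
To make the restructuring I agreed to above concrete, here is roughly what I
have in mind for the zap_pte_range() call site in v2. Untested sketch only;
the exact zap_anon_pte_range() signature may well change. It also folds in
your suggestion to check for a large folio before calculating a batch size:

	if (page && PageAnon(page) && vma_is_anonymous(vma)) {
		struct folio *folio = page_folio(page);
		bool full = false;
		int pgcount = 1;

		/*
		 * Only bother calculating a batch for large folios; a
		 * small folio's batch is trivially a single pte.
		 */
		if (folio_test_large(folio))
			pgcount = calc_anon_folio_map_pgcount(folio, page,
							pte, addr, end);

		/*
		 * zap_anon_pte_range() is now told exactly how many ptes
		 * to zap instead of re-deriving the range itself, so the
		 * name matches the behaviour. It can still return early,
		 * having zapped fewer ptes, if the mmu_gather batch fills
		 * up.
		 */
		pgcount = zap_anon_pte_range(tlb, vma, page, pte, addr,
						pgcount, &full);

		rss[mm_counter(page)] -= pgcount;

		/*
		 * pgcount ptes were consumed; step over all but the last,
		 * since the loop increment handles the final one.
		 */
		pgcount--;
		pte += pgcount;
		addr += pgcount << PAGE_SHIFT;

		if (unlikely(full)) {
			force_flush = 1;
			addr += PAGE_SIZE;
			break;
		}
		continue;
	}

Does that address your concern?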