As per yesterday's discussion, I'm going to rework this series into a more
generic version that covers pagecache folios too. But your comments will
still be relevant there, so answers below.

On 03/08/2023 14:38, David Hildenbrand wrote:
> On 27.07.23 16:18, Ryan Roberts wrote:
>> This allows batching the rmap removal with folio_remove_rmap_range(),
>> which means we avoid spuriously adding a partially unmapped folio to the
>> deferred split queue in the common case, which reduces split queue lock
>> contention.
>>
>> Previously each page was removed from the rmap individually with
>> page_remove_rmap(). If the first page belonged to a large folio, this
>> would cause page_remove_rmap() to conclude that the folio was now
>> partially mapped and add the folio to the deferred split queue. But
>> subsequent calls would cause the folio to become fully unmapped, meaning
>> there is no value to adding it to the split queue.
>>
>> A complicating factor is that for platforms where MMU_GATHER_NO_GATHER
>> is enabled (e.g. s390), __tlb_remove_page() drops a reference to the
>> page. This means that the folio reference count could drop to zero while
>> still in use (i.e. before folio_remove_rmap_range() is called). This
>> does not happen on other platforms because the actual page freeing is
>> deferred.
>>
>> Solve this by appropriately getting/putting the folio to guarantee it
>> does not get freed early. Given the need to get/put the folio in the
>> batch path, we stick to the non-batched path if the folio is not large.
>> While the batched path is functionally correct for a folio with 1 page,
>> it is unlikely to be as efficient as the existing non-batched path in
>> this case.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>> ---
>>   mm/memory.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 132 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 01f39e8144ef..d35bd8d2b855 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1391,6 +1391,99 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>>  		pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
>>  }
>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>> +				struct page *anchor, unsigned long anchor_vaddr)
>> +{
>> +	unsigned long offset;
>> +	unsigned long vaddr;
>> +
>> +	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>> +	vaddr = anchor_vaddr + offset;
>> +
>> +	if (anchor > page) {
>> +		if (vaddr > anchor_vaddr)
>> +			return 0;
>> +	} else {
>> +		if (vaddr < anchor_vaddr)
>> +			return ULONG_MAX;
>> +	}
>> +
>> +	return vaddr;
>> +}
>> +
>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>> +				      struct page *page, pte_t *pte,
>> +				      unsigned long addr, unsigned long end)
>> +{
>> +	pte_t ptent;
>> +	int floops;
>> +	int i;
>> +	unsigned long pfn;
>> +	struct page *folio_end;
>> +
>> +	if (!folio_test_large(folio))
>> +		return 1;
>> +
>> +	folio_end = &folio->page + folio_nr_pages(folio);
>> +	end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>> +	floops = (end - addr) >> PAGE_SHIFT;
>> +	pfn = page_to_pfn(page);
>> +	pfn++;
>> +	pte++;
>> +
>> +	for (i = 1; i < floops; i++) {
>> +		ptent = ptep_get(pte);
>> +
>> +		if (!pte_present(ptent) || pte_pfn(ptent) != pfn)
>> +			break;
>> +
>> +		pfn++;
>> +		pte++;
>> +	}
>> +
>> +	return i;
>> +}
>> +
>> +static unsigned long try_zap_anon_pte_range(struct mmu_gather *tlb,
>> +				struct vm_area_struct *vma,
>> +				struct folio *folio,
>> +				struct page *page, pte_t *pte,
>> +				unsigned long addr, int nr_pages,
>> +				struct zap_details *details)
>> +{
>> +	struct mm_struct *mm = tlb->mm;
>> +	pte_t ptent;
>> +	bool full;
>> +	int i;
>> +
>> +	/* __tlb_remove_page may drop a ref; prevent going to 0 while in use. */
>> +	folio_get(folio);
> 
> Is there no way around that? It feels wrong and IMHO a bit ugly.

I haven't found a good one, but I'm all ears. On the non-batched path,
__tlb_remove_page() is called before page_remove_rmap(). On this path, the
whole point is that we just call folio_remove_rmap_range() for the whole
batch. If I'm right, we must only remove from the rmap *after* doing the pte
clear to avoid races. And we need to call __tlb_remove_page() as we go,
because it might run out of space at any time.

If I knew how many pages the tlb could accept ahead of time, I could defer
the __tlb_remove_page() calls to after folio_remove_rmap_range(). But there
is no accessor for that as far as I can see. It looks fairly complicated to
calculate it too.
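To make that concrete, the shape I'd like is roughly the below, replacing the
current loop in try_zap_anon_pte_range(). tlb_remaining_pages() is
hypothetical (no such accessor exists today; the name is made up), and none
of this is tested:

	/*
	 * HYPOTHETICAL: tlb_remaining_pages() would report how many more
	 * pages the gather can accept before it has to flush. Clamping the
	 * batch to that means none of the __tlb_remove_page() calls below
	 * can trigger a free before the batched rmap removal has run.
	 */
	nr_pages = min_t(int, nr_pages, tlb_remaining_pages(tlb));

	for (i = 0; i < nr_pages; i++) {
		ptent = ptep_get_and_clear_full(mm, addr + i * PAGE_SIZE,
						pte + i, tlb->fullmm);
		tlb_remove_tlb_entry(tlb, pte + i, addr + i * PAGE_SIZE);
		zap_install_uffd_wp_if_needed(vma, addr + i * PAGE_SIZE,
					      pte + i, details, ptent);
	}

	/* Remove the whole batch from the rmap before any ref is dropped. */
	folio_remove_rmap_range(folio, page, nr_pages, vma);

	/* Cannot overflow the gather thanks to the clamp above. */
	for (i = 0; i < nr_pages; i++)
		__tlb_remove_page(tlb, page + i, 0);

That ordering would remove the need for the folio_get()/folio_put() entirely;
the sticking point is cheaply bounding the space left in the mmu_gather,
which is the bit I couldn't see how to do.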
> 
> With this patch, you might suddenly have mapcount > refcount for a folio,
> or am I wrong?

Yes you would. Does that break things?

> 
>> +
>> +	for (i = 0; i < nr_pages;) {
>> +		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>> +		tlb_remove_tlb_entry(tlb, pte, addr);
>> +		zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
>> +		full = __tlb_remove_page(tlb, page, 0);
>> +
>> +		if (unlikely(page_mapcount(page) < 1))
>> +			print_bad_pte(vma, addr, ptent, page);
> 
> Can we avoid new users of page_mapcount() outside rmap code, please? :)

Sure. This is just trying to replicate the same diagnostics that are done on
the non-batched path. I'm happy to remove it.

> 
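Coming back to your mapcount > refcount question above, to put some numbers
on it (my own illustrative example: a 16-page anon folio that is fully
mapped exactly once, with no extra pins or swapcache references):

  - On entry, folio_mapcount() == 16 and folio_ref_count() == 16, plus 1
    from the folio_get(), so 17.
  - After the loop, on a MMU_GATHER_NO_GATHER config, each of the 16
    __tlb_remove_page() calls has dropped a reference, so folio_ref_count()
    is down to 1 while folio_mapcount() is still 16.
  - folio_remove_rmap_range() then takes the mapcount to 0, and the final
    folio_put() takes the refcount to 0, at which point the folio can be
    freed.

So mapcount exceeds refcount only in the window between clearing the ptes
and the batched rmap removal; the folio_get() is there purely to stop the
refcount hitting 0 inside that window.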