On Thu, Jan 23, 2025 at 10:45:31PM +0100, Peter Zijlstra wrote: > On Wed, Jan 22, 2025 at 11:27:16PM +0000, Roman Gushchin wrote: > > Commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas") > > added a forced tlbflush to tlb_vma_end(), which is required to avoid a > > race between munmap() and unmap_mapping_range(). However it added some > > overhead to other paths where tlb_vma_end() is used, but vmas are not > > removed, e.g. madvise(MADV_DONTNEED). > > > > Fix this by moving the tlb flush out of tlb_end_vma() into > > free_pgtables(), somewhat similar to the stable version of the > > original commit: e.g. stable commit 895428ee124a ("mm: Force TLB flush > > for PFNMAP mappings before unlink_file_vma()"). > > > > diff --git a/mm/memory.c b/mm/memory.c > > index 398c031be9ba..c2a9effb2e32 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -365,6 +365,13 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, > > { > > struct unlink_vma_file_batch vb; > > > > + /* > > + * VM_PFNMAP and VM_MIXEDMAP maps require a special handling here: > > + * force flush TLBs for such ranges to avoid munmap() vs > > + * unmap_mapping_range() races. > > + */ > > + tlb_flush_mmu_pfnmap(tlb); > > + > > do { > > unsigned long addr = vma->vm_start; > > struct vm_area_struct *next; > > So I'm not sure I muc like this name, it is fairly random and does very > much not convey the reason we're calling this. > > Anyway, going back to reading the original commit (because this > changelog isn't helping me much), the problem appears to be that > unlinking the vma will make unmap_mapping_range() skip things (no vma, > nothing to do, etc) return early and bad things happen. > > So am I right in thinking we need this tlb flush before all those > unlink_{anon,file}_vma*() calls, and that the whole free_pgd_range() > that goes and invalidates the page-tables is too late? > > So how about we do something like so instead? Overall looks good to me, except one question (below). > > --- > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h > index 709830274b75..481a24f2b839 100644 > --- a/include/asm-generic/tlb.h > +++ b/include/asm-generic/tlb.h > @@ -58,6 +58,11 @@ > * Defaults to flushing at tlb_end_vma() to reset the range; helps when > * there's large holes between the VMAs. > * > + * - tlb_free_vma() > + * > + * tlb_free_vma() marks the start of unlinking the vma and freeing > + * page-tables. > + * > * - tlb_remove_table() > * > * tlb_remove_table() is the basic primitive to free page-table directories > @@ -384,7 +389,10 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb) > * Do not reset mmu_gather::vma_* fields here, we do not > * call into tlb_start_vma() again to set them if there is an > * intermediate flush. > + * > + * Except for vma_pfn, that only cares if there's pending TLBI. > */ > + tlb->vma_pfn = 0; > } > > #ifdef CONFIG_MMU_GATHER_NO_RANGE > @@ -449,7 +457,12 @@ tlb_update_vma_flags(struct mmu_gather *tlb, struct vm_area_struct *vma) > */ > tlb->vma_huge = is_vm_hugetlb_page(vma); > tlb->vma_exec = !!(vma->vm_flags & VM_EXEC); > - tlb->vma_pfn = !!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)); > + > + /* > + * Track if there's at least one VM_PFNMAP/VM_MIXEDMAP vma > + * in the tracked range, see tlb_free_vma(). > + */ > + tlb->vma_pfn |= !!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)); > } > > static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) > @@ -548,23 +561,39 @@ static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct * > } > > static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) > +{ > + if (tlb->fullmm || IS_ENABLED(CONFIG_MMU_GATHER_MERGE_VMAS)) > + return; > + > + /* > + * Do a TLB flush and reset the range at VMA boundaries; this avoids > + * the ranges growing with the unused space between consecutive VMAs, > + * but also the mmu_gather::vma_* flags from tlb_start_vma() rely on > + * this. > + */ > + tlb_flush_mmu_tlbonly(tlb); > +} > + > +static inline void tlb_free_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) > { > if (tlb->fullmm) > return; > > /* > * VM_PFNMAP is more fragile because the core mm will not track the > - * page mapcount -- there might not be page-frames for these PFNs after > - * all. Force flush TLBs for such ranges to avoid munmap() vs > - * unmap_mapping_range() races. > + * page mapcount -- there might not be page-frames for these PFNs > + * after all. > + * > + * Specifically() there is a race between munmap() and > + * unmap_mapping_range(), where munmap() will unlink the VMA, such > + * that unmap_mapping_range() will no longer observe the VMA and > + * no-op, without observing the TLBI, returning prematurely. > + * > + * So if we're about to unlink such a VMA, and we have pending > + * TLBI for such a vma, flush things now. > */ > - if (tlb->vma_pfn || !IS_ENABLED(CONFIG_MMU_GATHER_MERGE_VMAS)) { > - /* > - * Do a TLB flush and reset the range at VMA boundaries; this avoids > - * the ranges growing with the unused space between consecutive VMAs. > - */ > + if ((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && tlb->vma_pfn) > tlb_flush_mmu_tlbonly(tlb); > - } Why do we need to re-check vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP) here? In free_pgtables() we're iterating over multiple vma's. What if the first has no VM_PFNMAP set, but some other do? Idk if it's even possible, but it's not obvious that it's not possible either. Btw, are you going to handle it yourself or you want me to send a v4 based on your version? Thank you!