Re: [PATCH v2] mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables()

On Thu, 23 Jan 2025, Roman Gushchin wrote:
> On Thu, Jan 23, 2025 at 10:45:31PM +0100, Peter Zijlstra wrote:
> > On Wed, Jan 22, 2025 at 11:27:16PM +0000, Roman Gushchin wrote:
> > > Commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas")
> > > added a forced tlbflush to tlb_vma_end(), which is required to avoid a
> > > race between munmap() and unmap_mapping_range(). However it added some
> > > overhead to other paths where tlb_vma_end() is used, but vmas are not
> > > removed, e.g. madvise(MADV_DONTNEED).
> > > 
> > > Fix this by moving the tlb flush out of tlb_end_vma() into
> > > free_pgtables(), somewhat similar to the stable version of the
> > > original commit: e.g. stable commit 895428ee124a ("mm: Force TLB flush
> > > for PFNMAP mappings before unlink_file_vma()").
> > 
> > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 398c031be9ba..c2a9effb2e32 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -365,6 +365,13 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> > >  {
> > >  	struct unlink_vma_file_batch vb;
> > >  
> > > +	/*
> > > +	 * VM_PFNMAP and VM_MIXEDMAP maps require special handling here:
> > > +	 * force flush TLBs for such ranges to avoid munmap() vs
> > > +	 * unmap_mapping_range() races.
> > > +	 */
> > > +	tlb_flush_mmu_pfnmap(tlb);
> > > +
> > >  	do {
> > >  		unsigned long addr = vma->vm_start;
> > >  		struct vm_area_struct *next;
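
(tlb_flush_mmu_pfnmap() itself is not shown in the quoted hunk: from
the rest of the thread I take it to amount to something like the sketch
below, assuming tlb->vma_pfn has been set whenever a VM_PFNMAP or
VM_MIXEDMAP vma was unmapped; correct me if the v2 helper differs.)

	static inline void tlb_flush_mmu_pfnmap(struct mmu_gather *tlb)
	{
		/* flush any pending TLBI before the vmas are unlinked */
		if (!tlb->fullmm && tlb->vma_pfn)
			tlb_flush_mmu_tlbonly(tlb);
	}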
> > 
> > So I'm not sure I much like this name; it is fairly random and very
> > much does not convey the reason we're calling it.
> > 
> > Anyway, going back to reading the original commit (because this
> > changelog isn't helping me much), the problem appears to be that
> > unlinking the vma will make unmap_mapping_range() skip things (no vma,
> > nothing to do, etc), return early, and bad things happen.

The changelog of commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush
VM_PFNMAP vmas") has not helped me either.  Nor could I locate any
discussion (Jann, Linus, Peter, Will?) that led up to it.

I can see that a racing unmap_mapping_range() may complete before
the munmap() has done its TLB flush, but (a) so what? and (b) how
does VM_PFNMAP or page_mapcount affect that? and (c) did Linus's
later delay_rmap changes make any difference to the story here?
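
For concreteness, the interleaving I can see (my own sketch, which may
well be missing the point) is:

	munmap()			unmap_mapping_range()
	  unmap_vmas()
	    [ptes cleared, TLBI still
	     pending in the mmu_gather]
	  free_pgtables()
	    unlink_file_vma()
					finds no vma on i_mmap, returns
					early: caller assumes no further
					access through the mapping
	  tlb_finish_mmu()
	    [TLBI issued at last]

but that only restates "completes before the flush" from above.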

Jann, please spell it out for us: I think I'm not the only one who
fails to understand the race in question.  At present I'm hovering
between thinking there was no bug to be fixed in the first place, and
thinking that a tlb_flush_mmu_tlbonly() has to be done before unlinking
vmas in all cases (or in all file cases?).

I'm probably wrong in both directions, but cannot see it without help.

> > 
> > So am I right in thinking we need this tlb flush before all those
> > unlink_{anon,file}_vma*() calls, and that the whole free_pgd_range()
> > that goes and invalidates the page-tables is too late?
> > 
> > So how about we do something like so instead?
> 
> Overall looks good to me, except one question (below).

To me, Peter's patch looks much like yours, except with different
names and comments, plus the "vma" error you point out below.

> > +static inline void tlb_free_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
> >  {
> >  	if (tlb->fullmm)
> >  		return;
> >  
> >  	/*
> >  	 * VM_PFNMAP is more fragile because the core mm will not track the
> > -	 * page mapcount -- there might not be page-frames for these PFNs after
> > -	 * all. Force flush TLBs for such ranges to avoid munmap() vs
> > -	 * unmap_mapping_range() races.
> > +	 * page mapcount -- there might not be page-frames for these PFNs
> > +	 * after all.
> > +	 *
> > +	 * Specifically, there is a race between munmap() and
> > +	 * unmap_mapping_range(), where munmap() will unlink the VMA, such
> > +	 * that unmap_mapping_range() will no longer observe the VMA, no-op,
> > +	 * and return prematurely without observing the TLBI.
> > +	 *
> > +	 * So if we're about to unlink such a VMA, and we have pending
> > +	 * TLBI for such a vma, flush things now.
> >  	 */
> > -	if (tlb->vma_pfn || !IS_ENABLED(CONFIG_MMU_GATHER_MERGE_VMAS)) {
> > -		/*
> > -		 * Do a TLB flush and reset the range at VMA boundaries; this avoids
> > -		 * the ranges growing with the unused space between consecutive VMAs.
> > -		 */
> > +	if ((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && tlb->vma_pfn)
> >  		tlb_flush_mmu_tlbonly(tlb);
> > -	}
> 
> Why do we need to re-check vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP) here?
> 
> In free_pgtables() we're iterating over multiple vmas. What if the first has
> no VM_PFNMAP set, but some others do? I don't know if it's even possible, but
> it's not obvious that it's not possible either.

Yes, of course that is possible: checking the "vma" passed to
tlb_free_vma() is a mistake (hmm, and there's no "free" in it either).
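
Presumably the flush wants to be driven purely off the gather's
accumulated state, with vma_pfn made sticky across vmas - a sketch
only, keeping the names from Peter's patch above:

	static inline void tlb_free_vma(struct mmu_gather *tlb,
					struct vm_area_struct *vma)
	{
		if (tlb->fullmm)
			return;

		/*
		 * tlb->vma_pfn must accumulate (|=) across every vma seen,
		 * so the flush is not gated on the flags of whichever vma
		 * happens to be unlinked first.
		 */
		if (tlb->vma_pfn)
			tlb_flush_mmu_tlbonly(tlb);
	}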

Hugh



