On Sat, Jan 4, 2025 at 4:09 AM Rik van Riel <riel@xxxxxxxxxxx> wrote:
> On Fri, 2025-01-03 at 23:11 +0100, Jann Horn wrote:
> > On Fri, Jan 3, 2025 at 10:55 PM Rik van Riel <riel@xxxxxxxxxxx> wrote:
> > > On Fri, 2025-01-03 at 19:39 +0100, Jann Horn wrote:
> > > > 02fc2aa06e9e0ecdba3fe948cafe5892b72e86c0..3da645139748538daac70166618d8ad95116eb74 100644
> > > > --- a/arch/x86/include/asm/tlbflush.h
> > > > +++ b/arch/x86/include/asm/tlbflush.h
> > > > @@ -242,7 +242,7 @@ void flush_tlb_multi(const struct cpumask *cpumask,
> > > >  	flush_tlb_mm_range((vma)->vm_mm, start, end,		\
> > > >  			   ((vma)->vm_flags & VM_HUGETLB)	\
> > > >  				? huge_page_shift(hstate_vma(vma))	\
> > > > -				: PAGE_SHIFT, false)
> > > > +				: PAGE_SHIFT, true)
> > > >
> > >
> > > The code looks good, but should this macro get
> > > a comment indicating that code that only frees
> > > pages, but not page tables, should be calling
> > > flush_tlb() instead?
> >
> > Documentation/core-api/cachetlb.rst seems to be the common place
> > that's supposed to document the rules - the macro I'm touching is just
> > the x86 implementation. (The arm64 implementation also has some fairly
> > extensive comments that say flush_tlb_range() "also invalidates any
> > walk-cache entries associated with translations for the specified
> > address range" while flush_tlb_page() "only invalidates a single,
> > last-level page-table entry and therefore does not affect any
> > walk-caches".) I wouldn't want to add yet more documentation for this
> > API inside the X86 code. I guess it would make sense to add pointers
> > from the x86 code to the documentation (and copy the details about
> > last-level TLBs from the arm64 code into the docs).
> >
> > I don't see a function flush_tlb() outside of some (non-x86) arch
> > code.
>
> I see zap_pte_range() calling tlb_flush_mmu(),
> which calls tlb_flush_mmu_tlbonly() in include/asm-generic/tlb.h,
> which in turn calls tlb_flush().
>
> The asm-generic version of tlb_flush() goes through
> flush_tlb_mm(), which on x86 would call flush_tlb_mm_range
> with flush_tables = true.
>
> Luckily x86 seems to have its own implementation of
> tlb_flush(), which avoids that issue.

Aah, right. Yeah, I think the tlb_flush() infrastructure with
"struct mmu_gather" is probably one of the two really optimized TLB
flushing hotpaths (the other one being the reclaim path). I think
tlb_flush() is for somewhat different use cases though - my
understanding is that it is mainly for operations that need batching
and/or want to delay TLB flushes while dropping page table locks.

> > I don't know if it makes sense to tell developers to not use
> > flush_tlb_range() for freeing pages. If the performance of
> > flush_tlb_range() actually is an issue, I guess one fix would be to
> > refactor this and add a parameter or something?
> >
> I don't know whether this is a real issue on
> architectures other than x86.

arm64 seems to have code specifically for doing flushes without
affecting cached higher-level entries - __flush_tlb_range_nosync()
receives a "last_level" parameter (which is plumbed through from the
arm64 version of tlb_flush()) and picks "vale1is" or "vae1is"
depending on it.

> For now it looks like the code does the right
> thing when only pages are being freed, so we
> may not need that parameter.
>
> --
> All Rights Reversed.
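
(For completeness, since it came up above: the x86-specific
tlb_flush() I mean lives in arch/x86/include/asm/tlb.h and looks
roughly like the sketch below - written from memory rather than from
a checked-out tree, so treat the exact details as approximate. The
point that matters here is that it passes tlb->freed_tables through
to flush_tlb_mm_range() instead of hardcoding it:

  /* Rough sketch of the x86 tlb_flush(), not verbatim kernel code. */
  static inline void tlb_flush(struct mmu_gather *tlb)
  {
          unsigned long start = 0UL, end = TLB_FLUSH_ALL;
          unsigned int stride_shift = tlb_get_unmap_shift(tlb);

          if (!tlb->fullmm && !tlb->need_flush_all) {
                  start = tlb->start;
                  end = tlb->end;
          }

          /*
           * The "did we free page tables" decision comes from the
           * mmu_gather state, so a zap that only frees pages does not
           * force a walk-cache-invalidating flush.
           */
          flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
                             tlb->freed_tables);
  }

If I remember right, arm64's tlb_flush() carries the same information
via something like "bool last_level = !tlb->freed_tables" before
handing it to __flush_tlb_range().)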