Re: [RFC PATCH v3 13/15] context_tracking,x86: Add infrastructure to defer kernel TLBI

Petr Tesarik <ptesarik@xxxxxxxx> · Mon, 9 Dec 2024 13:33:09 +0100

On Mon, 09 Dec 2024 13:04:43 +0100
Valentin Schneider <vschneid@xxxxxxxxxx> wrote:

> On 05/12/24 18:31, Petr Tesarik wrote:
> > On Thu, 21 Nov 2024 16:30:16 +0100
> > Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >  
> >> On Thu, Nov 21, 2024 at 07:07:44AM -0800, Dave Hansen wrote:  
> >> > On 11/21/24 03:12, Peter Zijlstra wrote:  
> >> > >> I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
> >> > >> and it correctly uses the non-deferrable flush_tlb_kernel_range().  
> >> > >
> >> > > I always forget what we use global pages for, dhansen might know, but
> >> > > let me try and have a look.
> >> > >
> >> > > I *think* we only have GLOBAL on kernel text, and that only sometimes.  
> >> >
> >> > I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.  
> >>
> >> Yah, I suppose I am. That was the last time I had a good look at this
> >> stuff :-)
> >>  
> >> > Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings.
> >> > Before PCIDs, global mappings let the kernel TLB entries live across CR3
> >> > writes. When PCIDs are in play, global mappings let two different ASIDs
> >> > share TLB entries.  
> >>
> >> Hurmph.. bah. That means we do need that horrible CR4 dance :/  
> >
> > In general, yes.
> >
> > But I wonder what exactly was the original scenario encountered by
> > Valentin. I mean, if TLB entry invalidations were necessary to sync
> > changes to kernel text after flipping a static branch, then it might be
> > less overhead to make a list of affected pages and call INVLPG on them.
> >
> > AFAIK there is currently no such IPI function for doing that, but if we
> > could add one. If the list of invalidated global pages is reasonably
> > short, of course.
> >
> > Valentin, do you happen to know?
> >  
> 
> So from my experimentation (hackbench + kernel compilation on housekeeping
> CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only
> occurred from vunmap() - mainly from all the hackbench threads coming and
> going.
> 
> Static branch updates only seem to trigger the sync_core() IPI, at least on
> x86.

Thank you, this is helpful.

So, these allocations span more than tlb_single_page_flush_ceiling
pages (default 33). Is THP enabled? If yes, we could possibly get below
that threshold by improving flushing of huge pages (cf. footnote [1] in
Documentation/arch/x86/tlb.rst).

OTOH even though a series of INVLPG may reduce subsequent TLB misses,
it will not exactly improve latency, so it would go against the main
goal of this whole patch series.

Hmmm... I see, the CR4 dance is the best solution after all. :-|

Petr T