On Mon, 09 Dec 2024 13:04:43 +0100
Valentin Schneider <vschneid@xxxxxxxxxx> wrote:

> On 05/12/24 18:31, Petr Tesarik wrote:
> > On Thu, 21 Nov 2024 16:30:16 +0100
> > Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> >> On Thu, Nov 21, 2024 at 07:07:44AM -0800, Dave Hansen wrote:
> >> > On 11/21/24 03:12, Peter Zijlstra wrote:
> >> > >> I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
> >> > >> and it correctly uses the non-deferrable flush_tlb_kernel_range().
> >> > >
> >> > > I always forget what we use global pages for, dhansen might know, but
> >> > > let me try and have a look.
> >> > >
> >> > > I *think* we only have GLOBAL on kernel text, and that only sometimes.
> >> >
> >> > I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.
> >>
> >> Yah, I suppose I am. That was the last time I had a good look at this
> >> stuff :-)
> >>
> >> > Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings.
> >> > Before PCIDs, global mappings let the kernel TLB entries live across CR3
> >> > writes. When PCIDs are in play, global mappings let two different ASIDs
> >> > share TLB entries.
> >>
> >> Hurmph.. bah. That means we do need that horrible CR4 dance :/
> >
> > In general, yes.
> >
> > But I wonder what exactly was the original scenario encountered by
> > Valentin. I mean, if TLB entry invalidations were necessary to sync
> > changes to kernel text after flipping a static branch, then it might be
> > less overhead to make a list of affected pages and call INVLPG on them.
> >
> > AFAIK there is currently no such IPI function for doing that, but if we
> > could add one. If the list of invalidated global pages is reasonably
> > short, of course.
> >
> > Valentin, do you happen to know?
> >
>
> So from my experimentation (hackbench + kernel compilation on housekeeping
> CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only
> occurred from vunmap() - mainly from all the hackbench threads coming and
> going.
>
> Static branch updates only seem to trigger the sync_core() IPI, at least on
> x86.

Thank you, this is helpful.

So, these allocations span more than tlb_single_page_flush_ceiling
pages (default 33). Is THP enabled? If yes, we could possibly get below
that threshold by improving flushing of huge pages (cf. footnote [1] in
Documentation/arch/x86/tlb.rst).

OTOH even though a series of INVLPG may reduce subsequent TLB misses,
it will not exactly improve latency, so it would go against the main
goal of this whole patch series.

Hmmm... I see, the CR4 dance is the best solution after all. :-|

Petr T
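
For reference, the threshold comparison mentioned above can be sketched
roughly as follows. This is a simplified userspace illustration, not the
kernel's code: the sketch_* names are hypothetical, 4 KiB base pages are
assumed, and the ceiling is hard-coded to the default of 33 rather than
read from the real tunable.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4 KiB base pages, as on x86 without huge pages. */
#define SKETCH_PAGE_SHIFT 12

/* Assumption: default ceiling of 33 pages, as quoted in the mail. */
static const unsigned long sketch_flush_ceiling = 33;

/*
 * Hypothetical helper, not a kernel function. Returns true when the
 * range [start, end) covers more pages than the ceiling, i.e. when a
 * full flush would be preferred over one INVLPG per page.
 */
static bool sketch_wants_full_flush(unsigned long start, unsigned long end)
{
	assert(end >= start);
	return (end - start) > (sketch_flush_ceiling << SKETCH_PAGE_SHIFT);
}
```

With these assumptions, a range of exactly 33 pages still qualifies for
per-page invalidation, while 34 pages or more would take the full flush.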