On Jul 7, 2022, at 10:56 PM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote:

> On Jul 7, 2022, at 9:23 PM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>
>> On Jul 7, 2022, at 8:27 PM, Hugh Dickins <hughd@xxxxxxxxxx> wrote:
>>
>>> On Mon, 6 Jun 2022, Nadav Amit wrote:
>>>
>>>> From: Nadav Amit <namit@xxxxxxxxxx>
>>>>
>>>> On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly
>>>> contended and reading it should (arguably) be avoided as much as
>>>> possible.
>>>>
>>>> Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally,
>>>> even when it is not necessary (e.g., the mm was already switched).
>>>> This is wasteful.
>>>>
>>>> Moreover, one of the existing optimizations is to read the mm's tlb_gen to
>>>> see if there are additional in-flight TLB invalidations and flush the
>>>> entire TLB in such a case. However, if the request's tlb_gen was already
>>>> flushed, the benefit of checking the mm's tlb_gen is likely to be offset
>>>> by the overhead of the check itself.
>>>>
>>>> Running will-it-scale with tlb_flush1_threads shows a considerable
>>>> benefit on 56-core Skylake (up to +24%):
>>>>
>>>> threads	Baseline (v5.17+)	+Patch
>>>> 1	159960			160202
>>>> 5	310808			308378	(-0.7%)
>>>> 10	479110			490728
>>>> 15	526771			562528
>>>> 20	534495			587316
>>>> 25	547462			628296
>>>> 30	579616			666313
>>>> 35	594134			701814
>>>> 40	612288			732967
>>>> 45	617517			749727
>>>> 50	637476			735497
>>>> 55	614363			778913	(+24%)
>>>>
>>>> Acked-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
>>>> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>>>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>>>> Cc: Andy Lutomirski <luto@xxxxxxxxxx>
>>>> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>>>> Cc: x86@xxxxxxxxxx
>>>> Signed-off-by: Nadav Amit <namit@xxxxxxxxxx>
>>>>
>>>> --
>>>>
>>>> Note: The benchmarked kernels include Dave's revert of commit
>>>> 6035152d8eeb ("x86/mm/tlb: Open-code on_each_cpu_cond_mask() for
>>>> tlb_is_not_lazy()")
>>>> ---
>>>> arch/x86/mm/tlb.c | 18 +++++++++++++++++-
>>>> 1 file changed, 17 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>>>> index d400b6d9d246..d9314cc8b81f 100644
>>>> --- a/arch/x86/mm/tlb.c
>>>> +++ b/arch/x86/mm/tlb.c
>>>> @@ -734,10 +734,10 @@ static void flush_tlb_func(void *info)
>>>>  	const struct flush_tlb_info *f = info;
>>>>  	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
>>>>  	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
>>>> -	u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
>>>>  	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
>>>>  	bool local = smp_processor_id() == f->initiating_cpu;
>>>>  	unsigned long nr_invalidate = 0;
>>>> +	u64 mm_tlb_gen;
>>>>
>>>>  	/* This code cannot presently handle being reentered. */
>>>>  	VM_WARN_ON(!irqs_disabled());
>>>> @@ -771,6 +771,22 @@ static void flush_tlb_func(void *info)
>>>>  		return;
>>>>  	}
>>>>
>>>> +	if (f->new_tlb_gen <= local_tlb_gen) {
>>>> +		/*
>>>> +		 * The TLB is already up to date in respect to f->new_tlb_gen.
>>>> +		 * While the core might be still behind mm_tlb_gen, checking
>>>> +		 * mm_tlb_gen unnecessarily would have negative caching effects
>>>> +		 * so avoid it.
>>>> +		 */
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * Defer mm_tlb_gen reading as long as possible to avoid cache
>>>> +	 * contention.
>>>> +	 */
>>>> +	mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
>>>> +
>>>>  	if (unlikely(local_tlb_gen == mm_tlb_gen)) {
>>>>  		/*
>>>>  		 * There's nothing to do: we're already up to date.  This can
>>>> --
>>>> 2.25.1
>>>
>>> I'm sorry, but bisection and reversion show that this commit,
>>> aa44284960d550eb4d8614afdffebc68a432a9b4 in current linux-next,
>>> is responsible for the "internal compiler error: Segmentation fault"s
>>> I get when running kernel builds on tmpfs in 1G memory, with lots of
>>> swapping.
>>>
>>> That tmpfs is using huge pages as much as it can, so splitting and
>>> collapsing, compaction and page migration are entailed, in case that's
>>> relevant (maybe this commit is perfect, but there's a TLB flushing
>>> bug over there in mm which this commit just exposes).
>>>
>>> Whether those segfaults happen without the huge page element,
>>> I have not done enough testing to tell - there are other bugs with
>>> swapping in current linux-next; indeed, I wouldn't even have found
>>> this one if I hadn't already been on a bisection for another bug
>>> and got thrown off course by these segfaults.
>>>
>>> I hope that you can work out what might be wrong with this,
>>> but meantime I think it needs to be reverted.
>>
>> I always find it surprising how trivial one-liners fail.
>>
>> As you probably know, debugging these kinds of things is hard. I see two
>> possible cases:
>>
>> 1. The failure is directly related to this optimization. The immediate
>> suspect in my mind is something to do with PCID/ASID.
>>
>> 2. The failure is due to another bug that was papered over by “enough” TLB
>> flushes.
>>
>> I will look into the code. But if possible, it would be helpful to
>> know whether you get the failure with the “nopcid” kernel parameter. If it
>> passes, it wouldn't say much, but if it fails, I think (2) is more likely.
>>
>> Not arguing against a revert, but, in some way, if the test fails, it can
>> indicate that the optimization “works”…
>>
>> I'll put some time into looking deeper at the code, but it would be very
>> helpful if you could let me know what happens with nopcid.
>
> Actually, using only “nopcid” would most likely make the problem go away if
> we have PTI enabled. So to get a good indication, a check of whether it
> reproduces with both “nopti” and “nopcid” is needed.
>
> I don't have a better answer yet. Still trying to see what might have gone
> wrong.

Ok. My bad. Sorry. arch_tlbbatch_flush() does not set any generation in
flush_tlb_info. Bad. Should be fixed by something like the following - I'll
send a proper patch tomorrow:

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d9314cc8b81f..9f19894c322f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -771,7 +771,7 @@ static void flush_tlb_func(void *info)
 		return;
 	}
 
-	if (f->new_tlb_gen <= local_tlb_gen) {
+	if (unlikely(f->mm && f->new_tlb_gen <= local_tlb_gen)) {
 		/*
 		 * The TLB is already up to date in respect to f->new_tlb_gen.
 		 * While the core might be still behind mm_tlb_gen, checking
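
[Editor's note: the sketch below is a minimal, hypothetical userspace model of
the check being discussed, not kernel code. The struct and the helper names
skip_flush_buggy()/skip_flush_fixed() are simplified stand-ins invented for
illustration; only the two compared expressions mirror the patch. It shows why
a request whose new_tlb_gen was never set (as on the arch_tlbbatch_flush()
path, where no mm-specific generation is carried) always compares as
"<= local_tlb_gen" and so gets skipped by the original check, while gating the
check on f->mm restores the flush.]

/* Hypothetical userspace model of the generation check - not kernel code. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct flush_tlb_info {
	void *mm;		/* NULL for batched, non-mm-specific requests */
	uint64_t new_tlb_gen;	/* left at 0 when the caller never set it */
};

/* Original check: an unset (zero) new_tlb_gen always looks "already done". */
static bool skip_flush_buggy(const struct flush_tlb_info *f, uint64_t local_tlb_gen)
{
	return f->new_tlb_gen <= local_tlb_gen;
}

/* Fixed check: only trust new_tlb_gen when the request carries an mm. */
static bool skip_flush_fixed(const struct flush_tlb_info *f, uint64_t local_tlb_gen)
{
	return f->mm && f->new_tlb_gen <= local_tlb_gen;
}

int main(void)
{
	/* A batched flush request: no mm, new_tlb_gen never filled in. */
	struct flush_tlb_info batch = { .mm = NULL, .new_tlb_gen = 0 };
	uint64_t local_tlb_gen = 3;	/* this CPU has flushed up to gen 3 */

	printf("buggy check skips flush: %d (needed flush wrongly skipped)\n",
	       skip_flush_buggy(&batch, local_tlb_gen));
	printf("fixed check skips flush: %d (flush is performed)\n",
	       skip_flush_fixed(&batch, local_tlb_gen));
	return 0;
}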