On Sun, Jan 12, 2025 at 4:55 PM Rik van Riel <riel@xxxxxxxxxxx> wrote:
> Instead of doing a system-wide TLB flush from arch_tlbbatch_flush,
> queue up asynchronous, targeted flushes from arch_tlbbatch_add_pending.
>
> This also allows us to avoid adding the CPUs of processes using broadcast
> flushing to the batch->cpumask, and will hopefully further reduce TLB
> flushing from the reclaim and compaction paths.
[...]
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 80375ef186d5..532911fbb12a 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1658,9 +1658,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>  	 * a local TLB flush is needed. Optimize this use-case by calling
>  	 * flush_tlb_func_local() directly in this case.
>  	 */
> -	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> -		invlpgb_flush_all_nonglobals();
> -	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
> +	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
>  		flush_tlb_multi(&batch->cpumask, info);
>  	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
>  		lockdep_assert_irqs_enabled();
> @@ -1669,12 +1667,49 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>  		local_irq_enable();
>  	}
>
> +	/*
> +	 * If we issued (asynchronous) INVLPGB flushes, wait for them here.
> +	 * The cpumask above contains only CPUs that were running tasks
> +	 * not using broadcast TLB flushing.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB) && batch->used_invlpgb) {
> +		tlbsync();
> +		migrate_enable();
> +		batch->used_invlpgb = false;
> +	}
> +
>  	cpumask_clear(&batch->cpumask);
>
>  	put_flush_tlb_info();
>  	put_cpu();
>  }
>
> +void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> +			       struct mm_struct *mm,
> +			       unsigned long uaddr)
> +{
> +	if (static_cpu_has(X86_FEATURE_INVLPGB) && mm_global_asid(mm)) {
> +		u16 asid = mm_global_asid(mm);
> +		/*
> +		 * Queue up an asynchronous invalidation. The corresponding
> +		 * TLBSYNC is done in arch_tlbbatch_flush(), and must be done
> +		 * on the same CPU.
> +		 */
> +		if (!batch->used_invlpgb) {
> +			batch->used_invlpgb = true;
> +			migrate_disable();
> +		}
> +		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
> +		/* Do any CPUs supporting INVLPGB need PTI? */
> +		if (static_cpu_has(X86_FEATURE_PTI))
> +			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
> +	} else {
> +		inc_mm_tlb_gen(mm);
> +		cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> +	}
> +	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
> +}

How does this work if the MM is currently transitioning to a global
ASID? Should the "mm_global_asid(mm)" check maybe be replaced with
something that checks if the MM has fully transitioned to a global
ASID, so that we keep using the classic path if there might be
holdout CPUs?
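
As a rough illustration of the kind of check I mean (completely
untested sketch; "mm_fully_global_asid()" is a made-up name, and the
"asid_transition" flag is just a placeholder for however this series
tracks an mm whose switch to a global ASID hasn't finished yet):

/*
 * Only report a usable global ASID once every CPU running this mm is
 * guaranteed to actually be using it, so that INVLPGB can't miss
 * holdout CPUs that are still on a per-CPU dynamic ASID.
 */
static inline bool mm_fully_global_asid(struct mm_struct *mm)
{
	/* No global ASID assigned at all. */
	if (!mm_global_asid(mm))
		return false;

	/*
	 * A global ASID is assigned, but some CPUs may have switched to
	 * this mm before the assignment and still be running on a
	 * dynamic ASID (placeholder flag name).
	 */
	if (READ_ONCE(mm->context.asid_transition))
		return false;

	return true;
}

arch_tlbbatch_add_pending() would then test
"static_cpu_has(X86_FEATURE_INVLPGB) && mm_fully_global_asid(mm)"
instead of the bare mm_global_asid() check, and keep falling back to
the batch->cpumask path while the transition is still in flight.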