On 2/13/25 08:14, Rik van Riel wrote:
> Use broadcast TLB invalidation, using the INVLPGB instruction, on AMD EPYC 3
> and newer CPUs.

Could we please just zap the "on AMD EPYC 3 and newer CPUs" from all of
these patches? It can be mentioned once in the cover letter or
something, but it doesn't need to be repeated.

> In order to not exhaust PCID space, and keep TLB flushes local for single
> threaded processes, we only hand out broadcast ASIDs to processes active on
> 4 or more CPUs.

Please no "we's". Use imperative voice. This also needs some fleshing
out. Perhaps:

	There is not enough room in the 12-bit ASID address space to
	hand out broadcast ASIDs to every process. Only hand out
	broadcast ASIDs to processes when they are observed to be
	simultaneously running on 4 or more CPUs.

	Most importantly, this ensures that single threaded processes
	continue to use the cheaper, legacy, local TLB invalidation
	instructions like INVLPG.

> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +	u16 global_asid;
> +	bool asid_transition;
> +#endif
> +
>  } mm_context_t;

Please give these at least a line or two comment explaining what they
do.

>  #define INIT_MM_CONTEXT(mm)				\
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 795fdd53bd0a..d670699d32c2 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
>  #define enter_lazy_tlb enter_lazy_tlb
>  extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
>  
> +extern void destroy_context_free_global_asid(struct mm_struct *mm);
> +
>  /*
>   * Init a new mm. Used on mm copies, like at fork()
>   * and on mm's that are brand-new, like at execve().
> @@ -161,6 +163,14 @@ static inline int init_new_context(struct task_struct *tsk,
>  		mm->context.execute_only_pkey = -1;
>  	}
>  #endif
> +
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> +		mm->context.global_asid = 0;
> +		mm->context.asid_transition = false;
> +	}
> +#endif
> +
>  	mm_reset_untag_mask(mm);
>  	init_new_context_ldt(mm);
>  	return 0;
> @@ -170,6 +180,10 @@ static inline int init_new_context(struct task_struct *tsk,
>  static inline void destroy_context(struct mm_struct *mm)
>  {
>  	destroy_context_ldt(mm);
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		destroy_context_free_global_asid(mm);
> +#endif
>  }
>  
>  extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index bda7080dec83..3080cb8d21dc 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -6,6 +6,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/sched.h>
>  
> +#include <asm/barrier.h>
>  #include <asm/processor.h>
>  #include <asm/cpufeature.h>
>  #include <asm/special_insns.h>
> @@ -239,6 +240,78 @@ void flush_tlb_one_kernel(unsigned long addr);
>  void flush_tlb_multi(const struct cpumask *cpumask,
>  		     const struct flush_tlb_info *info);
>  
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +static inline bool is_dyn_asid(u16 asid)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		return true;
> +
> +	return asid < TLB_NR_DYN_ASIDS;
> +}

If possible, could you avoid double-defining the helpers that will
compile with and without CONFIG_X86_BROADCAST_TLB_FLUSH? Just put the
#ifdef around the ones that *need* it.

...
> +static inline bool in_asid_transition(struct mm_struct *mm)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		return false;
> +
> +	return mm && READ_ONCE(mm->context.asid_transition);
> +}
> +
> +static inline u16 mm_global_asid(struct mm_struct *mm)
> +{
> +	u16 asid;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		return 0;
> +
> +	asid = smp_load_acquire(&mm->context.global_asid);
> +
> +	/* mm->context.global_asid is either 0, or a global ASID */
> +	VM_WARN_ON_ONCE(asid && is_dyn_asid(asid));
> +
> +	return asid;
> +}

Yay, some kind of custom lock. :)

Could you give us a little idea what the locking rules are here and why
this needs the READ_ONCE() and smp_load_acquire()?

...

> +	/*
> +	 * TLB consistency for global ASIDs is maintained with broadcast TLB
> +	 * flushing. The TLB is never outdated, and does not need flushing.
> +	 */

This is another case where I think using the word "broadcast" is not
helping.

Here's the problem: INVLPGB is an "INVLPG" that's broadcast. So the
name INVLPGB is fine. INVLPGB is *a* way to broadcast INVLPG which is
*a* kind of TLB invalidation.

But, to me "broadcast TLB flushing" is a broad term. It arguably
includes INVLPGB and normal IPI-based flushing.

Just like the function naming in the earlier patch, I think we need a
better term here.

> +	if (static_cpu_has(X86_FEATURE_INVLPGB)) {
> +		u16 global_asid = mm_global_asid(next);
> +
> +		if (global_asid) {
> +			*new_asid = global_asid;
> +			*need_flush = false;
> +			return;
> +		}
> +	}

> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH

How cleanly could we throw this hunk in a new file? I always dislike
big #ifdefs like this. They seem like magnets for causing weird
compile problems.

> +/*
> + * Logic for broadcast TLB invalidation.
> + */
> +static DEFINE_RAW_SPINLOCK(global_asid_lock);

The global lock definitely needs some discussion in the changelog.
> +static u16 last_global_asid = MAX_ASID_AVAILABLE;
> +static DECLARE_BITMAP(global_asid_used, MAX_ASID_AVAILABLE) = { 0 };
> +static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE) = { 0 };

Isn't the initialization to all 0's superfluous for a global variable?

> +static int global_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
> +
> +static void reset_global_asid_space(void)
> +{
> +	lockdep_assert_held(&global_asid_lock);
> +
> +	/*
> +	 * A global TLB flush guarantees that any stale entries from
> +	 * previously freed global ASIDs get flushed from the TLB
> +	 * everywhere, making these global ASIDs safe to reuse.
> +	 */
> +	invlpgb_flush_all_nonglobals();

Ugh, my suggestion to use the term "global ASID" really doesn't work
here, does it?

Also, isn't an invlpgb_flush_all_nonglobals() _relatively_ slow? It
has to go out and talk over the fabric to every CPU, right? This is
also holding a global lock. That seems worrisome.

> +	/*
> +	 * Clear all the previously freed global ASIDs from the
> +	 * broadcast_asid_used bitmap, now that the global TLB flush
> +	 * has made them actually available for re-use.
> +	 */
> +	bitmap_andnot(global_asid_used, global_asid_used,
> +			global_asid_freed, MAX_ASID_AVAILABLE);
> +	bitmap_clear(global_asid_freed, 0, MAX_ASID_AVAILABLE);
> +
> +	/*
> +	 * ASIDs 0-TLB_NR_DYN_ASIDS are used for CPU-local ASID
> +	 * assignments, for tasks doing IPI based TLB shootdowns.
> +	 * Restart the search from the start of the global ASID space.
> +	 */
> +	last_global_asid = TLB_NR_DYN_ASIDS;
> +}
> +
> +static u16 get_global_asid(void)
> +{
> +
> +	u16 asid;
> +
> +	lockdep_assert_held(&global_asid_lock);
> +
> +	/* The previous allocated ASID is at the top of the address space. */
> +	if (last_global_asid >= MAX_ASID_AVAILABLE - 1)
> +		reset_global_asid_space();
> +
> +	asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, last_global_asid);
> +
> +	if (asid >= MAX_ASID_AVAILABLE) {
> +		/* This should never happen.
> +		 */
> +		VM_WARN_ONCE(1, "Unable to allocate global ASID despite %d available\n",
> +				global_asid_available);
> +		return 0;
> +	}
> +
> +	/* Claim this global ASID. */
> +	__set_bit(asid, global_asid_used);
> +	last_global_asid = asid;
> +	global_asid_available--;
> +	return asid;
> +}
> +
> +/*
> + * Returns true if the mm is transitioning from a CPU-local ASID to a global
> + * (INVLPGB) ASID, or the other way around.
> + */
> +static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> +	u16 global_asid = mm_global_asid(next);
> +
> +	/* Process is transitioning to a global ASID */
> +	if (global_asid && prev_asid != global_asid)
> +		return true;
> +
> +	/* Transition from global->local ASID does not currently happen. */
> +	if (!global_asid && is_global_asid(prev_asid))
> +		return true;
> +
> +	return false;
> +}

I'm going to throw in the towel at this point. This patch needs to get
broken up. It's more at once than my poor little brain can handle.

The _least_ it can do is introduce the stub functions and injection
into existing code changes, first. Then, in a second patch, introduce
the real implementation.

I also suspect that a big chunk of the ASID allocator could be broken
out and introduced separately.

Another example is broadcast_tlb_flush(). To reduce complexity in
_this_ patch, it could do something suboptimal like always do an
invlpgb_flush_all_nonglobals() regardless of the kind of flush it
gets. Then, in a later patch, you could optimize it.