On Mon, 2025-01-13 at 15:09 +0200, Nadav Amit wrote: > > > On 12 Jan 2025, at 17:53, Rik van Riel <riel@xxxxxxxxxxx> wrote: > > > > +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH > > + u16 global_asid; > > + bool asid_transition; > > As I later note, there are various ordering issues between the two. > Would it be > just easier to combine them into one field? I know everybody hates > bitfields so > I don’t suggest it, but there are other ways... It's certainly an option, but I think we figured out the ordering issues, so at this point documentation and readability of the code might be more important for future proofing? > > > @@ -170,6 +180,10 @@ static inline int init_new_context(struct > > task_struct *tsk, > > static inline void destroy_context(struct mm_struct *mm) > > { > > destroy_context_ldt(mm); > > +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH > > I’d prefer to use IS_ENABLED() and to have a stub for > destroy_context_free_global_asid(). > > > + if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) > > + destroy_context_free_global_asid(mm); > > +#endif > > } I'll think about how to do this cleaner. I would like to keep the cpu feature test in the inline function, so we don't do an unnecessary function call on systems without INVLPGB. > > > > +static inline bool in_asid_transition(const struct flush_tlb_info > > *info) > > +{ > > + if (!cpu_feature_enabled(X86_FEATURE_INVLPGB)) > > + return false; > > + > > + return info->mm && info->mm->context.asid_transition; > > READ_ONCE(context.asid_transition) ? Changed for the next version. > > > +static inline void broadcast_tlb_flush(struct flush_tlb_info > > *info) > > +{ > > Having a VM_WARN_ON() here might be nice. Added. Thank you. > > > +static u16 get_global_asid(void) > > +{ > > + lockdep_assert_held(&global_asid_lock); > > + > > + do { > > + u16 start = last_global_asid; > > + u16 asid = find_next_zero_bit(global_asid_used, > > MAX_ASID_AVAILABLE, start); > > + > > + if (asid >= MAX_ASID_AVAILABLE) { > > + reset_global_asid_space(); > > + continue; > > + } > > + > > + /* Claim this global ASID. */ > > + __set_bit(asid, global_asid_used); > > + last_global_asid = asid; > > + return asid; > > + } while (1); > > This does not make me feel easy at all. I do not understand > why it might happen. The caller should’ve already checked the global > ASID > is available under the lock. If it is not obvious from the code, > perhaps > refactoring is needed. > The caller checks whether we have a global ASID available, anywhere in the namespace. This code will claim a specific one. I guess the global_asid_available-- line could be moved into get_global_asid() to improve readability? > > +static bool mm_active_cpus_exceeds(struct mm_struct *mm, int > > threshold) > > +{ > > + int count = 0; > > + int cpu; > > + > > + /* This quick check should eliminate most single threaded > > programs. */ > > + if (cpumask_weight(mm_cpumask(mm)) <= threshold) > > + return false; > > + > > + /* Slower check to make sure. */ > > + for_each_cpu(cpu, mm_cpumask(mm)) { > > + /* Skip the CPUs that aren't really running this > > process. */ > > + if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm) > > + continue; > > Do you really want to make loaded_mm accessed from other cores? Does > this > really provide worthy benefit? > > Why not just use cpumask_weight() and be done with it? Anyhow it’s a > heuristic. We recently added some code to make mm_cpumask maintenance a lot lazier, which can result in more CPUs remaining in the bitmap while not running the mm. As for accessing loaded_mm from other CPUs, we already have to do that in finish_asid_transition. I don't see any good way around that, but I'm open to suggestions :) > > > + /* > > + * The transition from IPI TLB flushing, with a dynamic > > ASID, > > + * and broadcast TLB flushing, using a global ASID, uses > > memory > > + * ordering for synchronization. > > + * > > + * While the process has threads still using a dynamic > > ASID, > > + * TLB invalidation IPIs continue to get sent. > > + * > > + * This code sets asid_transition first, before assigning > > the > > + * global ASID. > > + * > > + * The TLB flush code will only verify the ASID transition > > + * after it has seen the new global ASID for the process. > > + */ > > + WRITE_ONCE(mm->context.asid_transition, true); > > I would prefer smp_wmb() and document where the matching smp_rmb() > (or smp_mb) is. > > > + WRITE_ONCE(mm->context.global_asid, get_global_asid()); > > + > > + global_asid_available--; > > +} > > I would be happy with either style of ordering. It's all a bit of a no-op anyway, because x86 does not do stores out of order. > > +static void finish_asid_transition(struct flush_tlb_info *info) > > +{ > > + struct mm_struct *mm = info->mm; > > + int bc_asid = mm_global_asid(mm); > > + int cpu; > > + > > + if (!READ_ONCE(mm->context.asid_transition)) > > + return; > > + > > + for_each_cpu(cpu, mm_cpumask(mm)) { > > + /* > > + * The remote CPU is context switching. Wait for > > that to > > + * finish, to catch the unlikely case of it > > switching to > > + * the target mm with an out of date ASID. > > + */ > > + while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, > > cpu)) == LOADED_MM_SWITCHING) > > + cpu_relax(); > > Although this code should rarely run, it seems bad for a couple of > reasons: > > 1. It is a new busy-wait in a very delicate place. Lockdep is blind > to this > change. This is true. However, if a CPU gets stuck in the middle of switch_mm_irqs_off, won't we have a bigger problem? > > 2. cpu_tlbstate is supposed to be private for each core - that’s why > there > is cpu_tlbstate_shared. But I really think loaded_mm should be > kept > private. > > Can't we just do one TLB shootdown if > cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids That conflicts with a future optimization of simply not maintaining the mm_cpumask at all for processes that use broadcast TLB invalidation. After all, once we no longer use the mm_cpumask for anything any more, why incur the overhead of maintaining it? I would like to understand more about the harm of reading loaded_mm. How is reading loaded_mm worse than reading other per-CPU variables that are written from the same code paths? > > > + /* All the CPUs running this process are using the global > > ASID. */ > > I guess it’s ordered with the flushes (the flushes must complete > first). > If there are any flushes. If all the CPUs we scanned are already using the global ASID, we do not need any additional ordering in here, since any CPUs switching to this mm afterward will be seeing the new global ASID. > > + WRITE_ONCE(mm->context.asid_transition, false); > > +} > > > > + } else do { > > + /* > > + * Calculate how many pages can be flushed at > > once; if the > > + * remainder of the range is less than one page, > > flush one. > > + */ > > + nr = min(maxnr, (info->end - addr) >> info- > > >stride_shift); > > + nr = max(nr, 1); > > + > > + invlpgb_flush_user_nr_nosync(kern_pcid(asid), > > addr, nr, pmd); > > + /* Do any CPUs supporting INVLPGB need PTI? */ > > + if (static_cpu_has(X86_FEATURE_PTI)) > > + invlpgb_flush_user_nr_nosync(user_pcid(asi > > d), addr, nr, pmd); > > + addr += nr << info->stride_shift; > > + } while (addr < info->end); > > I would have preferred for instead of while... Changed that for the next version. Thank you. > > @@ -1049,9 +1385,12 @@ void flush_tlb_mm_range(struct mm_struct > > *mm, unsigned long start, > > * a local TLB flush is needed. Optimize this use-case by > > calling > > * flush_tlb_func_local() directly in this case. > > */ > > - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) { > > I think an smp_rmb() here would communicate the fact > in_asid_transition() and > mm_global_asid() must be ordered. > > > + if (mm_global_asid(mm)) { > > + broadcast_tlb_flush(info); We have the barrier already, in the form of inc_mm_tlb_gen a few lines up. Or are you talking about a barrier (or READ_ONCE?) inside of mm_global_asid() to make sure the compiler cannot reorder things around mm_global_asid()? > > + } else if (cpumask_any_but(mm_cpumask(mm), cpu) < > > nr_cpu_ids) { > > info->trim_cpumask = should_trim_cpumask(mm); > > flush_tlb_multi(mm_cpumask(mm), info); > > + consider_global_asid(mm); > > } else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { > > lockdep_assert_irqs_enabled(); > > local_irq_disable(); > > -- > > 2.47.1 > > > -- All Rights Reversed.