On Fri, 2025-01-03 at 18:36 +0100, Jann Horn wrote:
> 
> > +++ b/arch/x86/include/asm/mmu.h
> > @@ -48,6 +48,12 @@ typedef struct {
> >  	unsigned long flags;
> >  #endif
> > 
> > +#ifdef CONFIG_CPU_SUP_AMD
> > +	struct list_head broadcast_asid_list;
> > +	u16 broadcast_asid;
> > +	bool asid_transition;
> 
> Please add a comment on the semantics of the "asid_transition" field
> here after addressing the comments below.

Will do.

> 
> > +#endif
> > +
> >  #ifdef CONFIG_ADDRESS_MASKING
> >  	/* Active LAM mode:  X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */
> >  	unsigned long lam_cr3_mask;
> [...]
> > +#ifdef CONFIG_CPU_SUP_AMD
> > +/*
> > + * Logic for AMD INVLPGB support.
> > + */
> > +static DEFINE_RAW_SPINLOCK(broadcast_asid_lock);
> > +static u16 last_broadcast_asid = TLB_NR_DYN_ASIDS;
> 
> I wonder if this should be set to MAX_ASID_AVAILABLE or such to
> ensure that we do a flush before we start using the broadcast ASID
> space the first time... Or is there something else that already
> guarantees that all ASIDs of the TLB are flushed during kernel boot?

That is a good idea. I don't know whether the TLBs are always flushed
across a kexec, for example, and exercising the wraparound code early
on every boot will be a good self-test to future-proof the code.

I'll do that in the next version.

> 
> > +static u16 get_broadcast_asid(void)
> > +{
> > +	lockdep_assert_held(&broadcast_asid_lock);
> > +
> > +	do {
> > +		u16 start = last_broadcast_asid;
> > +		u16 asid = find_next_zero_bit(broadcast_asid_used, MAX_ASID_AVAILABLE, start);
> > +
> > +		if (asid >= MAX_ASID_AVAILABLE) {
> > +			reset_broadcast_asid_space();
> > +			continue;
> 
> Can this loop endlessly without making forward progress if we have a
> few thousand processes on the system that are multi-threaded (or used
> to be multi-threaded) and race the wrong way?
> meets_broadcast_asid_threshold() checks if we have free IDs remaining,
> but that check happens before broadcast_asid_lock is held, so we could
> theoretically race such that no free IDs are available, right?

You are right, I need to duplicate that check under the spinlock!

I'll get that done in the next version.

> 
> > +	mm->context.broadcast_asid = get_broadcast_asid();
> > +	mm->context.asid_transition = true;
> 
> This looks buggy to me:

Nadav found the same issue. I've fixed it locally already for the next
version.

> Maybe change how mm->context.asid_transition works such that it is
> immediately set on mm creation and cleared when the transition is
> done, so that you don't have to touch it here?
> 
If we want to document the ordering, won't it be better to keep both
assignments close to each other (with WRITE_ONCE), so the code stays
easier to understand for future maintenance?

> Also, please use at least WRITE_ONCE() for writes here, and add
> comments documenting ordering requirements.

I'll add a comment.

> 
> > +static void finish_asid_transition(struct flush_tlb_info *info)
> > +{
> > +	struct mm_struct *mm = info->mm;
> > +	int bc_asid = mm_broadcast_asid(mm);
> > +	int cpu;
> > +
> > +	if (!mm->context.asid_transition)
> 
> AFAIU this can be accessed concurrently - please use at least
> READ_ONCE(). (I think in the current version of the patch, this needs
> to be ordered against the preceding mm_broadcast_asid() read, but
> that's implicit on x86, so I guess writing a barrier here would be
> superfluous.)

I'll add a READ_ONCE here. Good point.
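Something like this at the top of finish_asid_transition(), reusing
only names that already exist in this patch (just a sketch, untested):

	/*
	 * asid_transition is written concurrently from the allocation
	 * path; on x86 this read is already ordered after the
	 * mm_broadcast_asid() read above.
	 */
	if (!READ_ONCE(mm->context.asid_transition))
		return;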
> 
> > +	return;
> > +
> > +	for_each_cpu(cpu, mm_cpumask(mm)) {
> > +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
> > +			continue;
> 
> switch_mm_irqs_off() picks an ASID and writes CR3 before writing
> loaded_mm:
> "/* Make sure we write CR3 before loaded_mm. */"
> 
> Can we race with a concurrent switch_mm_irqs_off() on the other CPU
> such that the other CPU has already switched CR3 to our MM using the
> old ASID, but has not yet written loaded_mm, such that we skip it
> here? And then we'll think we finished the ASID transition, and the
> next time we do a flush, we'll wrongly omit the flush for that other
> CPU even though it's still using the old ASID?

That is a very good question. I suppose we need to check against
LOADED_MM_SWITCHING too, and possibly wait to see what mm shows up on
that CPU before proceeding?

Maybe as simple as this?

	for_each_cpu(cpu, mm_cpumask(mm)) {
		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
			cpu_relax();

		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
			continue;

		/*
		 * If at least one CPU is not using the broadcast ASID yet,
		 * send a TLB flush IPI. The IPI should cause stragglers
		 * to transition soon.
		 */
		if (per_cpu(cpu_tlbstate.loaded_mm_asid, cpu) != bc_asid) {
			flush_tlb_multi(mm_cpumask(info->mm), info);
			return;
		}
	}

Then the only change needed to switch_mm_irqs_off() would be to move
the LOADED_MM_SWITCHING assignment to before choose_new_asid(), to
fully close the window.

Am I overlooking anything here?

> 
> > +
> > +		/*
> > +		 * If at least one CPU is not using the broadcast ASID yet,
> > +		 * send a TLB flush IPI. The IPI should cause stragglers
> > +		 * to transition soon.
> > +		 */
> > +		if (per_cpu(cpu_tlbstate.loaded_mm_asid, cpu) != bc_asid) {
> 
> READ_ONCE()? Also, I think this needs a comment explaining that this
> can race with concurrent MM switches such that we wrongly think that
> there's a straggler (because we're not reading the loaded_mm and the
> loaded_mm_asid as one atomic combination).

I'll add the READ_ONCE.

Will the race still exist if we wait on LOADED_MM_SWITCHING as
proposed above?

> 
> > +			flush_tlb_multi(mm_cpumask(info->mm), info);
> > +			return;
> > +		}
> > +	}
> > +
> > +	/* All the CPUs running this process are using the broadcast ASID. */
> > +	mm->context.asid_transition = 0;
> 
> WRITE_ONCE()?
> Also: This is a bool, please use "false".

Will do.

> 
> > +}
> > +
> > +static void broadcast_tlb_flush(struct flush_tlb_info *info)
> > +{
> > +	bool pmd = info->stride_shift == PMD_SHIFT;
> > +	unsigned long maxnr = invlpgb_count_max;
> > +	unsigned long asid = info->mm->context.broadcast_asid;
> > +	unsigned long addr = info->start;
> > +	unsigned long nr;
> > +
> > +	/* Flushing multiple pages at once is not supported with 1GB pages. */
> > +	if (info->stride_shift > PMD_SHIFT)
> > +		maxnr = 1;
> > +
> > +	if (info->end == TLB_FLUSH_ALL) {
> > +		invlpgb_flush_single_pcid(kern_pcid(asid));
> 
> What orders this flush with the preceding page table update? Does the
> instruction implicitly get ordered after preceding memory writes, or
> do we get that ordering from inc_mm_tlb_gen() or something like that?

I believe inc_mm_tlb_gen() should provide the ordering. You are right
that it should be documented.
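For the documentation, maybe a comment along these lines right above
the INVLPGB in broadcast_tlb_flush() (wording still to be refined in
the next version):

	/*
	 * The page tables were updated before flush_tlb_mm_range() was
	 * called, and inc_mm_tlb_gen() implies a full barrier, so those
	 * writes are ordered before the INVLPGB below.
	 */
	invlpgb_flush_single_pcid(kern_pcid(asid));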
> 
> > +	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
> > +	if (is_broadcast_asid(loaded_mm_asid))
> > +		return;
> 
> This relies on the mm_broadcast_asid() read in flush_tlb_mm_range()
> being ordered after the page table update, correct? And we get that
> required ordering from the inc_mm_tlb_gen(), which implies a full
> barrier? It might be nice if there were some more comments on this.

I will add some comments, and I hope you can review those in the next
series, because I'll no doubt forget to explain something important.

Thank you for the thorough review!

-- 
All Rights Reversed.