> On 20 Jan 2025, at 18:09, Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
> On Mon, 2025-01-20 at 16:02 +0200, Nadav Amit wrote:
>>
>>
>> On 20/01/2025 4:40, Rik van Riel wrote:
>>>
>>> +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
>>> +{
>>> +	VM_WARN_ON_ONCE(1);
>>
>> Not sure why not to use VM_WARN_ONCE() instead, with a more
>> informative message (a string is allocated for it anyhow).
>>
> VM_WARN_ON_ONCE only has a condition, not a message.

Right, my bad.

>
>>> +	/* Slower check to make sure. */
>>> +	for_each_cpu(cpu, mm_cpumask(mm)) {
>>> +		/* Skip the CPUs that aren't really running this process. */
>>> +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
>>> +			continue;
>>
>> Then perhaps at least add a comment next to loaded_mm, that it's not
>> private per se, but rarely accessed by other cores?
>>
> I don't see any comment in struct tlb_state that
> suggests it was ever private to begin with.
>
> Which comment are you referring to that should
> be edited?

You can see there is a tlb_state_shared, so one assumes tlb_state is
private... (at least that was my intention when separating them).

>
>>>
>>> +	WRITE_ONCE(mm->context.asid_transition, true);
>>> +	WRITE_ONCE(mm->context.global_asid, get_global_asid());
>>
>> I know it is likely correct in practice (due to the TSO memory model),
>> but it is not clear, at least to me, how the order of these writes
>> affects the rest of the code. I managed to figure out how it relates
>> to the reads in flush_tlb_mm_range() and native_flush_tlb_multi(), but
>> I wouldn't say it is so trivial that it doesn't deserve a comment (or
>> smp_wmb()/smp_rmb()).
>>
>
> What kind of wording should we add here to make it
> easier to understand?
>
> "The TLB invalidation code reads these variables in
> the opposite order in which they are written" ?

Usually in such cases, you make a reference to wherever there are
readers that rely on the ordering. This is how documenting
smp_wmb()/smp_rmb() ordering is usually done (see the sketch at the end
of this mail for roughly what I have in mind).

>
>
>>>
>>> +	/*
>>> +	 * If at least one CPU is not using the global ASID yet,
>>> +	 * send a TLB flush IPI. The IPI should cause stragglers
>>> +	 * to transition soon.
>>> +	 *
>>> +	 * This can race with the CPU switching to another task;
>>> +	 * that results in a (harmless) extra IPI.
>>> +	 */
>>> +	if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
>>> +		flush_tlb_multi(mm_cpumask(info->mm), info);
>>> +		return;
>>> +	}
>>
>> I am trying to figure out why we return here. The transition might not
>> be over? Why is it "soon"? Wouldn't flush_tlb_func() reload it
>> unconditionally?
>
> The transition _should_ be over, but what if another
> CPU got an NMI while in the middle of switch_mm_irqs_off,
> and set its own bit in the mm_cpumask after we send this
> IPI?
>
> On the other hand, if it sets its mm_cpumask bit after
> this point, it will also load the mm->context.global_asid
> after this point, and should definitely get the new ASID.
>
> I think we are probably fine to set asid_transition to
> false here, but I've had to tweak this code so much over
> the past months that I don't feel super confident any more :)

I fully relate, but I am not sure that is so great. The problem is that
nobody will have the guts to change that code later...

>>
>> And I guess this is now a bug, if info->stride_shift > PMD_SHIFT...
>>
> We set maxnr to 1 for larger stride shifts at the top of the function:
>

You're right, all safe.
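(For completeness, the clamp I overlooked is presumably something along
these lines -- paraphrasing from memory, so treat the exact names as
approximate rather than as the patch itself:

	unsigned long maxnr = invlpgb_count_max;

	/* Strides above PMD size are flushed one entry at a time. */
	if (info->stride_shift > PMD_SHIFT)
		maxnr = 1;

which is what keeps the stride_shift > PMD_SHIFT case safe.)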
>
>>> +		goto reload_tlb;
>>
>> Not a fan of the goto's when they are not really needed, and I don't
>> think one is really needed here. Especially since the name of the tag
>> "reload_tlb" does not really convey that the page tables are reloaded
>> at that point.
>
> In this particular case, the CPU continues running with
> the same page tables, but with a different PCID.

I understand it is "reload_tlb" from your point of view, or from the
point of view of the code that does the "goto", but if I showed you the
code that follows the "reload_tlb" label, I'm not sure you'd know it is
so.

[ snip, taking your valid points ]

>>
>>> +		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
>>> +	}
>>> +
>>> +	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
>>> +	if (is_global_asid(loaded_mm_asid))
>>> +		return;
>>
>> The comment does not make it clear to me, and I cannot clearly explain
>> to myself, why it is guaranteed that all the IPI TLB flushes that were
>> potentially issued before the transition are not needed.
>>
> IPI TLB flushes that were issued before the transition went
> to the CPUs when they were using dynamic ASIDs (numbers 1-5).
>
> Reloading the TLB with a different PCID, even pointed at the
> same page tables, means that the TLB should load the
> translations fresh from the page tables, and not re-use any
> that it had previously loaded under a different PCID.
>

What about this scenario, for instance?

CPU0                        CPU1                      CPU2
----                        ----                      ----
(1) use_global_asid(mm):
    mm->context.asid_trans = T;
    mm->context.global_asid = G;

                            (2) switch_mm(..., next=mm):
                                observes global_asid = G
                                => loads CR3 with PCID=G
                                => fills TLB under G.
                                TLB caches PTE[G, V] = P
                                (for some reason)

                                                      (3) flush_tlb_mm_range(mm):
                                                          sees global_asid == 0
                                                          (stale/old value)
                                                          => flush_tlb_multi()
                                                          => IPI flush for
                                                             dyn. ASIDs

                            (4) IPI arrives on CPU1:
                                flush_tlb_func(...):
                                is_global_asid(G)? yes,
                                skip invalidate; broadcast
                                flush assumed to cover it.

                                                      (5) IPI completes on CPU2:
                                                          dyn. ASIDs are flushed,
                                                          but CPU1's global ASID
                                                          was never invalidated!

                            (6) CPU1 uses stale TLB entries
                                under ASID G. The TLB continues
                                to use PTE[G, V] = P, as it was
                                not invalidated.
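And going back to the write-ordering discussion earlier in this mail,
here is roughly the kind of annotation I had in mind -- an untested
sketch, with the variable names and call sites approximated rather than
taken from the patch:

Write side (use_global_asid()):

	/*
	 * A reader that observes the new global_asid must also observe
	 * asid_transition == true; pairs with the smp_rmb() in the
	 * flush path.
	 */
	WRITE_ONCE(mm->context.asid_transition, true);
	smp_wmb();
	WRITE_ONCE(mm->context.global_asid, get_global_asid());

Read side (the flush path):

	u16 global_asid = READ_ONCE(mm->context.global_asid);

	/* Pairs with the smp_wmb() in use_global_asid(). */
	smp_rmb();

	if (global_asid && READ_ONCE(mm->context.asid_transition)) {
		/* finish moving every CPU over to the global ASID */
		...
	}

The point is just that each barrier comment names the site it pairs
with, so whoever changes one side knows where to look for the other.
On x86 smp_wmb()/smp_rmb() compile down to barrier() anyway, so this
mostly costs documentation, not cycles.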