On Mon, 2025-01-20 at 16:02 +0200, Nadav Amit wrote:
> 
> On 20/01/2025 4:40, Rik van Riel wrote:
> 
> > +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
> > +{
> > +	VM_WARN_ON_ONCE(1);
> 
> Not sure why not use VM_WARN_ONCE() instead, with some more
> informative message (anyhow, a string is allocated for it).
> 
VM_WARN_ON_ONCE only takes a condition, not a message.

> > +static u16 get_global_asid(void)
> > +{
> > +	lockdep_assert_held(&global_asid_lock);
> > +
> > +	do {
> > +		u16 start = last_global_asid;
> > +		u16 asid = find_next_zero_bit(global_asid_used,
> > +					      MAX_ASID_AVAILABLE, start);
> > +
> > +		if (asid >= MAX_ASID_AVAILABLE) {
> > +			reset_global_asid_space();
> > +			continue;
> > +		}
> 
> I think that unless something is awfully wrong, you are supposed to
> call reset_global_asid_space() at most once. So if that's the case,
> why not do it that way?
> 
> Instead, you can get rid of the loop and just do:
> 
> 	asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> 
> If you want, you can warn if asid >= MAX_ASID_AVAILABLE and have some
> fallback. But the loop is just confusing, in my opinion, for no
> reason.

I can get rid of the loop. You're right that the code can just call
find_next_zero_bit() again after calling reset_global_asid_space().
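
Something like this, perhaps? Untested sketch; I'm assuming
reset_global_asid_space() rewinds last_global_asid, and that the
allocation bookkeeping at the end stays the same as in the loop
version:

static u16 get_global_asid(void)
{
	u16 asid;

	lockdep_assert_held(&global_asid_lock);

	/* The previous allocation may have used up the ASID space. */
	asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE,
				  last_global_asid);

	if (asid >= MAX_ASID_AVAILABLE) {
		/* Wrap around, and scan again from the bottom. */
		reset_global_asid_space();
		asid = find_next_zero_bit(global_asid_used,
					  MAX_ASID_AVAILABLE,
					  last_global_asid);

		/* Still nothing free? Every global ASID is in use. */
		VM_WARN_ON_ONCE(asid >= MAX_ASID_AVAILABLE);
	}

	/* Claim this global ASID. */
	__set_bit(asid, global_asid_used);
	last_global_asid = asid;

	return asid;
}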
> > +	/* Slower check to make sure. */
> > +	for_each_cpu(cpu, mm_cpumask(mm)) {
> > +		/* Skip the CPUs that aren't really running this process. */
> > +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> > +			continue;
> 
> Then perhaps at least add a comment next to loaded_mm, that it's not
> private per se, but rarely accessed by other cores?
> 
I don't see any comment in struct tlb_state that suggests it was ever
private to begin with. Which comment are you referring to that should
be edited?

> > +	/*
> > +	 * The transition from IPI TLB flushing, with a dynamic ASID,
> > +	 * to broadcast TLB flushing, using a global ASID, uses memory
> > +	 * ordering for synchronization.
> > +	 *
> > +	 * While the process has threads still using a dynamic ASID,
> > +	 * TLB invalidation IPIs continue to get sent.
> > +	 *
> > +	 * This code sets asid_transition first, before assigning the
> > +	 * global ASID.
> > +	 *
> > +	 * The TLB flush code will only verify the ASID transition
> > +	 * after it has seen the new global ASID for the process.
> > +	 */
> > +	WRITE_ONCE(mm->context.asid_transition, true);
> > +	WRITE_ONCE(mm->context.global_asid, get_global_asid());
> 
> I know it is likely correct in practice (due to the TSO memory
> model), but it is not clear, at least to me, how the order of those
> writes affects the rest of the code. I managed to figure out how it
> relates to the reads in flush_tlb_mm_range() and
> native_flush_tlb_multi(), but I wouldn't say it is trivial, and it is
> worth a comment (or smp_wmb/smp_rmb).

What kind of wording should we add here to make it easier to
understand?

"The TLB invalidation code reads these variables in the opposite
order in which they are written" ?
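
Maybe the clearest thing is to spell out both sides of the pairing.
As an illustration only (the barriers compile away on x86 because of
TSO, but they document which reads pair with which writes; the reader
side below paraphrases the flush paths, it is not the literal patch
code):

	/* Writer: flag the transition before publishing the ASID. */
	WRITE_ONCE(mm->context.asid_transition, true);
	smp_wmb();	/* pairs with the smp_rmb() below */
	WRITE_ONCE(mm->context.global_asid, get_global_asid());

	/* Reader: the flush paths read the fields in the opposite order. */
	u16 global_asid = READ_ONCE(mm->context.global_asid);
	smp_rmb();
	bool in_transition = READ_ONCE(mm->context.asid_transition);

A flusher that observes the new global_asid is then guaranteed to
also observe asid_transition == true, so TLB invalidation IPIs keep
getting sent until every CPU has switched over to the global ASID.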
> > +		/*
> > +		 * If at least one CPU is not using the global ASID yet,
> > +		 * send a TLB flush IPI. The IPI should cause stragglers
> > +		 * to transition soon.
> > +		 *
> > +		 * This can race with the CPU switching to another task;
> > +		 * that results in a (harmless) extra IPI.
> > +		 */
> > +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
> > +			flush_tlb_multi(mm_cpumask(info->mm), info);
> > +			return;
> 
> I am trying to figure out why we return here. The transition might
> not be over? Why is it "soon"? Wouldn't flush_tlb_func() reload it
> unconditionally?

The transition _should_ be over, but what if another CPU got an NMI
while in the middle of switch_mm_irqs_off, and set its own bit in the
mm_cpumask after we sent this IPI?

On the other hand, if it sets its mm_cpumask bit after this point, it
will also load mm->context.global_asid after this point, and should
definitely get the new ASID.

I think we are probably fine to set asid_transition to false here,
but I've had to tweak this code so much over the past months that I
don't feel super confident any more :)

> > +	/*
> > +	 * TLB flushes with INVLPGB are kicked off asynchronously.
> > +	 * The inc_mm_tlb_gen() guarantees page table updates are done
> > +	 * before these TLB flushes happen.
> > +	 */
> > +	if (info->end == TLB_FLUSH_ALL) {
> > +		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
> > +		/* Do any CPUs supporting INVLPGB need PTI? */
> > +		if (static_cpu_has(X86_FEATURE_PTI))
> > +			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
> > +	} else for (; addr < info->end; addr += nr << info->stride_shift) {
> 
> I guess I was wrong, and do-while was cleaner here.
> 
> And I guess this is now a bug, if info->stride_shift > PMD_SHIFT...

We set maxnr to 1 for larger stride shifts at the top of the
function:

	/* Flushing multiple pages at once is not supported with 1GB pages. */
	if (info->stride_shift > PMD_SHIFT)
		maxnr = 1;

> [ I guess the cleanest way was to change get_flush_tlb_info to mask
> the low bits of start and end based on ((1ull << stride_shift) - 1).
> But whatever... ]

I'll change it back :)

I'm just happy this code is getting lots of attention, and that we
are improving it over time.

> > @@ -573,6 +874,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> >  				   !cpumask_test_cpu(cpu, mm_cpumask(next))))
> >  			cpumask_set_cpu(cpu, mm_cpumask(next));
> >  
> > +		/*
> > +		 * Check if the current mm is transitioning to a new ASID.
> > +		 */
> > +		if (needs_global_asid_reload(next, prev_asid)) {
> > +			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> > +
> > +			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> > +			goto reload_tlb;
> 
> Not a fan of gotos when they are not really needed, and I don't
> think one is really needed here. Especially as the name of the tag
> "reload_tlb" does not really convey that the page tables are reloaded
> at that point.

In this particular case, the CPU continues running with the same page
tables, but with a different PCID.

> > +		}
> > +
> > +		/*
> > +		 * Broadcast TLB invalidation keeps this PCID up to date
> > +		 * all the time.
> > +		 */
> > +		if (is_global_asid(prev_asid))
> > +			return;
> 
> Hard for me to convince myself.

When a process uses a global ASID, we always send out TLB
invalidations using INVLPGB. The global ASID should always be up to
date.

> > @@ -769,6 +1092,16 @@ static void flush_tlb_func(void *info)
> >  	if (unlikely(loaded_mm == &init_mm))
> >  		return;
> >  
> > +	/* Reload the ASID if transitioning into or out of a global ASID */
> > +	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
> > +		switch_mm_irqs_off(NULL, loaded_mm, NULL);
> 
> I understand you want to reuse that logic, but it doesn't seem
> reasonable to me. It both doesn't convey what you want to do, and it
> can lead to undesired operations - cpu_tlbstate_update_lam(), for
> instance. Probably the impact on performance is minor, but it is an
> opening for future mistakes.

My worry with having a separate code path here is that the separate
code path could bit rot, and we could introduce bugs that way.

I would rather have a tiny performance impact in what is a rare code
path than a rare (and hard to track down) memory corruption due to
bit rot.

> > +		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> > +	}
> > +
> > +	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
> > +	if (is_global_asid(loaded_mm_asid))
> > +		return;
> 
> The comment does not make it clear to me, and I cannot clearly
> explain to myself, why it is guaranteed that all the IPI TLB flushes
> which were potentially issued before the transition are not needed.

IPI TLB flushes that were issued before the transition went to the
CPUs while they were still using dynamic ASIDs (numbers 1-5).

Reloading the TLB with a different PCID, even one pointed at the same
page tables, means that the TLB should load the translations fresh
from the page tables, and not reuse any that it had previously loaded
under a different PCID.

-- 
All Rights Reversed.