On Mon, 2024-12-30 at 21:24 +0200, Nadav Amit wrote:
> 
> > --- a/arch/x86/include/asm/tlbflush.h
> > +++ b/arch/x86/include/asm/tlbflush.h
> > @@ -65,6 +65,23 @@ static inline void cr4_clear_bits(unsigned long mask)
> >   */
> >  #define TLB_NR_DYN_ASIDS	6
> >  
> > +#ifdef CONFIG_CPU_SUP_AMD
> > +#define is_dyn_asid(asid)	(asid) < TLB_NR_DYN_ASIDS
> 
> I don’t see a reason why those should be #define instead of inline
> functions. Arguably, those are better due to type-checking, etc.
> For instance is_dyn_asid() is missing brackets to be safe.

I've turned these all into inline functions for the next version.

> > @@ -225,6 +227,18 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> >  		return;
> >  	}
> >  
> > +	/*
> > +	 * TLB consistency for this ASID is maintained with INVLPGB;
> > +	 * TLB flushes happen even while the process isn't running.
> > +	 */
> > +#ifdef CONFIG_CPU_SUP_AMD
> 
> I’m pretty sure IS_ENABLED() can be used here.
> 
> > +	if (static_cpu_has(X86_FEATURE_INVLPGB) && mm_broadcast_asid(next)) {
> > +		*new_asid = mm_broadcast_asid(next);
> 
> Isn’t there a risk of a race changing broadcast_asid between the two
> reads?
> 
> Maybe use READ_ONCE() also since the value is modified asynchronously?

In the current code, the broadcast_asid only ever changes from zero to
a non-zero value, which continues to be used for the lifetime of the
process. It will not change from one non-zero value to another non-zero
value.

I have cleaned up the code a bit by having just one single read, though.

> > @@ -251,6 +265,245 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> >  	*need_flush = true;
> >  }
> >  
> > +#ifdef CONFIG_CPU_SUP_AMD
> > +/*
> > + * Logic for AMD INVLPGB support.
> > + */
> > +static DEFINE_RAW_SPINLOCK(broadcast_asid_lock);
> > +static u16 last_broadcast_asid = TLB_NR_DYN_ASIDS;
> > +static DECLARE_BITMAP(broadcast_asid_used, MAX_ASID_AVAILABLE) = { 0 };
> > +static LIST_HEAD(broadcast_asid_list);
> > +static int broadcast_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
> 
> Presumably some of these data structures are shared, and some are
> accessed frequently together. Wouldn’t it make more sense to put them
> inside a struct(s) and make it cacheline aligned?

Maybe? I'm not sure any of these are particularly frequently used, at
least not compared to things like TLB invalidations.

> > +		/* Try claiming this broadcast ASID. */
> > +		if (!test_and_set_bit(asid, broadcast_asid_used)) {
> 
> IIUC, broadcast_asid_used is always protected with broadcast_asid_lock.
> So why test_and_set_bit?

Thanks for spotting that one. I cleaned that up for the next version.

> > +void destroy_context_free_broadcast_asid(struct mm_struct *mm)
> > +{
> > +	if (!mm->context.broadcast_asid)
> 
> mm_broadcast_asid()?

I've intentionally kept this one in the same "shape" as the assignment
a few lines lower. That might be cleaner than reading it one way, and
then writing it another way.

> > +static void use_broadcast_asid(struct mm_struct *mm)
> > +{
> > +	guard(raw_spinlock_irqsave)(&broadcast_asid_lock);
> > +
> > +	/* This process is already using broadcast TLB invalidation. */
> > +	if (mm->context.broadcast_asid)
> > +		return;
> > +
> > +	mm->context.broadcast_asid = get_broadcast_asid();
> 
> This is read without the lock, so do you want WRITE_ONCE() here?
> 
> > +	mm->context.asid_transition = true;
> 
> And what about asid_transition? Presumably also need WRITE_ONCE().
> But more importantly than this theoretical compiler optimization, is
> there some assumed ordering with setting broadcast_asid?

I changed both to WRITE_ONCE.

Thinking about ordering, maybe we need to set asid_transition before
setting broadcast_asid? That way, when we see a non-zero broadcast ASID,
we are guaranteed to try finish_asid_transition() from
broadcast_tlb_flush().

Fixed for the next version.
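Concretely, I'm thinking of something along these lines for the next
version. This is only a sketch against the names in this patch; whether
ordering the two WRITE_ONCE()s is enough, or an explicit barrier is
worth adding for documentation, is something I still want to double
check:

static void use_broadcast_asid(struct mm_struct *mm)
{
	guard(raw_spinlock_irqsave)(&broadcast_asid_lock);

	/* This process is already using broadcast TLB invalidation. */
	if (mm->context.broadcast_asid)
		return;

	/*
	 * Set asid_transition before broadcast_asid, so anybody who
	 * observes a non-zero broadcast ASID also observes the
	 * transition flag and calls finish_asid_transition().
	 * x86 does not reorder stores, so ordering the WRITE_ONCE()s
	 * should be enough here.
	 */
	WRITE_ONCE(mm->context.asid_transition, true);
	WRITE_ONCE(mm->context.broadcast_asid, get_broadcast_asid());
}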
> > +	if (meets_broadcast_asid_threshold(mm))
> > +		use_broadcast_asid(mm);
> > +}
> 
> I don’t think count_tlb_flush() is a name that reflects what this
> function does.

Agreed. I've renamed it to the equally lame consider_broadcast_asid :)

> > +
> > +static void finish_asid_transition(struct flush_tlb_info *info)
> > +{
> > +	struct mm_struct *mm = info->mm;
> > +	int bc_asid = mm_broadcast_asid(mm);
> > +	int cpu;
> > +
> > +	if (!mm->context.asid_transition)
> 
> is_asid_transition()?

I'm not convinced that for a thing we access in so few places, having a
different "shape" for the read than we have for the write would make
the code easier to follow.

I'm open to arguments, though ;)

> > +	if (info->end == TLB_FLUSH_ALL) {
> > +		invlpgb_flush_single_pcid(kern_pcid(asid));
> > +		/* Do any CPUs supporting INVLPGB need PTI? */
> > +		if (static_cpu_has(X86_FEATURE_PTI))
> > +			invlpgb_flush_single_pcid(user_pcid(asid));
> > +	} else do {
> 
> I couldn’t find any use of “else do” in the kernel. Might it be
> confusing?

I could replace it with a goto finish_asid_transition. That's what I
had there before. I'm not sure it was better, though :/

> > +		/*
> > +		 * Calculate how many pages can be flushed at once; if the
> > +		 * remainder of the range is less than one page, flush one.
> > +		 */
> > +		nr = min(maxnr, (info->end - addr) >> info->stride_shift);
> > +		nr = max(nr, 1);
> > +
> > +		invlpgb_flush_user_nr(kern_pcid(asid), addr, nr, pmd);
> > +		/* Do any CPUs supporting INVLPGB need PTI? */
> > +		if (static_cpu_has(X86_FEATURE_PTI))
> > +			invlpgb_flush_user_nr(user_pcid(asid), addr, nr, pmd);
> > +		addr += nr << info->stride_shift;
> > +	} while (addr < info->end);
> > +
> > +	finish_asid_transition(info);
> > +
> > +	/* Wait for the INVLPGBs kicked off above to finish. */
> > +	tlbsync();
> > +}
> > 
> > @@ -556,8 +809,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> >  	 */
> >  	if (prev == next) {
> >  		/* Not actually switching mm's */
> > -		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> > -			   next->context.ctx_id);
> > +		if (is_dyn_asid(prev_asid))
> > +			VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> > +				   next->context.ctx_id);
> 
> Why not to add the condition into the VM_WARN_ON and avoid the nesting?

Done. Thank you.

> > @@ -1026,14 +1311,18 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> >  				bool freed_tables)
> >  {
> >  	struct flush_tlb_info *info;
> > +	unsigned long threshold = tlb_single_page_flush_ceiling;
> >  	u64 new_tlb_gen;
> >  	int cpu;
> >  
> > +	if (static_cpu_has(X86_FEATURE_INVLPGB))
> > +		threshold *= invlpgb_count_max;
> 
> I know it’s not really impacting performance, but it is hard for me
> to see such calculations happening unnecessarily every time...

Thanks for pointing out this code. We should only do the multiplication
if this mm is using broadcast TLB invalidation. Fixed for the next
version.

I left the multiplication itself for now, since that is smaller than
adding code to the debugfs tlb_single_page_flush_ceiling handler to
store a multiplied value in another variable.
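For reference, the check in flush_tlb_mm_range() would then look
roughly like this (sketch only; the exact condition may still shift as
the rest of the series changes):

	unsigned long threshold = tlb_single_page_flush_ceiling;

	/*
	 * Only scale the flush ceiling when this mm actually uses
	 * broadcast TLB invalidation; everybody else keeps using the
	 * plain tlb_single_page_flush_ceiling value.
	 */
	if (static_cpu_has(X86_FEATURE_INVLPGB) && mm_broadcast_asid(mm))
		threshold *= invlpgb_count_max;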
-- 
All Rights Reversed.