Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes

On Mon, 2025-01-20 at 16:02 +0200, Nadav Amit wrote:
> 
> 
> On 20/01/2025 4:40, Rik van Riel wrote:
> > 
> > +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
> > +{
> > +	VM_WARN_ON_ONCE(1);
> 
> Not sure why not use VM_WARN_ONCE() instead, with some more
> informative message (anyhow, a string is allocated for it).
> 
VM_WARN_ON_ONCE only has a condition, not a message.
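
For reference, the VM_WARN_ONCE() variant would look something like
this (untested sketch; the message text is made up):

	static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
	{
		/* Should never be called on CPUs without INVLPGB. */
		VM_WARN_ONCE(1, "%s called on CPU without INVLPGB\n", __func__);
	}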

> > 
> > +static u16 get_global_asid(void)
> > +{
> > +	lockdep_assert_held(&global_asid_lock);
> > +
> > +	do {
> > +		u16 start = last_global_asid;
> > +		u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> > +
> > +		if (asid >= MAX_ASID_AVAILABLE) {
> > +			reset_global_asid_space();
> > +			continue;
> > +		}
> 
> I think that unless something is awfully wrong, you are supposed to
> call reset_global_asid_space() at most once. So if that's the case,
> why not do it this way?
> 
> Instead, you can get rid of the loop and just do:
> 
> 	asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> 
> If you want, you can warn if asid >= MAX_ASID_AVAILABLE and have some
> fallback. But the loop is just confusing, in my opinion, for no
> reason.

I can get rid of the loop. You're right that the code
can just call find_next_zero_bit after calling
reset_global_asid_space.
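
Something like this, probably (untested sketch; the bookkeeping that
marks the ASID used and updates last_global_asid is left out):

	static u16 get_global_asid(void)
	{
		u16 asid;

		lockdep_assert_held(&global_asid_lock);

		asid = find_next_zero_bit(global_asid_used,
					  MAX_ASID_AVAILABLE, last_global_asid);

		if (asid >= MAX_ASID_AVAILABLE) {
			/* Reclaim freed global ASIDs and retry once. */
			reset_global_asid_space();
			asid = find_next_zero_bit(global_asid_used,
						  MAX_ASID_AVAILABLE,
						  last_global_asid);
			/* Running out twice in a row would be a bug. */
			VM_WARN_ON_ONCE(asid >= MAX_ASID_AVAILABLE);
		}

		return asid;
	}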

> 
> > +	/* Slower check to make sure. */
> > +	for_each_cpu(cpu, mm_cpumask(mm)) {
> > +		/* Skip the CPUs that aren't really running this process. */
> > +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> > +			continue;
> 
> Then perhaps at least add a comment next to loaded_mm, that it's not
> private per se, but rarely accessed by other cores?
> 
I don't see any comment in struct tlb_state that
suggests it was ever private to begin with.

Which comment are you referring to that should
be edited?
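
If the idea is to add a new one, maybe something along these lines
next to the field (suggested wording only):

	struct tlb_state {
		/*
		 * loaded_mm is almost always accessed by the owning
		 * CPU, but the broadcast ASID code occasionally reads
		 * it from other CPUs to find stragglers.
		 */
		struct mm_struct *loaded_mm;
		...
	};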

> > 
> > +
> > +	/*
> > +	 * The transition from IPI TLB flushing, with a dynamic ASID,
> > +	 * and broadcast TLB flushing, using a global ASID, uses
> > +	 * memory ordering for synchronization.
> > +	 *
> > +	 * While the process has threads still using a dynamic ASID,
> > +	 * TLB invalidation IPIs continue to get sent.
> > +	 *
> > +	 * This code sets asid_transition first, before assigning the
> > +	 * global ASID.
> > +	 *
> > +	 * The TLB flush code will only verify the ASID transition
> > +	 * after it has seen the new global ASID for the process.
> > +	 */
> > +	WRITE_ONCE(mm->context.asid_transition, true);
> > +	WRITE_ONCE(mm->context.global_asid, get_global_asid());
> 
> I know it is likely correct in practice (due to the TSO memory
> model), but it is not clear, at least to me, how this write order
> affects the rest of the code. I managed to figure out how it relates
> to the reads in flush_tlb_mm_range() and native_flush_tlb_multi(),
> but I wouldn't say it is trivial, and it is worth a comment (or
> smp_wmb()/smp_rmb()).
> 

What kind of wording should we add here to make it
easier to understand?

"The TLB invalidation code reads these variables in
 the opposite order in which they are written" ?
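
In code terms, the pairing looks something like this (a condensed
sketch, not the exact code; the reader side is boiled down from the
flush path):

	/* Writer (this function): transition flag first, then the ASID. */
	WRITE_ONCE(mm->context.asid_transition, true);
	WRITE_ONCE(mm->context.global_asid, get_global_asid());

	/* Reader (flush path): ASID first, then the transition flag. */
	u16 asid = READ_ONCE(mm->context.global_asid);

	if (asid && READ_ONCE(mm->context.asid_transition)) {
		/* Not everyone has the global ASID yet; keep sending IPIs. */
	}

On x86 the WRITE_ONCE()/READ_ONCE() pairs are enough because of TSO;
on a weaker memory model this would want smp_wmb()/smp_rmb().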


> > +		/*
> > +		 * If at least one CPU is not using the global ASID yet,
> > +		 * send a TLB flush IPI. The IPI should cause stragglers
> > +		 * to transition soon.
> > +		 *
> > +		 * This can race with the CPU switching to another task;
> > +		 * that results in a (harmless) extra IPI.
> > +		 */
> > +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
> > +			flush_tlb_multi(mm_cpumask(info->mm), info);
> > +			return;
> 
> I am trying to figure out why we return here. The transition might
> not be over? Why is it "soon"? Wouldn't flush_tlb_func() reload it
> unconditionally?

The transition _should_ be over, but what if another
CPU got an NMI while in the middle of switch_mm_irqs_off,
and set its own bit in the mm_cpumask after we send this
IPI?

On the other hand, if it sets its mm_cpumask bit after
this point, it will also load the mm->context.global_asid
after this point, and should definitely get the new ASID.

I think we are probably fine to set asid_transition to
false here, but I've had to tweak this code so much over
the past months that I don't feel super confident any more :)

> 
> > +	/*
> > +	 * TLB flushes with INVLPGB are kicked off asynchronously.
> > +	 * The inc_mm_tlb_gen() guarantees page table updates are done
> > +	 * before these TLB flushes happen.
> > +	 */
> > +	if (info->end == TLB_FLUSH_ALL) {
> > +		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
> > +		/* Do any CPUs supporting INVLPGB need PTI? */
> > +		if (static_cpu_has(X86_FEATURE_PTI))
> > +			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
> > +	} else for (; addr < info->end; addr += nr << info->stride_shift) {
> 
> I guess I was wrong, and do-while was cleaner here.
> 
> And I guess this is now a bug, if info->stride_shift > PMD_SHIFT...
> 
We set maxnr to 1 for larger stride shifts at the top of the function:

        /* Flushing multiple pages at once is not supported with 1GB pages. */
        if (info->stride_shift > PMD_SHIFT)
                maxnr = 1;
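
With that clamp in place, the do-while version would look something
like this (untested; the actual flush call is elided):

	do {
		unsigned long nr = 1;

		if (info->stride_shift <= PMD_SHIFT)
			nr = min(maxnr, (info->end - addr) >> info->stride_shift);

		/* ... issue the ranged INVLPGB for nr pages here ... */

		addr += nr << info->stride_shift;
	} while (addr < info->end);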

> [ I guess the cleanest way would be to change get_flush_tlb_info to
> mask the low bits of start and end based on ((1ull << stride_shift) -
> 1). But whatever... ]

I'll change it back :)
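
Presumably something along these lines in get_flush_tlb_info()
(untested):

	/* Round the flush range out to stride_shift sized units. */
	start = round_down(start, 1ULL << stride_shift);
	end = round_up(end, 1ULL << stride_shift);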

I'm just happy this code is getting lots of attention,
and we're improving it with time.


> > @@ -573,6 +874,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> >   				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
> >   			cpumask_set_cpu(cpu, mm_cpumask(next));
> >   
> > +		/*
> > +		 * Check if the current mm is transitioning to a new ASID.
> > +		 */
> > +		if (needs_global_asid_reload(next, prev_asid)) {
> > +			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> > +
> > +			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> > +			goto reload_tlb;
> 
> Not a fan of gotos when they are not really needed, and I don't
> think one is really needed here. Especially since the name of the
> tag "reload_tlb" does not really convey that the page tables are
> reloaded at that point.

In this particular case, the CPU continues running with
the same page tables, but with a different PCID.

> 
> > +		}
> > +
> > +		/*
> > +		 * Broadcast TLB invalidation keeps this PCID up to date
> > +		 * all the time.
> > +		 */
> > +		if (is_global_asid(prev_asid))
> > +			return;
> 
> Hard for me to convince myself

When a process uses a global ASID, we always send
out TLB invalidations using INVLPGB.

The global ASID should always be up to date.
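
That is, the flush path effectively turns into (sketch; condensed
from flush_tlb_mm_range() in this series):

	if (mm_global_asid(mm)) {
		/* Every CPU running this mm uses the global ASID. */
		broadcast_tlb_flush(info);
	} else {
		/* Dynamic ASIDs still in use; IPI the relevant CPUs. */
		flush_tlb_multi(mm_cpumask(mm), info);
	}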

> 
> > @@ -769,6 +1092,16 @@ static void flush_tlb_func(void *info)
> >   	if (unlikely(loaded_mm == &init_mm))
> >   		return;
> >   
> > +	/* Reload the ASID if transitioning into or out of a global ASID */
> > +	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
> > +		switch_mm_irqs_off(NULL, loaded_mm, NULL);
> 
> I understand you want to reuse that logic, but it doesn't seem
> reasonable to me. It both doesn't convey what you want to do, and can
> lead to undesired operations - cpu_tlbstate_update_lam(), for
> instance. Probably the impact on performance is minor, but it is an
> opening for future mistakes.

My worry with having a separate code path here is
that the separate code path could bit rot, and we
could introduce bugs that way.

I would rather have a tiny performance impact in
what is a rare code path, than a rare (and hard
to track down) memory corruption due to bit rot.


> 
> > +		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> > +	}
> > +
> > +	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
> > +	if (is_global_asid(loaded_mm_asid))
> > +		return;
> 
> The comment does not make it clear to me, and I can't quite explain
> to myself, why it is guaranteed that all the IPI TLB flushes that
> were potentially issued before the transition are not needed.
> 
IPI TLB flushes that were issued before the transition went
to the CPUs while they were still using dynamic ASIDs (numbers 1-5).

Reloading the TLB with a different PCID, even one pointed at the
same page tables, means the TLB has to load translations fresh
from the page tables, and cannot reuse any that it previously
loaded under a different PCID.


-- 
All Rights Reversed.




