> On 20 Jan 2025, at 18:09, Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
> On Mon, 2025-01-20 at 16:02 +0200, Nadav Amit wrote:
>>
>>
>> On 20/01/2025 4:40, Rik van Riel wrote:
>>>
>>> +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
>>> +{
>>> +	VM_WARN_ON_ONCE(1);
>>
>> Not sure why not to use VM_WARN_ONCE() instead, with a more
>> informative message (a string is allocated for it anyhow).
>>
> VM_WARN_ON_ONCE only has a condition, not a message.

Right, my bad.

>
>>> +	/* Slower check to make sure. */
>>> +	for_each_cpu(cpu, mm_cpumask(mm)) {
>>> +		/* Skip the CPUs that aren't really running this process. */
>>> +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
>>> +			continue;
>>
>> Then perhaps at least add a comment next to loaded_mm, that it's not
>> private per se, but rarely accessed by other cores?
>>
> I don't see any comment in struct tlb_state that
> suggests it was ever private to begin with.
>
> Which comment are you referring to that should
> be edited?

You can see there is a tlb_state_shared, so one assumes tlb_state is
private... (at least that was my intention when separating them).

>
>>>
>>> +	WRITE_ONCE(mm->context.asid_transition, true);
>>> +	WRITE_ONCE(mm->context.global_asid, get_global_asid());
>>
>> I know it is likely correct in practice (due to the TSO memory model),
>> but it is not clear, at least to me, how the order of these writes
>> affects the rest of the code. I managed to figure out how it relates
>> to the reads in flush_tlb_mm_range() and native_flush_tlb_multi(), but
>> I wouldn't say it is so trivial that it doesn't deserve a comment (or
>> smp_wmb()/smp_rmb()).
>>
>
> What kind of wording should we add here to make it
> easier to understand?
>
> "The TLB invalidation code reads these variables in
> the opposite order in which they are written" ?

Usually in such cases, you make a reference to wherever there are
readers that rely on the ordering. This is how documenting
smp_wmb()/smp_rmb() ordering is usually done (see the sketch at the end
of this mail for roughly what I have in mind).

>
>
>>>
>>> +	/*
>>> +	 * If at least one CPU is not using the global ASID yet,
>>> +	 * send a TLB flush IPI. The IPI should cause stragglers
>>> +	 * to transition soon.
>>> +	 *
>>> +	 * This can race with the CPU switching to another task;
>>> +	 * that results in a (harmless) extra IPI.
>>> +	 */
>>> +	if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
>>> +		flush_tlb_multi(mm_cpumask(info->mm), info);
>>> +		return;
>>> +	}
>>
>> I am trying to figure out why we return here. The transition might not
>> be over? Why is it "soon"? Wouldn't flush_tlb_func() reload it
>> unconditionally?
>
> The transition _should_ be over, but what if another
> CPU got an NMI while in the middle of switch_mm_irqs_off,
> and set its own bit in the mm_cpumask after we send this
> IPI?
>
> On the other hand, if it sets its mm_cpumask bit after
> this point, it will also load the mm->context.global_asid
> after this point, and should definitely get the new ASID.
>
> I think we are probably fine to set asid_transition to
> false here, but I've had to tweak this code so much over
> the past months that I don't feel super confident any more :)

I fully relate, but I am not sure that is so great. The problem is that
nobody will have the guts to change that code later...

>>
>> And I guess this is now a bug, if info->stride_shift > PMD_SHIFT...
>>
> We set maxnr to 1 for larger stride shifts at the top of the function:
>

You're right, all safe.
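(For completeness, the clamp I overlooked is presumably something along
these lines -- paraphrasing from memory, so treat the exact names as
approximate rather than as the patch itself:

	unsigned long maxnr = invlpgb_count_max;

	/* Strides above PMD size are flushed one entry at a time. */
	if (info->stride_shift > PMD_SHIFT)
		maxnr = 1;

which is what keeps the stride_shift > PMD_SHIFT case safe.)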
>
>>> +		goto reload_tlb;
>>
>> Not a fan of the goto's when they are not really needed, and I don't
>> think one is really needed here. Especially since the name of the tag
>> "reload_tlb" does not really convey that the page tables are reloaded
>> at that point.
>
> In this particular case, the CPU continues running with
> the same page tables, but with a different PCID.

I understand it is "reload_tlb" from your point of view, or from the
point of view of the code that does the "goto", but if I showed you the
code that follows the "reload_tlb" label, I'm not sure you'd know it is
so.

[ snip, taking your valid points ]

>>
>>> +		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
>>> +	}
>>> +
>>> +	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
>>> +	if (is_global_asid(loaded_mm_asid))
>>> +		return;
>>
>> The comment does not make it clear to me, and I cannot clearly explain
>> to myself, why it is guaranteed that all the IPI TLB flushes that were
>> potentially issued before the transition are not needed.
>>
> IPI TLB flushes that were issued before the transition went
> to the CPUs when they were using dynamic ASIDs (numbers 1-5).
>
> Reloading the TLB with a different PCID, even pointed at the
> same page tables, means that the TLB should load the
> translations fresh from the page tables, and not re-use any
> that it had previously loaded under a different PCID.
>

What about this scenario, for instance?

CPU0                        CPU1                      CPU2
----                        ----                      ----
(1) use_global_asid(mm):
    mm->context.asid_trans = T;
    mm->context.global_asid = G;

                            (2) switch_mm(..., next=mm):
                                observes global_asid = G
                                => loads CR3 with PCID=G
                                => fills TLB under G.
                                TLB caches PTE[G, V] = P
                                (for some reason)

                                                      (3) flush_tlb_mm_range(mm):
                                                          sees global_asid == 0
                                                          (stale/old value)
                                                          => flush_tlb_multi()
                                                          => IPI flush for
                                                             dyn. ASIDs

                            (4) IPI arrives on CPU1:
                                flush_tlb_func(...):
                                is_global_asid(G)? yes,
                                skip invalidate; broadcast
                                flush assumed to cover it.

                                                      (5) IPI completes on CPU2:
                                                          dyn. ASIDs are flushed,
                                                          but CPU1's global ASID
                                                          was never invalidated!

                            (6) CPU1 uses stale TLB entries
                                under ASID G. The TLB continues
                                to use PTE[G, V] = P, as it was
                                not invalidated.
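And going back to the write-ordering discussion earlier in this mail,
here is roughly the kind of annotation I had in mind -- an untested
sketch, with the variable names and call sites approximated rather than
taken from the patch:

Write side (use_global_asid()):

	/*
	 * A reader that observes the new global_asid must also observe
	 * asid_transition == true; pairs with the smp_rmb() in the
	 * flush path.
	 */
	WRITE_ONCE(mm->context.asid_transition, true);
	smp_wmb();
	WRITE_ONCE(mm->context.global_asid, get_global_asid());

Read side (the flush path):

	u16 global_asid = READ_ONCE(mm->context.global_asid);

	/* Pairs with the smp_wmb() in use_global_asid(). */
	smp_rmb();

	if (global_asid && READ_ONCE(mm->context.asid_transition)) {
		/* finish moving every CPU over to the global ASID */
		...
	}

The point is just that each barrier comment names the site it pairs
with, so whoever changes one side knows where to look for the other.
On x86 smp_wmb()/smp_rmb() compile down to barrier() anyway, so this
mostly costs documentation, not cycles.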