Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes

On Mon, Dec 30, 2024 at 6:53 PM Rik van Riel <riel@xxxxxxxxxxx> wrote:
> Use broadcast TLB invalidation, using the INVLPGB instruction, on AMD EPYC 3
> and newer CPUs.
>
> In order to not exhaust PCID space, and keep TLB flushes local for single
> threaded processes, we only hand out broadcast ASIDs to processes active on
> 3 or more CPUs, and gradually increase the threshold as broadcast ASID space
> is depleted.
[...]
> ---
>  arch/x86/include/asm/mmu.h         |   6 +
>  arch/x86/include/asm/mmu_context.h |  12 ++
>  arch/x86/include/asm/tlbflush.h    |  17 ++
>  arch/x86/mm/tlb.c                  | 310 ++++++++++++++++++++++++++++-
>  4 files changed, 336 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
> index 3b496cdcb74b..a8e8dfa5a520 100644
> --- a/arch/x86/include/asm/mmu.h
> +++ b/arch/x86/include/asm/mmu.h
> @@ -48,6 +48,12 @@ typedef struct {
>         unsigned long flags;
>  #endif
>
> +#ifdef CONFIG_CPU_SUP_AMD
> +       struct list_head broadcast_asid_list;
> +       u16 broadcast_asid;
> +       bool asid_transition;

Please add a comment on the semantics of the "asid_transition" field
here after addressing the comments below.
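
Something along these lines is what I have in mind (just a sketch of
the semantics as I currently understand them from this patch, so
please correct it where I'm misreading):

	/*
	 * broadcast_asid: process-wide ASID used for INVLPGB flushes,
	 * or 0 if this mm still only uses per-CPU dynamic ASIDs.
	 * asid_transition: set while some CPUs running this mm may still
	 * be using a per-CPU dynamic ASID instead of broadcast_asid, so
	 * flushes must also reach those stragglers.
	 */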

> +#endif
> +
>  #ifdef CONFIG_ADDRESS_MASKING
>         /* Active LAM mode:  X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */
>         unsigned long lam_cr3_mask;
[...]
> +#ifdef CONFIG_CPU_SUP_AMD
> +/*
> + * Logic for AMD INVLPGB support.
> + */
> +static DEFINE_RAW_SPINLOCK(broadcast_asid_lock);
> +static u16 last_broadcast_asid = TLB_NR_DYN_ASIDS;

I wonder if this should be set to MAX_ASID_AVAILABLE or such to ensure
that we do a flush before we start using the broadcast ASID space the
first time... Or is there something else that already guarantees that
all ASIDs of the TLB are flushed during kernel boot?
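
In other words, something like this (untested, just to illustrate the
idea):

	static u16 last_broadcast_asid = MAX_ASID_AVAILABLE;

so that the very first get_broadcast_asid() call falls through to
reset_broadcast_asid_space() and flushes before the first broadcast
ASID is handed out.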

> +static DECLARE_BITMAP(broadcast_asid_used, MAX_ASID_AVAILABLE) = { 0 };
> +static LIST_HEAD(broadcast_asid_list);
> +static int broadcast_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
> +
> +static void reset_broadcast_asid_space(void)
> +{
> +       mm_context_t *context;
> +
> +       lockdep_assert_held(&broadcast_asid_lock);
> +
> +       /*
> +        * Flush once when we wrap around the ASID space, so we won't need
> +        * to flush every time we allocate an ASID for boradcast flushing.

nit: typoed "broadcast"

> +        */
> +       invlpgb_flush_all_nonglobals();
> +       tlbsync();
> +
> +       /*
> +        * Leave the currently used broadcast ASIDs set in the bitmap, since
> +        * those cannot be reused before the next wraparound and flush..
> +        */
> +       bitmap_clear(broadcast_asid_used, 0, MAX_ASID_AVAILABLE);
> +       list_for_each_entry(context, &broadcast_asid_list, broadcast_asid_list)
> +               __set_bit(context->broadcast_asid, broadcast_asid_used);
> +
> +       last_broadcast_asid = TLB_NR_DYN_ASIDS;
> +}
> +
> +static u16 get_broadcast_asid(void)
> +{
> +       lockdep_assert_held(&broadcast_asid_lock);
> +
> +       do {
> +               u16 start = last_broadcast_asid;
> +               u16 asid = find_next_zero_bit(broadcast_asid_used, MAX_ASID_AVAILABLE, start);
> +
> +               if (asid >= MAX_ASID_AVAILABLE) {
> +                       reset_broadcast_asid_space();
> +                       continue;

Can this loop endlessly without making forward progress if we have a
few thousand processes on the system that are multi-threaded (or used
to be multi-threaded) and race the wrong way?
meets_broadcast_asid_threshold() checks if we have free IDs remaining,
but that check happens before broadcast_asid_lock is held, so we could
theoretically race such that no free IDs are available, right?
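
If so, maybe re-check availability under the lock and bail out,
roughly like this (sketch only, with the caller side hand-waved):

	/* with broadcast_asid_lock held, e.g. in use_broadcast_asid() */
	if (!broadcast_asid_available)
		return; /* keep using dynamic ASIDs for now */

	mm->context.broadcast_asid = get_broadcast_asid();

That way get_broadcast_asid() could rely on at least one free ID
existing, and the loop would be guaranteed to terminate.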

> +               }
> +
> +               /* Try claiming this broadcast ASID. */
> +               if (!test_and_set_bit(asid, broadcast_asid_used)) {
> +                       last_broadcast_asid = asid;
> +                       return asid;
> +               }
> +       } while (1);
> +}
[...]
> +/*
> + * Assign a broadcast ASID to the current process, protecting against
> + * races between multiple threads in the process.
> + */
> +static void use_broadcast_asid(struct mm_struct *mm)
> +{
> +       guard(raw_spinlock_irqsave)(&broadcast_asid_lock);
> +
> +       /* This process is already using broadcast TLB invalidation. */
> +       if (mm->context.broadcast_asid)
> +               return;
> +
> +       mm->context.broadcast_asid = get_broadcast_asid();
> +       mm->context.asid_transition = true;

This looks buggy to me: If we first set mm->context.broadcast_asid and
then later set mm->context.asid_transition, then a
flush_tlb_mm_range() that happens in between will observe
mm_broadcast_asid() being true (meaning broadcast invalidation should
be used) while mm->context.asid_transition is false (meaning broadcast
invalidation alone is sufficient); but actually we haven't even
started transitioning CPUs over to the new ASID yet, so I think the
flush does nothing?

Maybe change how mm->context.asid_transition works such that it is
immediately set on mm creation and cleared when the transition is
done, so that you don't have to touch it here?

Also, please use at least WRITE_ONCE() for writes here, and add
comments documenting ordering requirements.
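
If you do keep setting it here, I'd expect something roughly like
this (sketch, the ordering comment being the important part):

	/*
	 * Make asid_transition visible before broadcast_asid: anyone who
	 * sees a nonzero broadcast_asid must also see the transition
	 * flag, or they may skip flushing CPUs that are still running on
	 * their old per-CPU ASID.
	 */
	WRITE_ONCE(mm->context.asid_transition, true);
	WRITE_ONCE(mm->context.broadcast_asid, get_broadcast_asid());

(On x86 the store ordering is implicit, so this is mostly about
documenting it and keeping the compiler from reordering the writes.)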

> +       list_add(&mm->context.broadcast_asid_list, &broadcast_asid_list);
> +       broadcast_asid_available--;
> +}
[...]
> +static void finish_asid_transition(struct flush_tlb_info *info)
> +{
> +       struct mm_struct *mm = info->mm;
> +       int bc_asid = mm_broadcast_asid(mm);
> +       int cpu;
> +
> +       if (!mm->context.asid_transition)

AFAIU this can be accessed concurrently - please use at least
READ_ONCE(). (I think in the current version of the patch, this needs
to be ordered against the preceding mm_broadcast_asid() read, but
that's implicit on x86, so I guess writing a barrier here would be
superfluous.)
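
In other words:

	if (!READ_ONCE(mm->context.asid_transition))
		return;

possibly with a comment noting that the ordering against the earlier
mm_broadcast_asid() read is implicit on x86.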

> +               return;
> +
> +       for_each_cpu(cpu, mm_cpumask(mm)) {
> +               if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
> +                       continue;

switch_mm_irqs_off() picks an ASID and writes CR3 before writing loaded_mm:
"/* Make sure we write CR3 before loaded_mm. */"

Can we race with a concurrent switch_mm_irqs_off() on the other CPU
such that the other CPU has already switched CR3 to our MM using the
old ASID, but has not yet written loaded_mm, such that we skip it
here? And then we'll think we finished the ASID transition, and the
next time we do a flush, we'll wrongly omit the flush for that other
CPU even though it's still using the old ASID?

> +
> +               /*
> +                * If at least one CPU is not using the broadcast ASID yet,
> +                * send a TLB flush IPI. The IPI should cause stragglers
> +                * to transition soon.
> +                */
> +               if (per_cpu(cpu_tlbstate.loaded_mm_asid, cpu) != bc_asid) {

READ_ONCE()? Also, I think this needs a comment explaining that this
can race with concurrent MM switches such that we wrongly think that
there's a straggler (because we're not reading the loaded_mm and the
loaded_mm_asid as one atomic combination).
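
Roughly (sketch, assuming a spurious IPI is acceptable here):

	/*
	 * This can race with a concurrent mm switch: loaded_mm and
	 * loaded_mm_asid are not read as one atomic pair, so we may see
	 * a stale ASID and treat this CPU as a straggler. That only
	 * causes a spurious flush IPI, which is harmless.
	 */
	if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {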

> +                       flush_tlb_multi(mm_cpumask(info->mm), info);
> +                       return;
> +               }
> +       }
> +
> +       /* All the CPUs running this process are using the broadcast ASID. */
> +       mm->context.asid_transition = 0;

WRITE_ONCE()?
Also: This is a bool, please use "false".
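
So:

	/* Transition complete; broadcast flushes alone suffice from now on. */
	WRITE_ONCE(mm->context.asid_transition, false);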

> +}
> +
> +static void broadcast_tlb_flush(struct flush_tlb_info *info)
> +{
> +       bool pmd = info->stride_shift == PMD_SHIFT;
> +       unsigned long maxnr = invlpgb_count_max;
> +       unsigned long asid = info->mm->context.broadcast_asid;
> +       unsigned long addr = info->start;
> +       unsigned long nr;
> +
> +       /* Flushing multiple pages at once is not supported with 1GB pages. */
> +       if (info->stride_shift > PMD_SHIFT)
> +               maxnr = 1;
> +
> +       if (info->end == TLB_FLUSH_ALL) {
> +               invlpgb_flush_single_pcid(kern_pcid(asid));

What orders this flush with the preceding page table update? Does the
instruction implicitly get ordered after preceding memory writes, or
do we get that ordering from inc_mm_tlb_gen() or something like that?

> +               /* Do any CPUs supporting INVLPGB need PTI? */
> +               if (static_cpu_has(X86_FEATURE_PTI))
> +                       invlpgb_flush_single_pcid(user_pcid(asid));
> +       } else do {
> +               /*
> +                * Calculate how many pages can be flushed at once; if the
> +                * remainder of the range is less than one page, flush one.
> +                */
> +               nr = min(maxnr, (info->end - addr) >> info->stride_shift);
> +               nr = max(nr, 1);
> +
> +               invlpgb_flush_user_nr(kern_pcid(asid), addr, nr, pmd);
> +               /* Do any CPUs supporting INVLPGB need PTI? */
> +               if (static_cpu_has(X86_FEATURE_PTI))
> +                       invlpgb_flush_user_nr(user_pcid(asid), addr, nr, pmd);
> +               addr += nr << info->stride_shift;
> +       } while (addr < info->end);
> +
> +       finish_asid_transition(info);
> +
> +       /* Wait for the INVLPGBs kicked off above to finish. */
> +       tlbsync();
> +}
> +#endif /* CONFIG_CPU_SUP_AMD */
[...]
> @@ -769,6 +1042,16 @@ static void flush_tlb_func(void *info)
>         if (unlikely(loaded_mm == &init_mm))
>                 return;
>
> +       /* Reload the ASID if transitioning into or out of a broadcast ASID */
> +       if (needs_broadcast_asid_reload(loaded_mm, loaded_mm_asid)) {
> +               switch_mm_irqs_off(NULL, loaded_mm, NULL);
> +               loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> +       }
> +
> +       /* Broadcast ASIDs are always kept up to date with INVLPGB. */
> +       if (is_broadcast_asid(loaded_mm_asid))
> +               return;

This relies on the mm_broadcast_asid() read in flush_tlb_mm_range()
being ordered after the page table update, correct? And we get that
required ordering from the inc_mm_tlb_gen(), which implies a full
barrier? It might be nice if there were some more comments on this.
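
Maybe something along these lines, assuming my reading of the
ordering is right:

	/*
	 * Ordering: flush_tlb_mm_range() does inc_mm_tlb_gen() (a full
	 * barrier) before reading mm_broadcast_asid(), so the broadcast
	 * flush / IPI decision is ordered after the page table update
	 * being flushed for.
	 */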

>         VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
>                    loaded_mm->context.ctx_id);
>




