On Thu, 6 Feb 2025 at 05:47, Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
> +static u16 get_global_asid(void)
> +{
> +
> +	u16 asid;
> +
> +	lockdep_assert_held(&global_asid_lock);
> +
> +	/* The previous allocated ASID is at the top of the address space. */
> +	if (last_global_asid >= MAX_ASID_AVAILABLE - 1)
> +		reset_global_asid_space();
> +
> +	asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, last_global_asid);
> +
> +	if (asid >= MAX_ASID_AVAILABLE) {
> +		/* This should never happen. */
> +		VM_WARN_ONCE(1, "Unable to allocate global ASID despite %d available\n", global_asid_available);

If you'll forgive the nitpicking, please put the last arg on a new line or
otherwise break this up; the rest of this file keeps below 100 chars (this
is 113).

> +		return 0;
> +	}
> +
> +	/* Claim this global ASID. */
> +	__set_bit(asid, global_asid_used);
> +	last_global_asid = asid;
> +	global_asid_available--;
> +	return asid;
> +}
> +
> +/*
> + * Returns true if the mm is transitioning from a CPU-local ASID to a global
> + * (INVLPGB) ASID, or the other way around.
> + */
> +static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> +	u16 global_asid = mm_global_asid(next);
> +
> +	if (global_asid && prev_asid != global_asid)
> +		return true;
> +
> +	if (!global_asid && is_global_asid(prev_asid))
> +		return true;

I think this needs clarification around when switches from global->nonglobal
happen. Maybe commentary, or maybe there's a way to express the code that
just makes it obvious.

Here's what I currently understand, please correct me if I'm wrong:

- Once a process gets a global ASID it keeps it forever, so within a process
  we never switch global->nonglobal.
- In flush_tlb_func() we are just calling this to check whether the process
  has just been given a global ASID - there's no way loaded_mm_asid can be
  global while !mm_global_asid(loaded_mm).
- When we call this from switch_mm_irqs_off() we are in the prev==next case.
Is there something about lazy TLB that can cause the case above to happen
here?

> +static bool meets_global_asid_threshold(struct mm_struct *mm)
> +{
> +	if (!global_asid_available)

I think we need READ_ONCE() here.

Also - this doesn't really make sense in this function as it's currently
named. I think we could just inline this whole function into
consider_global_asid(); it would still be nice and readable IMO.

> @@ -786,6 +1101,8 @@ static void flush_tlb_func(void *info)
>  		return;
>  	}
>
> +	local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
> +
>  	if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
>  		     f->new_tlb_gen <= local_tlb_gen)) {
>  		/*
> @@ -953,7 +1270,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
>  	 * up on the new contents of what used to be page tables, while
>  	 * doing a speculative memory access.
>  	 */
> -	if (info->freed_tables)
> +	if (info->freed_tables || in_asid_transition(info->mm))
>  		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
>  	else
>  		on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
> @@ -1058,9 +1375,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>  	 * a local TLB flush is needed. Optimize this use-case by calling
>  	 * flush_tlb_func_local() directly in this case.
>  	 */
> -	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
> +	if (mm_global_asid(mm)) {
> +		broadcast_tlb_flush(info);
> +	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
>  		info->trim_cpumask = should_trim_cpumask(mm);
>  		flush_tlb_multi(mm_cpumask(mm), info);
> +		consider_global_asid(mm);

Why do we do this here instead of when the CPU enters the mm? Is the idea
that, in combination with the jiffies check in consider_global_asid(), we
get a probability of getting a global ASID (within some time period) that
scales with the amount of TLB flushing the process does?
So then we avoid using up ASID space on processes that are multithreaded but
just sit around with stable VMAs etc?

I guess another reason would be that in the bizarre case where we ran out of
global ASIDs and then entered one big workload that effectively owns all the
CPUs, that big workload can still get a global ASID later once the old
processes free them up, even if we enter it before reset_global_asid_space().

Just to be clear - this isn't an objection, I just wanna see if I actually
understood the design. I guess it would be worth having a comment about it -
especially if I'm missing something or totally wrong.