->get_on_cpu_info(cpu):
  ->!cpumask_test_and_set_cpu(cpu, &on_cpu_info_lock)
    ->!test_and_set_bit(cpu, on_cpu_info_lock->bits):
	return (old & mask) != 0;

'mask' always has the CPU bit set, which means that get_on_cpu_info()
returns true if and only if 'old' has the bit clear. I think that
prevents a thread from taking the lock if it's already held, so from
that point of view it does function as a per target CPU spinlock. Have
I misunderstood something?

Regardless of the above, on_cpu_async() is used like this:

	on_cpu_async()
	wait_for_synchronization()

so it doesn't make much sense to optimize for performance in the case
where multiple threads call on_cpu_async() concurrently, as they would
have to wait for synchronization anyway.

So yes, I'm totally in favour of replacing the per-CPU spinlock with a
global spinlock, even if the only reason is to simplify the code.

> to correct it. Also simplify the break case for on_cpu_async() -
> we don't care if func is NULL, we only care that the cpu is idle.

That makes sense.

> And, finally, add a missing barrier to on_cpu_async().

Might be worth explaining in the commit message why it was missing,
just in case someone is looking at the code and isn't exactly sure why
it's there.

>
> Fixes: 018550041b38 ("arm/arm64: Remove spinlocks from on_cpu_async")
> Signed-off-by: Andrew Jones <andrew.jones@xxxxxxxxx>
> ---
>  lib/on-cpus.c | 36 +++++++++++-------------------------
>  1 file changed, 11 insertions(+), 25 deletions(-)
>
> diff --git a/lib/on-cpus.c b/lib/on-cpus.c
> index 892149338419..f6072117fa1b 100644
> --- a/lib/on-cpus.c
> +++ b/lib/on-cpus.c
> @@ -9,6 +9,7 @@
>  #include <on-cpus.h>
>  #include <asm/barrier.h>
>  #include <asm/smp.h>
> +#include <asm/spinlock.h>
>
>  bool cpu0_calls_idle;
>
> @@ -18,18 +19,7 @@ struct on_cpu_info {
>  	cpumask_t waiters;
>  };
>  static struct on_cpu_info on_cpu_info[NR_CPUS];
> -static cpumask_t on_cpu_info_lock;
> -
> -static bool get_on_cpu_info(int cpu)
> -{
> -	return !cpumask_test_and_set_cpu(cpu, &on_cpu_info_lock);
> -}
> -
> -static void put_on_cpu_info(int cpu)
> -{
> -	int ret = cpumask_test_and_clear_cpu(cpu, &on_cpu_info_lock);
> -	assert(ret);
> -}
> +static struct spinlock lock;
>
>  static void __deadlock_check(int cpu, const cpumask_t *waiters, bool *found)
>  {
> @@ -81,18 +71,14 @@ void do_idle(void)
>  	if (cpu == 0)
>  		cpu0_calls_idle = true;
>
> -	set_cpu_idle(cpu, true);
> -	smp_send_event();
> -
>  	for (;;) {
> +		set_cpu_idle(cpu, true);
> +		smp_send_event();
> +
>  		while (cpu_idle(cpu))
>  			smp_wait_for_event();
>  		smp_rmb();
>  		on_cpu_info[cpu].func(on_cpu_info[cpu].data);
> -		on_cpu_info[cpu].func = NULL;
> -		smp_wmb();

I think the barrier is still needed. The barrier ordered the now
removed write func = NULL before the write set_cpu_idle(), but it also
ordered whatever writes func(data) performed before set_cpu_idle(cpu,
true). This matters for on_cpu(), where I think it's reasonable for
the caller to expect to observe the writes made by 'func' after
on_cpu() returns.

If you agree that this is the correct approach, I think it's worth
adding a comment explaining it.
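Maybe something like this (untested sketch, and the comment wording is
only a suggestion):

		smp_rmb();
		on_cpu_info[cpu].func(on_cpu_info[cpu].data);
		/*
		 * Order the writes made by func() before the write to
		 * cpu_idle_mask at the top of the loop, so that the caller
		 * of on_cpu() also observes func()'s writes once it sees
		 * the CPU as idle again.
		 */
		smp_wmb();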
> -		set_cpu_idle(cpu, true);
> -		smp_send_event();
>  	}
>  }
>
> @@ -110,17 +96,17 @@ void on_cpu_async(int cpu, void (*func)(void *data), void *data)
>
>  	for (;;) {
>  		cpu_wait(cpu);
> -		if (get_on_cpu_info(cpu)) {
> -			if ((volatile void *)on_cpu_info[cpu].func == NULL)
> -				break;
> -			put_on_cpu_info(cpu);
> -		}
> +		spin_lock(&lock);
> +		if (cpu_idle(cpu))
> +			break;
> +		spin_unlock(&lock);
>  	}
>
>  	on_cpu_info[cpu].func = func;
>  	on_cpu_info[cpu].data = data;
> +	smp_wmb();

Without this smp_wmb(), it is possible for the target CPU to read an
outdated on_cpu_info[cpu].data. So adding it is the right thing to do,
since it orders the writes to on_cpu_info before set_cpu_idle().

>  	set_cpu_idle(cpu, false);
> -	put_on_cpu_info(cpu);
> +	spin_unlock(&lock);
>  	smp_send_event();

I think a DSB is necessary before all the smp_send_event() calls in
this file. The DSB ensures that the stores to cpu_idle_mask will be
observed by the thread that is waiting on the WFE; otherwise it is
theoretically possible to get a deadlock (in practice this will never
happen, because KVM will be generating the events that cause WFE to
complete):

	CPU0: on_cpu_async():		CPU1: do_idle():
					load CPU1_idle = true
	// do stuff
	store CPU1_idle = false
	SEV
					1: WFE
					   load CPU1_idle = true // old value, allowed
					   b 1b			 // deadlock

Also, it looks unusual to have smp_send_event() unpaired from
set_cpu_idle(). Can't really point to anything being wrong about it
though.
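One way to add the DSB without touching every call site would be to
fold it into smp_send_event() itself. A sketch, assuming
smp_send_event() is (or becomes) a thin wrapper around sev(); whether
dsb(ishst) is sufficient, or dsb(ish) is needed, is worth
double-checking against the architecture:

	static inline void smp_send_event(void)
	{
		/*
		 * Complete the stores to cpu_idle_mask (and on_cpu_info)
		 * before signalling the event, so a CPU woken from WFE
		 * reads the new values. dsb(ishst) is assumed sufficient
		 * here since only stores need to complete.
		 */
		dsb(ishst);
		sev();
	}

Thanks,
Alex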