->get_on_cpu_info(cpu):
  ->!cpumask_test_and_set_cpu(cpu, &on_cpu_info_lock)
    ->!test_and_set_bit(cpu, on_cpu_info_lock->bits):
	return (old & mask) != 0;

'mask' always has the CPU bit set, which means that get_on_cpu_info()
returns true if and only if 'old' has the bit clear. I think that
prevents a thread from taking the lock if it's already held, so from
that point of view it does function as a per target CPU spinlock. Have
I misunderstood something?

Regardless of the above, on_cpu_async() is used like this:

	on_cpu_async()
	wait_for_synchronization()

so it doesn't make much sense to optimize for performance in the case
where multiple threads call on_cpu_async() concurrently, as they would
have to wait for synchronization anyway.

So yes, I'm totally in favour of replacing the per-CPU spinlock with a
global spinlock, even if the only reason is to simplify the code.

> to correct it. Also simplify the break case for on_cpu_async() -
> we don't care if func is NULL, we only care that the cpu is idle.

That makes sense.

> And, finally, add a missing barrier to on_cpu_async().

Might be worth explaining in the commit message why it was missing,
just in case someone is looking at the code and isn't exactly sure why
it's there.

>
> Fixes: 018550041b38 ("arm/arm64: Remove spinlocks from on_cpu_async")
> Signed-off-by: Andrew Jones <andrew.jones@xxxxxxxxx>
> ---
>  lib/on-cpus.c | 36 +++++++++++-------------------------
>  1 file changed, 11 insertions(+), 25 deletions(-)
>
> diff --git a/lib/on-cpus.c b/lib/on-cpus.c
> index 892149338419..f6072117fa1b 100644
> --- a/lib/on-cpus.c
> +++ b/lib/on-cpus.c
> @@ -9,6 +9,7 @@
>  #include <on-cpus.h>
>  #include <asm/barrier.h>
>  #include <asm/smp.h>
> +#include <asm/spinlock.h>
>
>  bool cpu0_calls_idle;
>
> @@ -18,18 +19,7 @@ struct on_cpu_info {
>  	cpumask_t waiters;
>  };
>  static struct on_cpu_info on_cpu_info[NR_CPUS];
> -static cpumask_t on_cpu_info_lock;
> -
> -static bool get_on_cpu_info(int cpu)
> -{
> -	return !cpumask_test_and_set_cpu(cpu, &on_cpu_info_lock);
> -}
> -
> -static void put_on_cpu_info(int cpu)
> -{
> -	int ret = cpumask_test_and_clear_cpu(cpu, &on_cpu_info_lock);
> -	assert(ret);
> -}
> +static struct spinlock lock;
>
>  static void __deadlock_check(int cpu, const cpumask_t *waiters, bool *found)
>  {
> @@ -81,18 +71,14 @@ void do_idle(void)
>  	if (cpu == 0)
>  		cpu0_calls_idle = true;
>
> -	set_cpu_idle(cpu, true);
> -	smp_send_event();
> -
>  	for (;;) {
> +		set_cpu_idle(cpu, true);
> +		smp_send_event();
> +
>  		while (cpu_idle(cpu))
>  			smp_wait_for_event();
>  		smp_rmb();
>  		on_cpu_info[cpu].func(on_cpu_info[cpu].data);
> -		on_cpu_info[cpu].func = NULL;
> -		smp_wmb();

I think the barrier is still needed. The barrier ordered the now
removed write func = NULL before the write set_cpu_idle(), but it also
ordered whatever writes func(data) performed before set_cpu_idle(cpu,
true). This matters for on_cpu(), where I think it's reasonable for
the caller to expect to observe the writes made by 'func' after
on_cpu() returns.

If you agree that this is the correct approach, I think it's worth
adding a comment explaining it.
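Maybe something like this (untested sketch, and the comment wording is
only a suggestion):

		smp_rmb();
		on_cpu_info[cpu].func(on_cpu_info[cpu].data);
		/*
		 * Order the writes made by func() before the write to
		 * cpu_idle_mask at the top of the loop, so that the caller
		 * of on_cpu() also observes func()'s writes once it sees
		 * the CPU as idle again.
		 */
		smp_wmb();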
> -		set_cpu_idle(cpu, true);
> -		smp_send_event();
>  	}
>  }
>
> @@ -110,17 +96,17 @@ void on_cpu_async(int cpu, void (*func)(void *data), void *data)
>
>  	for (;;) {
>  		cpu_wait(cpu);
> -		if (get_on_cpu_info(cpu)) {
> -			if ((volatile void *)on_cpu_info[cpu].func == NULL)
> -				break;
> -			put_on_cpu_info(cpu);
> -		}
> +		spin_lock(&lock);
> +		if (cpu_idle(cpu))
> +			break;
> +		spin_unlock(&lock);
>  	}
>
>  	on_cpu_info[cpu].func = func;
>  	on_cpu_info[cpu].data = data;
> +	smp_wmb();

Without this smp_wmb(), it is possible for the target CPU to read an
outdated on_cpu_info[cpu].data. So adding it is the right thing to do,
since it orders the writes to on_cpu_info before set_cpu_idle().

>  	set_cpu_idle(cpu, false);
> -	put_on_cpu_info(cpu);
> +	spin_unlock(&lock);
>  	smp_send_event();

I think a DSB is necessary before all the smp_send_event() calls in
this file. The DSB ensures that the stores to cpu_idle_mask will be
observed by the thread that is waiting on the WFE; otherwise it is
theoretically possible to get a deadlock (in practice this will never
happen, because KVM will be generating the events that cause WFE to
complete):

	CPU0: on_cpu_async():		CPU1: do_idle():
					load CPU1_idle = true
	// do stuff
	store CPU1_idle = false
	SEV
					1: WFE
					   load CPU1_idle = true // old value, allowed
					   b 1b			 // deadlock

Also, it looks unusual to have smp_send_event() unpaired from
set_cpu_idle(). Can't really point to anything being wrong about it
though.
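One way to add the DSB without touching every call site would be to
fold it into smp_send_event() itself. A sketch, assuming
smp_send_event() is (or becomes) a thin wrapper around sev(); whether
dsb(ishst) is sufficient, or dsb(ish) is needed, is worth
double-checking against the architecture:

	static inline void smp_send_event(void)
	{
		/*
		 * Complete the stores to cpu_idle_mask (and on_cpu_info)
		 * before signalling the event, so a CPU woken from WFE
		 * reads the new values. dsb(ishst) is assumed sufficient
		 * here since only stores need to complete.
		 */
		dsb(ishst);
		sev();
	}

Thanks,
Alex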