Hi Drew, On Tue, Oct 29, 2024 at 11:56:58AM +0100, Andrew Jones wrote: > On Fri, Oct 25, 2024 at 01:19:26PM +0100, Alexandru Elisei wrote: > > Hi Drew, > > > > I've been paging in all the on_cpu* machinery, and it occurred to me that > > we have a chance to simplify the code and to remove a duplicate interface > > by not exposing smp_boot_secondary() to the tests, as the on_cpu* functions > > serve the same purpose. With this change, we can remove the entry argument > > to smp_boot_secondary(), and the assembly for secondary_entry can be made > > simpler by eliminating the branch to the entry function. > > You're right that smp_boot_secondary() doesn't appear to getting use, but > a goal for the library code is to be useful and with defaults that will > satisfy nearly all unit tests, but to never restrict a unit test to > having to use it. Exposing calls like smp_boot_secondary() give unit tests > the ability to eliminate as much library code as possible from their paths > without having to duplicate the few lines needed to "boot" a secondary. Yep, that makes sense. > > > > > Do you think that would be something worth pursuing? I can have a look at > > it. > > > > There are exactly two places where smp_boot_secondary() is used: in > > arm/psci.c and arm/gic.c, and from a quick glance it looks to me like those > > can be replaced with one of the on_cpu* functions. > > > > On Wed, Oct 23, 2024 at 03:17:20PM +0200, Andrew Jones wrote: > > > get/put_on_cpu_info() were trying to provide per-cpu locking for > > > the per-cpu on_cpu info, but they were flawed since they would > > > always set the "lock" since they were treating test_and_set/clear > > > as cmpxchg (which they're not). Just revert to a normal spinlock > > > > Would you mind expanding on that a bit more? > > > > From my understanding of the code, on arm64, this is the call chain that I get > > for get_on_cpu_info(cpu): > > > > ->get_on_cpu_info(cpu): > > ->!cpumask_test_and_set(cpu, on_cpu_info_lock) > > ->!test_and_set_bit(cpu, on_cpu_info_lock->bits): > > return (old & mask) != 0; > > > > 'mask' always has the CPU bit set, which means that get_on_cpu_info() returns > > true if and only if 'old' has the bit clear. I think that prevents a thread > > getting the lock if it's already held, so from that point of view it does > > function as a per target cpu spinlock. Have I misunderstood something? > > No, you're right, and thanks for reminding me of the thought process I > used when I wrote get/put_on_cpu_info() in the first place :-) While > trying to simplify things, I just got fed up with staring at it and > reasoning about it, so I threw my hands up and went back to a good ol' > lock. Now that you've convinced me [again] that the test-and-set isn't > flawed, we could keep it, but... > > > > > Regardless of the above, on_cpu_async() is used like this: > > > > on_cpu_async() > > wait_for_synchronization() > > > > so it doesn't make much sense to optimize for performance for the case were > > multiple threads call on_cpu_async() concurrently, as they would need to > > have to wait for synchronization anyway. > > ...even though on_cpu_async() has a wait part, it only waits (outside the > lock) for the target cpu to be idle. If all targets are idle already, and > we have more than one "launcher" cpu, then we could launch tests on all > the targets more-or-less simultaneously with percpu locks, but... Ahah, that's very interesting, hadn't though about it that way. Another interesting thing here is that the glanularity for ATOMIC_TESTOP is 8 * 8 = 64 bits, because LDXR loads from an 8 byte address, which is then marked for exclusive access. I think that means that in the end, if you have less than 64 CPUs, then the per-cpu spinlock will end looking similar to a spinlock. > > > > > So yes, I'm totally in favour for replacing the per-cpu spinlock with a global > > spinlock, even if the only reason is simplifying the code. > > ...simplifying the code is probably the better choice since the critical > section is so small that sharing a lock really shouldn't matter much and > even some contention test which needs to run simultaneously on several > cpus should use looping rather than rely on a simultaneous launch. Yes, please use a spinlock here. > > > > > > to correct it. Also simplify the break case for on_cpu_async() - > > > we don't care if func is NULL, we only care that the cpu is idle. > > > > That makes sense. > > > > > And, finally, add a missing barrier to on_cpu_async(). > > > > Might be worth explaining in the commit message why it was missing. Just in > > case someone is looking at the code and isn't exactly sure why it's there. > > Sure. I think the reason it's missing is because when 018550041b38 changed > from spinlock to put_on_cpu_info(), the release was also moved below the > clearing of idle. Prior to that, the spin_unlock() was acting as barrier. > With the move of where the release was done a barrier should have been > added. > > > > > > > > > Fixes: 018550041b38 ("arm/arm64: Remove spinlocks from on_cpu_async") > > > Signed-off-by: Andrew Jones <andrew.jones@xxxxxxxxx> > > > --- > > > lib/on-cpus.c | 36 +++++++++++------------------------- > > > 1 file changed, 11 insertions(+), 25 deletions(-) > > > > > > diff --git a/lib/on-cpus.c b/lib/on-cpus.c > > > index 892149338419..f6072117fa1b 100644 > > > --- a/lib/on-cpus.c > > > +++ b/lib/on-cpus.c > > > @@ -9,6 +9,7 @@ > > > #include <on-cpus.h> > > > #include <asm/barrier.h> > > > #include <asm/smp.h> > > > +#include <asm/spinlock.h> > > > > > > bool cpu0_calls_idle; > > > > > > @@ -18,18 +19,7 @@ struct on_cpu_info { > > > cpumask_t waiters; > > > }; > > > static struct on_cpu_info on_cpu_info[NR_CPUS]; > > > -static cpumask_t on_cpu_info_lock; > > > - > > > -static bool get_on_cpu_info(int cpu) > > > -{ > > > - return !cpumask_test_and_set_cpu(cpu, &on_cpu_info_lock); > > > -} > > > - > > > -static void put_on_cpu_info(int cpu) > > > -{ > > > - int ret = cpumask_test_and_clear_cpu(cpu, &on_cpu_info_lock); > > > - assert(ret); > > > -} > > > +static struct spinlock lock; > > > > > > static void __deadlock_check(int cpu, const cpumask_t *waiters, bool *found) > > > { > > > @@ -81,18 +71,14 @@ void do_idle(void) > > > if (cpu == 0) > > > cpu0_calls_idle = true; > > > > > > - set_cpu_idle(cpu, true); > > > - smp_send_event(); > > > - > > > for (;;) { > > > + set_cpu_idle(cpu, true); > > > + smp_send_event(); > > > + > > > while (cpu_idle(cpu)) > > > smp_wait_for_event(); > > > smp_rmb(); > > > on_cpu_info[cpu].func(on_cpu_info[cpu].data); > > > - on_cpu_info[cpu].func = NULL; > > > - smp_wmb(); > > > > I think the barrier is still needed. The barrier orderered the now removed > > write func = NULL before the write set_cpu_idle(), but it also orderered > > whatever writes func(data) performed before set_cpu_idle(cpu, true). This > > matters for on_cpu(), where I think it's reasonable for the caller to > > expect to observe the writes made by 'func' after on_cpu() returns. > > > > If you agree that this is the correct approach, I think it's worth adding a > > comment explaining it. > > I think that's reasonable (along with adding smp_rmb()'s to the bottom of > on_cpu() and on_cpumask()). So the idea is we have I don't think adding smp_rmb() to on_cpu() and on_cpumask() is strictly necessary, because cpumask_set_cpu()->set_bit() already has a smp_mb(). > > cpu1 cpu2 > ---- ---- > func() /* store something */ /* wait for cpu1 to be idle */ > smp_wmb(); smp_rmb(); > set_cpu_idle(smp_processor_id(), true); /* load what func() stored */ > Just one small thing here, so it's even more precise what is being ordered: the smp_rmb() on cpu2 orders the read from cpu_online_mask() before the load from what func() stored. > > > > > - set_cpu_idle(cpu, true); > > > - smp_send_event(); > > > } > > > } > > > > > > @@ -110,17 +96,17 @@ void on_cpu_async(int cpu, void (*func)(void *data), void *data) > > > > > > for (;;) { > > > cpu_wait(cpu); > > > - if (get_on_cpu_info(cpu)) { > > > - if ((volatile void *)on_cpu_info[cpu].func == NULL) > > > - break; > > > - put_on_cpu_info(cpu); > > > - } > > > + spin_lock(&lock); > > > + if (cpu_idle(cpu)) > > > + break; > > > + spin_unlock(&lock); > > > } > > > > > > on_cpu_info[cpu].func = func; > > > on_cpu_info[cpu].data = data; > > > + smp_wmb(); > > > > Without this smp_wmb(), it is possible for the target CPU to read an > > outdated on_cpu_info[cpu].data. So adding it is the right thing to do, > > since it orders the writes to on_cpu_info before set_cpu_idle(). > > > > > set_cpu_idle(cpu, false); > > > - put_on_cpu_info(cpu); > > > + spin_unlock(&lock); > > > smp_send_event(); > > > > I think a DSB is necessary before all the smp_send_event() calls in this > > file. The DSB ensures that the stores to cpu_idle_mask will be observed by > > the thread that is waiting on the WFE, otherwise it is theoretically > > possible to get a deadlock (in practice this will never happen, because KVM > > will be generating the events that cause WFE to complete): > > > > CPU0: on_cpu_async(): CPU1: do_idle(): > > > > load CPU1_idle = true > > //do stuff > > store CPU1_idle=false > > SEV > > 1: WFE > > load CPU1_idle=true // old value, allowed > > b 1b // deadlock > > Good catch. Can you send a patch for that? The patch should be for the arm > implementations of smp_send_event() (and smp_wait_for_event()?) in > lib/arm/asm/smp.h. Sure, I can do that. Should I wait until this series gets merged? Also, I don't think smp_wait_for_event() needs a barrier - a wmb() before smp_send_event() will make sure that all stores have **completed** before smp_send_event() is executed, so on the waiting CPU all the stores will be visible. Using a smp_wmp() (or smp_mb()) will not work, because smp_send_event() is not a memory operation, and the smp_* primitives won't order the memory accesses against it. > > > > > Also, it looks unusual to have smp_send_event() unpaired from set_cpu_idle(). > > Can't really point to anything being wrong about it though. > > You mean due to the spin_unlock() separating the set_cpu_idle() and > smp_send_event()? We could put the spin_unlock() above the set_cpu_idle() > (as it was in 018550041b38, which also removes the need for the wmb), but > I think I like it the way it is now better for better readability. I > also can't think of why it would matter for them to be unpaired. Then choose whatever looks best for you :) Thanks, Alex