On Wed, Jan 18, 2023 at 10:39:28PM +0000, Joel Fernandes wrote:
> On Wed, Jan 18, 2023 at 10:37 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, Jan 17, 2023 at 08:00:58PM -0800, Paul E. McKenney wrote:
> > [...]
> > > > > > > Is there a plan to make CPU hotplug failures more frequent?
> > > > > >
> > > > > > I am not aware of such a plan but I was going by "There are quite some
> > > > > > reasons why a CPU-hotplug or a hot-unplug operation can fail, which is
> > > > > > not a fatal problem, really." in [1].
> > > > > >
> > > > > > What about an rcutorture to skip hotplug for a certain cpu id,
> > > > > > rcutorture.skip_hotplug_cpus="0". Can be a last resort. But we/I
> > > > > > should debug this issue more before getting to that.
> > > > >
> > > > > Yes, in fact there already are some checks along those lines, for example,
> > > > > the torture_offline() function's check of cpu_is_hotpluggable(). So for
> > > > > example, as I understand it, a CONFIG_NO_HZ_FULL=y system should mark
> > > > > the housekeeping CPU as !cpu_is_hotpluggable().
> > > >
> > > > I don't think CONFIG_NO_HZ_FULL does any such marking (at least I am
> > > > not seeing it). Even on x86, if you enable
> > > > CONFIG_BOOTPARAM_HOTPLUG_CPU0=y, and CONFIG_NO_HZ_FULL=y, and run
> > > > rcutorture with boot args:
> > > >
> > > > nohz_full=0-3 rcutorture.onoff_interval=100 rcutorture.onoff_holdoff=2
> > > > rcutorture.shutdown_secs=30
> > > >
> > > > You will see this in the kernel logs:
> > > > [ 2.816022] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > [ 2.975913] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > >
> > > > So RCU torture test clearly thought the CPUs were hot-pluggable, when
> > > > there was a chance for them to return -EBUSY (due to housekeeping and
> > > > what not). So this issue seems to be architecture independent, in that
> > > > sense.
> > > >
> > > > So the 2 ways forward I see are:
> > > > - Make the torture test aware of which CPUs are 'house keeping'
> > > > - Make it possible to turn off CPU0 hotplugging on ARM64 by default
> > > >   (via CONFIG or boot option).
> > > >
> > > > Another option could be, forgive -EBUSY on CPU0 for
> > > > CONFIG_NO_HZ_FULL=y. Is it possible to assign a non-0 CPU id as a
> > > > housekeeping CPU?
> > >
> > > I would be happier to forgive failure to offline housekeeping CPUs than
> > > blanket forgiveness of CPU 0. Especially given that I recently got
> > > burned by a non-zero boot cpu. ;-)
> > >
> > > But wouldn't it be even better for cpu_is_hotpluggable() to know the
> > > NO_HZ_FULL rules of the road?
> >
> > That's a great idea. I found a way to do that without having to do the
> > EXPORT_SYMBOL (like in Zhouyi's patch).
> >
> > Would the following be acceptable (only build-tested)?
> >
> > I can run more tests and submit a patch:
> >
> > diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> > index 55405ebf23ab..f73bc520b70e 100644
> > --- a/drivers/base/cpu.c
> > +++ b/drivers/base/cpu.c
> > @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = {
> >  bool cpu_is_hotpluggable(unsigned int cpu)
> >  {
> >  	struct device *dev = get_cpu_device(cpu);
> > -	return dev && container_of(dev, struct cpu, dev)->hotpluggable;
> > +	return dev && container_of(dev, struct cpu, dev)->hotpluggable
> > +		&& !tick_nohz_cpu_hotpluggable(cpu);
>
> Oops, I should lose that "!" , but otherwise should be ok.

Looks plausible to me, but I must defer to Frederic and the various
architecture maintainers.

							Thanx, Paul
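
For illustration only, here is a minimal userspace sketch of the combined
check discussed above, with the stray "!" dropped as Joel notes. It is not
kernel code: every mock_* name, the MOCK_NR_CPUS constant, and the choice of
CPU 0 as the housekeeping CPU are assumptions made up for this example, and
the real behavior of tick_nohz_cpu_hotpluggable() is only modeled roughly.

```c
#include <stdbool.h>

#define MOCK_NR_CPUS 4	/* hypothetical CPU count for this sketch */

/* Hypothetical stand-in for each CPU device's ->hotpluggable flag. */
bool mock_dev_hotpluggable[MOCK_NR_CPUS] = { true, true, true, true };

/* Hypothetical stand-in for the CPU that must stay online; -1 means none. */
int mock_housekeeping_cpu = 0;

/* Rough model of tick_nohz_cpu_hotpluggable(): false for the housekeeping CPU. */
bool mock_tick_nohz_cpu_hotpluggable(unsigned int cpu)
{
	return mock_housekeeping_cpu < 0 ||
	       (unsigned int)mock_housekeeping_cpu != cpu;
}

/* Combined check: device says hotpluggable AND nohz_full does not veto it. */
bool mock_cpu_is_hotpluggable(unsigned int cpu)
{
	return cpu < MOCK_NR_CPUS && mock_dev_hotpluggable[cpu]
		&& mock_tick_nohz_cpu_hotpluggable(cpu);
}
```

With these assumptions, a torture test that consults the combined check
would skip CPU 0 instead of attempting an offline that fails with -EBUSY,
while CPUs 1-3 would still be exercised.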