On Thu, Jan 19, 2023 at 4:26 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>
>
>
> > On Jan 18, 2023, at 10:21 PM, Zhouyi Zhou <zhouzhouyi@xxxxxxxxx> wrote:
> >
> > On Thu, Jan 19, 2023 at 6:39 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> >>
> >>> On Wed, Jan 18, 2023 at 10:37 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> >>>
> >>>> On Tue, Jan 17, 2023 at 08:00:58PM -0800, Paul E. McKenney wrote:
> >>> [...]
> >>>>>>>> Is there a plan to make CPU hotplug failures more frequent?
> >>>>>>>
> >>>>>>> I am not aware of such a plan but I was going by "There are quite some
> >>>>>>> reasons why a CPU-hotplug or a hot-unplug operation can fail, which is
> >>>>>>> not a fatal problem, really." in [1].
> >>>>>>>
> >>>>>>> What about an rcutorture to skip hotplug for a certain cpu id,
> >>>>>>> rcutorture.skip_hotplug_cpus="0". Can be a last resort. But we/I
> >>>>>>> should debug this issue more before getting to that.
> >>>>>>
> >>>>>> Yes, in fact there already are some checks along those lines, for example,
> >>>>>> the torture_offline() function's check of cpu_is_hotpluggable(). So for
> >>>>>> example, as I understand it, a CONFIG_NO_HZ_FULL=y system should mark
> >>>>>> the housekeeping CPU as !cpu_is_hotpluggable().
> >>>>>
> >>>>> I don't think CONFIG_NO_HZ_FULL does any such marking (at least I am
> >>>>> not seeing it).
> >>>>> Even on x86, if you enable
> >>>>> CONFIG_BOOTPARAM_HOTPLUG_CPU0=y, and CONFIG_NO_HZ_FULL=y, and run
> >>>>> rcutorture with boot args:
> >>>>>
> >>>>> nohz_full=0-3 rcutorture.onoff_interval=100 rcutorture.onoff_holdoff=2
> >>>>> rcutorture.shutdown_secs=30
> >>>>>
> >>>>> You will see this in the kernel logs:
> >>>>> [ 2.816022] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> >>>>> [ 2.975913] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> >>>>>
> >>>>> So RCU torture test clearly thought the CPUs were hot-pluggable, when
> >>>>> there was a chance for them to return -EBUSY (due to housekeeping and
> >>>>> what not). So this issue seems to be architecture independent, in that
> >>>>> sense.
> >>>>>
> >>>>> So the 2 ways forward I see are:
> >>>>> - Make the torture test aware of which CPUs are 'house keeping'
> >>>>> - Make it possible to turn off CPU0 hotplugging on ARM64 by default
> >>>>>   (via CONFIG or boot option).
> >>>>>
> >>>>> Another option could be, forgive -EBUSY on CPU0 for
> >>>>> CONFIG_NO_HZ_FULL=y. Is it possible to assign a non-0 CPU id as a
> >>>>> housekeeping CPU?
> >>>>
> >>>> I would be happier to forgive failure to offline housekeeping CPUs than
> >>>> blanket forgiveness of CPU 0. Especially given that I recently got
> >>>> burned by a non-zero boot cpu. ;-)
> >>>>
> >>>> But wouldn't it be even better for cpu_is_hotpluggable() to know the
> >>>> NO_HZ_FULL rules of the road?
> >>>
> >>> That's a great idea. I found a way to do that without having to do the
> >>> EXPORT_SYMBOL (like in Zhouyi's patch).
> >>>
> >>> Would the following be acceptable (only build-tested)?
> >>>
> >>> I can run more tests and submit a patch:
> >>>
> >>> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> >>> index 55405ebf23ab..f73bc520b70e 100644
> >>> --- a/drivers/base/cpu.c
> >>> +++ b/drivers/base/cpu.c
> >>> @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = {
> >>>  bool cpu_is_hotpluggable(unsigned int cpu)
> >>>  {
> >>>         struct device *dev = get_cpu_device(cpu);
> >>> -       return dev && container_of(dev, struct cpu, dev)->hotpluggable;
> >>> +       return dev && container_of(dev, struct cpu, dev)->hotpluggable
> >>> +               && !tick_nohz_cpu_hotpluggable(cpu);
> >>
> >> Oops, I should lose that "!" , but otherwise should be ok.
> > Looks plausible to me. With your fantastic fix, I will perform
> > a new round of tests on the PPC VM at the Open Source Lab of Oregon State
> > University.
>
> Thank you! And if it passes, I will add your Tested-by tag for attribution if you do not mind.
Thank you very much in advance for giving me a Tested-by; I like it
very much ;-)
After applying 8e82c28ea2b4 ("torture: Make thread detection more
robust by using lscpu") to linux-5.15.y on the PPC64 VM, I can now
proceed with the torture test.
The test on the original linux-5.15.y still needs an hour or two to
finish; after that I can apply your fix and run another 20+ hours of
torture testing (it is a little slow because it runs on a virtual
machine).
Thank you for your patience.

Cheers
Zhouyi
>
> > I learned a lot during this process
>
> Cool!!
>
> - Joel
>
>
> >
> > Thanks
> > Zhouyi
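
For reference, the corrected check from the quoted patch (with the stray "!" dropped) can be modeled in user space. This is only a minimal sketch under stated assumptions: every name prefixed model_ is hypothetical and is not a kernel symbol, CPU 0 is assumed to be the housekeeping CPU with nohz_full active (mirroring the nohz_full=0-3 example above), and the body of the tick check is a guess at the intent, since the thread does not show the definition of tick_nohz_cpu_hotpluggable().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel symbols. */
#define MODEL_NR_CPUS 4

/* Stand-in for each CPU device's ->hotpluggable flag (assumed all set). */
const bool model_dev_hotpluggable[MODEL_NR_CPUS] = { true, true, true, true };

/* Assumptions: nohz_full is active and CPU 0 does the housekeeping. */
const bool model_nohz_full_running = true;
const unsigned int model_housekeeping_cpu = 0;

/*
 * Stand-in for the proposed tick-side check: returns true when the tick
 * subsystem would allow this CPU to go offline.
 */
bool model_tick_allows_hotplug(unsigned int cpu)
{
	return !(model_nohz_full_running && cpu == model_housekeeping_cpu);
}

/*
 * Corrected predicate: the device must be marked hotpluggable AND the
 * tick subsystem must allow it. Note there is no "!" on the second term.
 */
bool model_cpu_is_hotpluggable(unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	return model_dev_hotpluggable[cpu] && model_tick_allows_hotplug(cpu);
}
```

Under these assumptions, model_cpu_is_hotpluggable(0) is false and the remaining CPUs report true, so a torture_offline()-style caller would skip CPU 0 rather than log repeated -EBUSY failures. With the "!" left in place, the result would be inverted: only the housekeeping CPU would be reported as hotpluggable.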