Hi, I also reappear the same phenomenon in RISC-V: [ 120.156380] scftorture: --- End of test: LOCK_HOTPLUG So I guess it is not the arch's responsibility. I am very interested in it ;-) Thank you both for your guidance! Cheers Zhouyi On Tue, Oct 11, 2022 at 9:59 AM Zhouyi Zhou <zhouzhouyi@xxxxxxxxx> wrote: > > Thanks Michael for reviewing my patch > > On Mon, Oct 10, 2022 at 7:21 PM Michael Ellerman <mpe@xxxxxxxxxxxxxx> wrote: > > > > Zhouyi Zhou <zhouzhouyi@xxxxxxxxx> writes: > > > I think we should avoid torture offline the cpu who do tick timer > > > when nohz full is running. > > > > Can you tell us what the bug you're fixing is? > > > > Did you see a crash/oops/hang etc? Or are you just proposing this as > > something that would be a good idea? > Sorry for the trouble and inconvenience that I bring to the community. > I haven't made myself clear in my patch. > The ins and outs are as follows: > 1) cd linux-next > 2) ./tools/testing/selftests/rcutorture/bin/torture.sh > after 19 hours ;-) > 3) tail ./tools/testing/selftests/rcutorture/res/2022.09.30-01.06.22-torture/results-scftorture/NOPREEMPT/console.log > > [ 121.449268][ T57] scftorture: scf_invoked_count VER: 2415215 > resched: 697463 single: 619512/619760 single_ofl: 255751/256554 > single_rpc: 620692 single_rpc_ofl: 0 many: 155476/154658 all: > 77282/76988 onoff: 3/3:5/6 18,25:9,28 63:93 (HZ=100) ste: 0 stnmie: 0 > stnmoe: 0 staf: 0 > [ 121.454485][ T57] scftorture: --- End of test: LOCK_HOTPLUG: > verbose=1 holdoff=10 longwait=0 nthreads=4 onoff_holdoff=30 > onoff_interval=1000 shutdown_secs=1 stat_interval=15 stutter=5 > use_cpus_read_lock=0, weight_resched=-1, weight_single=-1, > weight_single_rpc=-1, weight_single_wait=-1, weight_many=-1, > weight_many_wait=-1, weight_all=-1, weight_all_wait=-1 > [ 121.469305][ T57] reboot: Power down > > I see "End of test: LOCK_HOTPLUG", which means the function > torture_offline in kernel torture.c failed to bring down the cpu. > 4) Then I chase the reason down to tick_nohz_cpu_down: > if (tick_nohz_full_running && tick_do_timer_cpu == cpu) > return -EBUSY; > 5) I create above patch > > > > > Tested on PPC VM of Open Source Lab of Oregon State University. > > > The test results show that after the fix, the success rate of > > > rcutorture is improved. > > > After: > > > Successes: 40 Failures: 9 > > > Before: > > > Successes: 38 Failures: 11 > > > > > > I examined the console.log and Make.out files one by one, no new > > > compile error or test error is introduced by above fix. > > > > > > Signed-off-by: Zhouyi Zhou <zhouzhouyi@xxxxxxxxx> > > > --- > > > Dear PPC developers > > > > > > I found this bug when trying to do rcutorture tests in ppc VM of > > > Open Source Lab of Oregon State University: > > > > > > ubuntu@ubuntu:~/linux-next/tools/testing/selftests/rcutorture/res/2022.09.30-01.06.22-torture$ find . -name "console.log.diags"|xargs grep HOTPLUG > > > ./results-scftorture/NOPREEMPT/console.log.diags:WARNING: HOTPLUG FAILURES NOPREEMPT > > > ./results-rcutorture/TASKS03/console.log.diags:WARNING: HOTPLUG FAILURES TASKS03 > > > ./results-rcutorture/TREE04/console.log.diags:WARNING: HOTPLUG FAILURES TREE04 > > > ./results-scftorture-kasan/NOPREEMPT/console.log.diags:WARNING: HOTPLUG FAILURES NOPREEMPT > > > ./results-rcutorture-kasan/TASKS03/console.log.diags:WARNING: HOTPLUG FAILURES TASKS03 > > > ./results-rcutorture-kasan/TREE04/console.log.diags:WARNING: HOTPLUG FAILURES TREE04 > > > > > > I tried to fix this bug. > > > > > > Thanks for your patience and guidance ;-) > > > > > > Thanks > > > Zhouyi > > > -- > > > arch/powerpc/kernel/sysfs.c | 8 +++++++- > > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > > > diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c > > > index ef9a61718940..be9c0e45337e 100644 > > > --- a/arch/powerpc/kernel/sysfs.c > > > +++ b/arch/powerpc/kernel/sysfs.c > > > @@ -4,6 +4,7 @@ > > > #include <linux/smp.h> > > > #include <linux/percpu.h> > > > #include <linux/init.h> > > > +#include <linux/tick.h> > > > #include <linux/sched.h> > > > #include <linux/export.h> > > > #include <linux/nodemask.h> > > > @@ -21,6 +22,7 @@ > > > #include <asm/firmware.h> > > > #include <asm/idle.h> > > > #include <asm/svm.h> > > > +#include "../../../kernel/time/tick-internal.h" > > > > Needing to include this internal header is a sign that we are using the > > wrong API or otherwise using time keeping internals we shouldn't be. > Yes, when I do this, I guess there is something wrong in my patch. > > > > > #include "cacheinfo.h" > > > #include "setup.h" > > > @@ -1151,7 +1153,11 @@ static int __init topology_init(void) > > > * CPU. For instance, the boot cpu might never be valid > > > * for hotplugging. > > > */ > > > - if (smp_ops && smp_ops->cpu_offline_self) > > > + if (smp_ops && smp_ops->cpu_offline_self > > > +#ifdef CONFIG_NO_HZ_FULL > > > + && !(tick_nohz_full_running && tick_do_timer_cpu == cpu) > > > +#endif > > > + ) > > > > I can't see any other arches doing anything like this. I don't think > > it's the arches responsibility. > Agree! > > X86 seems to disable CPU0's hotplug by default, while > tick_do_timer_cpu has a default value 0. > > 42 #ifdef CONFIG_BOOTPARAM_HOTPLUG_CPU0 > 43 static int cpu0_hotpluggable = 1; > 44 #else > 45 static int cpu0_hotpluggable; > 46 static int __init enable_cpu0_hotplug(char *str) > 47 { > 48 cpu0_hotpluggable = 1; > 49 return 1; > 50 } > 51 > 52 __setup("cpu0_hotplug", enable_cpu0_hotplug); > 53 #endif > > I need more time to make clear the relationship of X86's > cpu0_hotpluggable and tick_do_timer_cpu, but > I also intend to think it's time keeping the mechanism's responsibility. > > > > > > If the time keeping core needs a CPU to stay online to run the timer > > then it needs to organise that itself IMHO :) > > Um, I am going to submit a patch to time keeping community sometime > next month ;-) > > Thanks again > Cheers > Zhouyi > > > > cheers > > > > > c->hotpluggable = 1; > > > #endif > > > > > > -- > > > 2.25.1