Thanks Michael for reviewing my patch On Mon, Oct 10, 2022 at 7:21 PM Michael Ellerman <mpe@xxxxxxxxxxxxxx> wrote: > > Zhouyi Zhou <zhouzhouyi@xxxxxxxxx> writes: > > I think we should avoid torture offline the cpu who do tick timer > > when nohz full is running. > > Can you tell us what the bug you're fixing is? > > Did you see a crash/oops/hang etc? Or are you just proposing this as > something that would be a good idea? Sorry for the trouble and inconvenience that I bring to the community. I haven't made myself clear in my patch. The ins and outs are as follows: 1) cd linux-next 2) ./tools/testing/selftests/rcutorture/bin/torture.sh after 19 hours ;-) 3) tail ./tools/testing/selftests/rcutorture/res/2022.09.30-01.06.22-torture/results-scftorture/NOPREEMPT/console.log [ 121.449268][ T57] scftorture: scf_invoked_count VER: 2415215 resched: 697463 single: 619512/619760 single_ofl: 255751/256554 single_rpc: 620692 single_rpc_ofl: 0 many: 155476/154658 all: 77282/76988 onoff: 3/3:5/6 18,25:9,28 63:93 (HZ=100) ste: 0 stnmie: 0 stnmoe: 0 staf: 0 [ 121.454485][ T57] scftorture: --- End of test: LOCK_HOTPLUG: verbose=1 holdoff=10 longwait=0 nthreads=4 onoff_holdoff=30 onoff_interval=1000 shutdown_secs=1 stat_interval=15 stutter=5 use_cpus_read_lock=0, weight_resched=-1, weight_single=-1, weight_single_rpc=-1, weight_single_wait=-1, weight_many=-1, weight_many_wait=-1, weight_all=-1, weight_all_wait=-1 [ 121.469305][ T57] reboot: Power down I see "End of test: LOCK_HOTPLUG", which means the function torture_offline in kernel torture.c failed to bring down the cpu. 4) Then I chase the reason down to tick_nohz_cpu_down: if (tick_nohz_full_running && tick_do_timer_cpu == cpu) return -EBUSY; 5) I create above patch > > > Tested on PPC VM of Open Source Lab of Oregon State University. > > The test results show that after the fix, the success rate of > > rcutorture is improved. > > After: > > Successes: 40 Failures: 9 > > Before: > > Successes: 38 Failures: 11 > > > > I examined the console.log and Make.out files one by one, no new > > compile error or test error is introduced by above fix. > > > > Signed-off-by: Zhouyi Zhou <zhouzhouyi@xxxxxxxxx> > > --- > > Dear PPC developers > > > > I found this bug when trying to do rcutorture tests in ppc VM of > > Open Source Lab of Oregon State University: > > > > ubuntu@ubuntu:~/linux-next/tools/testing/selftests/rcutorture/res/2022.09.30-01.06.22-torture$ find . -name "console.log.diags"|xargs grep HOTPLUG > > ./results-scftorture/NOPREEMPT/console.log.diags:WARNING: HOTPLUG FAILURES NOPREEMPT > > ./results-rcutorture/TASKS03/console.log.diags:WARNING: HOTPLUG FAILURES TASKS03 > > ./results-rcutorture/TREE04/console.log.diags:WARNING: HOTPLUG FAILURES TREE04 > > ./results-scftorture-kasan/NOPREEMPT/console.log.diags:WARNING: HOTPLUG FAILURES NOPREEMPT > > ./results-rcutorture-kasan/TASKS03/console.log.diags:WARNING: HOTPLUG FAILURES TASKS03 > > ./results-rcutorture-kasan/TREE04/console.log.diags:WARNING: HOTPLUG FAILURES TREE04 > > > > I tried to fix this bug. > > > > Thanks for your patience and guidance ;-) > > > > Thanks > > Zhouyi > > -- > > arch/powerpc/kernel/sysfs.c | 8 +++++++- > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c > > index ef9a61718940..be9c0e45337e 100644 > > --- a/arch/powerpc/kernel/sysfs.c > > +++ b/arch/powerpc/kernel/sysfs.c > > @@ -4,6 +4,7 @@ > > #include <linux/smp.h> > > #include <linux/percpu.h> > > #include <linux/init.h> > > +#include <linux/tick.h> > > #include <linux/sched.h> > > #include <linux/export.h> > > #include <linux/nodemask.h> > > @@ -21,6 +22,7 @@ > > #include <asm/firmware.h> > > #include <asm/idle.h> > > #include <asm/svm.h> > > +#include "../../../kernel/time/tick-internal.h" > > Needing to include this internal header is a sign that we are using the > wrong API or otherwise using time keeping internals we shouldn't be. Yes, when I do this, I guess there is something wrong in my patch. > > > #include "cacheinfo.h" > > #include "setup.h" > > @@ -1151,7 +1153,11 @@ static int __init topology_init(void) > > * CPU. For instance, the boot cpu might never be valid > > * for hotplugging. > > */ > > - if (smp_ops && smp_ops->cpu_offline_self) > > + if (smp_ops && smp_ops->cpu_offline_self > > +#ifdef CONFIG_NO_HZ_FULL > > + && !(tick_nohz_full_running && tick_do_timer_cpu == cpu) > > +#endif > > + ) > > I can't see any other arches doing anything like this. I don't think > it's the arches responsibility. Agree! X86 seems to disable CPU0's hotplug by default, while tick_do_timer_cpu has a default value 0. 42 #ifdef CONFIG_BOOTPARAM_HOTPLUG_CPU0 43 static int cpu0_hotpluggable = 1; 44 #else 45 static int cpu0_hotpluggable; 46 static int __init enable_cpu0_hotplug(char *str) 47 { 48 cpu0_hotpluggable = 1; 49 return 1; 50 } 51 52 __setup("cpu0_hotplug", enable_cpu0_hotplug); 53 #endif I need more time to make clear the relationship of X86's cpu0_hotpluggable and tick_do_timer_cpu, but I also intend to think it's time keeping the mechanism's responsibility. > > If the time keeping core needs a CPU to stay online to run the timer > then it needs to organise that itself IMHO :) Um, I am going to submit a patch to time keeping community sometime next month ;-) Thanks again Cheers Zhouyi > > cheers > > > c->hotpluggable = 1; > > #endif > > > > -- > > 2.25.1