Hi,

On Thursday, January 7, 2021 5:19 PM, Ran Wang wrote:
>
> During a CPU un-plug stress test, smpboot_park_threads() is called to park
> the kernel threads (including ksoftirqd) on that CPU core, and
> wait_task_inactive() then yields to those queued task(s) by calling
> schedule_hrtimeout() with mode HRTIMER_MODE_REL.
>
> stack trace:
> ...
> smpboot_thread_fn
> cpuhp_thread_fun
> cpuhp_invoke_callback
> smpboot_park_threads
> smpboot_park_thread: ksoftirqd/1
> kthread_park
> wait_task_inactive
> schedule_hrtimeout
>
> However, when PREEMPT_RT is set, this would cause the wait to stay pending,
> since schedule_hrtimeout() depends on the ksoftirqd thread to complete the
> related work when the timer uses HRTIMER_MODE_SOFT. So force
> HRTIMER_MODE_HARD in that case.

This issue was observed on an LX2160ARDB (arm64, 16 A72 cores) with PREEMPT_RT
selected; a non-RT kernel works fine. I could verify the fix on both
linux-5.6.y-rt and linux-5.4.y-rt. On linux-5.9.y-rt and linux-5.10.y-rt,
there appear to be other issues that currently block verification.

Steps to reproduce the issue:

1. Kernel menuconfig:

    CONFIG_QORIQ_CPUFREQ=y
    CONFIG_HAVE_PREEMPT_LAZY=y
    CONFIG_PREEMPT_LAZY=y
    # CONFIG_PREEMPT_NONE is not set
    # CONFIG_PREEMPT_VOLUNTARY is not set
    # CONFIG_PREEMPT is not set
    CONFIG_PREEMPT_RT=y
    CONFIG_PREEMPT_COUNT=y
    CONFIG_PREEMPTION=y

2. Shell commands (the issue happens within roughly 400 rounds of the loop below):

    echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    echo ondemand > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
    echo ondemand > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
    echo ondemand > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
    echo ondemand > /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
    echo ondemand > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
    echo ondemand > /sys/devices/system/cpu/cpu6/cpufreq/scaling_governor
    echo ondemand > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
    echo ondemand > /sys/devices/system/cpu/cpu8/cpufreq/scaling_governor
    echo ondemand > /sys/devices/system/cpu/cpu9/cpufreq/scaling_governor
    echo ondemand > /sys/devices/system/cpu/cpu10/cpufreq/scaling_governor
    echo ondemand > /sys/devices/system/cpu/cpu11/cpufreq/scaling_governor
    echo ondemand > /sys/devices/system/cpu/cpu12/cpufreq/scaling_governor
    echo ondemand > /sys/devices/system/cpu/cpu13/cpufreq/scaling_governor
    echo ondemand > /sys/devices/system/cpu/cpu14/cpufreq/scaling_governor
    echo ondemand > /sys/devices/system/cpu/cpu15/cpufreq/scaling_governor

    count=1
    while [ $? -eq 0 ]
    do
        echo "$count th test"
        sleep 3
        let "count=count+1"
        echo 0 > /sys/devices/system/cpu/cpu0/online
        echo 0 > /sys/devices/system/cpu/cpu1/online
        echo 0 > /sys/devices/system/cpu/cpu2/online
        echo 0 > /sys/devices/system/cpu/cpu3/online
        echo 0 > /sys/devices/system/cpu/cpu4/online
        echo 0 > /sys/devices/system/cpu/cpu5/online
        echo 0 > /sys/devices/system/cpu/cpu6/online
        echo 0 > /sys/devices/system/cpu/cpu7/online
        echo 0 > /sys/devices/system/cpu/cpu8/online
        echo 0 > /sys/devices/system/cpu/cpu9/online
        echo 0 > /sys/devices/system/cpu/cpu10/online
        echo 0 > /sys/devices/system/cpu/cpu11/online
        echo 0 > /sys/devices/system/cpu/cpu12/online
        echo 0 > /sys/devices/system/cpu/cpu13/online
        echo 0 > /sys/devices/system/cpu/cpu14/online
        echo 1 > /sys/devices/system/cpu/cpu0/online
        echo 1 > /sys/devices/system/cpu/cpu1/online
        echo 1 > /sys/devices/system/cpu/cpu2/online
        echo 1 > /sys/devices/system/cpu/cpu3/online
        echo 1 > /sys/devices/system/cpu/cpu4/online
        echo 1 > /sys/devices/system/cpu/cpu5/online
        echo 1 > /sys/devices/system/cpu/cpu6/online
        echo 1 > /sys/devices/system/cpu/cpu7/online
        echo 1 > /sys/devices/system/cpu/cpu8/online
        echo 1 > /sys/devices/system/cpu/cpu9/online
        echo 1 > /sys/devices/system/cpu/cpu10/online
        echo 1 > /sys/devices/system/cpu/cpu11/online
        echo 1 > /sys/devices/system/cpu/cpu12/online
        echo 1 > /sys/devices/system/cpu/cpu13/online
        echo 1 > /sys/devices/system/cpu/cpu14/online
    done

To be honest, I am not sure how a non-RT kernel avoids this issue. Could
anybody give some input or suggestions on this?

Thank you.

Regards,
Ran

> Suggested-by: Jiafei Pan <jiafei.pan@xxxxxxx>
> Signed-off-by: Ran Wang <ran.wang_1@xxxxxxx>
> ---
>  kernel/sched/core.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 792da55..4cc742a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2054,10 +2054,15 @@ unsigned long wait_task_inactive(struct task_struct *p, long match_state)
>  			ktime_t to = NSEC_PER_SEC / HZ;
>
>  			set_current_state(TASK_UNINTERRUPTIBLE);
> -			schedule_hrtimeout(&to, HRTIMER_MODE_REL);
> +
> +			if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
> +			    !strncmp(p->comm, "ksoftirqd/", 10))
> +				schedule_hrtimeout(&to,
> +					HRTIMER_MODE_REL | HRTIMER_MODE_HARD);
> +			else
> +				schedule_hrtimeout(&to, HRTIMER_MODE_REL);
>  			continue;
>  		}
> -
>  		/*
>  		 * Ahh, all good. It wasn't running, and it wasn't
>  		 * runnable, which means that it will never become
> --
> 2.7.4
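
For anyone who wants to rerun the reproduction steps above, they could be condensed into loops. This is only a minimal sketch, not the script that was actually used: the function names (set_governors, set_online, run_stress) and the CPU_ROOT override are hypothetical, introduced here for illustration, and unlike the original loop the number of rounds is bounded (default 400, matching the "roughly 400 rounds" figure). As in the original, cpu0..cpu15 get the ondemand governor while only cpu0..cpu14 are toggled.

```shell
#!/bin/sh
# Hypothetical condensed form of the reproduction steps. All names below are
# illustrative assumptions and do not come from the original report.
# CPU_ROOT can be pointed at a scratch directory for a dry run.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# Select the ondemand governor on cpu0..cpu15.
set_governors() {
    i=0
    while [ "$i" -le 15 ]; do
        echo ondemand > "$CPU_ROOT/cpu$i/cpufreq/scaling_governor" || return 1
        i=$((i + 1))
    done
}

# Write $1 (0 = offline, 1 = online) to cpu0..cpu14.
# cpu15 is left untouched, as in the original loop.
set_online() {
    i=0
    while [ "$i" -le 14 ]; do
        echo "$1" > "$CPU_ROOT/cpu$i/online" || return 1
        i=$((i + 1))
    done
}

# Run up to $1 rounds (default 400), sleeping $2 seconds (default 3) before
# each round. Returns non-zero as soon as a write fails.
run_stress() {
    max_rounds="${1:-400}"
    delay="${2:-3}"
    count=1
    while [ "$count" -le "$max_rounds" ]; do
        echo "$count th test"
        sleep "$delay"
        set_online 0 && set_online 1 || return 1
        count=$((count + 1))
    done
}
```

On the target board this would be invoked as `set_governors && run_stress`. Note that when the issue triggers, the write to the `online` file is expected to block rather than fail, so the last "N th test" line printed indicates the round in which it occurred.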