Hi Sebastian, Sorry for the late response due to bandwidth issue. Please see my reply inline. On Thursday, February 25, 2021 12:38 AM, Sebastian Siewior wrote: > > On 2021-02-23 08:43:08 [+0530], Viresh Kumar wrote: > > On 23-02-21, 02:26, Ran Wang wrote: > > > On Monday, February 22, 2021 10:01 PM, Sebastian Siewior wrote: > > > > On 2021-02-19 16:44:20 [+0800], Ran Wang wrote: > > > > > When selecting PREEMPT_RT, cpufreq_driver->stop_cpu(policy) > > > > > might got stuck due to irq_work_sync() pending for work on > > > > > lazy_list. That’s because lazy_list may have no chance to be > > > > > served in softirq context sometimes. Below is one of scenarios that was captured: > > > > > > > > > > ... > > > > > ret_from_fork > > > > > kthread > > > > > smpboot_thread_fn > > > > > cpuhp_thread_fun > > > > > cpuhp_invoke_callback: state: 193 > > > > > cpuhp_cpufreq_online > > > > > cpufreq_online > > > > > cpufreq_driver->stop_cpu(policy); > > > > > cpufreq_dbs_governor_stop > > > > > sugov_stop // kernel/sched/cpufreq_schedutil.c > > > > > irq_work_sync(&sg_policy->irq_work); > > > > > > > > > > This is observed on LX2160ARDB (16 A72 cores) with cpufreq > > > > > governor of ‘schedutil’ or ‘ondemand’. > > > > > > > > While staring at it, why do we invoke schedule_work_on() and > > > > kthread_queue_work() from inside irq_work() instead invoking it directly? It raises an interrupt in which it kicks a user thread. > > > > Couldn't we do it without irq_work? > > > > Because we reach there from scheduler's context, which must be > > hard-irq context.. > > Oha. Too bad. > Ran, any chance to figure out why the softirq can not be served? > ksoftirqd and the timer should be up. Further debugging shows that sometimes the nearest coming local timer is expected to expire after a long time (100+ seconds). That’s why function run_local_timers() gets no chance to call raise_softirq(TIMER_SOFTIRQ) to handle enqueued irq_work for a long timer, causing irq_work_sync() pending when doing CPU hotplug operation. Below is one of trace I captured, showing that sometimes [timeout=xxx] value would be much bigger than others (see [timeout=30070]). And we can see on this particular example (CPU2), there would be 100+ seconds idle between 2 'timer_expire_entry' (see timestamp of 613.43867) ksoftirqd/2-26 [002] d..s.13 35.046622: timer_expire_entry: timer=000000008f7d5140 function=delayed_work_timer_fn now=4294901034 baseclk=4294900752 ksoftirqd/2-26 [002] dnLs.13 35.046634: timer_expire_exit: timer=000000008f7d5140 kworker/2:1-106 [002] d...1.. 38.942670: timer_start: timer=000000008f7d5140 function=delayed_work_timer_fn expires=4294902244 [timeout=236] cpu=2 idx=125 flags=D|I ksoftirqd/2-26 [002] d..s.13 316.486673: timer_expire_entry: timer=000000008f7d5140 function=delayed_work_timer_fn now=4294971394 baseclk=4294902248 ksoftirqd/2-26 [002] d..s.13 316.486688: timer_expire_exit: timer=000000008f7d5140 ksoftirqd/2-26 [002] ...s.13 316.486694: timer_expire_entry: timer=00000000b44e5013 function=idle_worker_timeout now=4294971394 baseclk=4294971392 ksoftirqd/2-26 [002] .nLs.13 316.486707: timer_expire_exit: timer=00000000b44e5013 kworker/u16:0-7 [002] d...1.. 492.606742: timer_start: timer=000000003fe6a257 function=delayed_work_timer_fn expires=4295045494 [timeout=30070] cpu=2 idx=217 flags=I kworker/u16:0-7 [002] d...1.. 492.606756: timer_start: timer=000000003b28e768 function=delayed_work_timer_fn expires=4295022994 [timeout=7570] cpu=2 idx=237 flags=D|I kworker/u16:0-7 [002] d...113 492.606766: timer_start: timer=0000000062a01a47 function=delayed_work_timer_fn expires=4295019174 [timeout=3750] cpu=2 idx=171 flags=D|I kworker/u16:0-7 [002] d...113 492.606776: timer_start: timer=000000000bbd8bb1 function=delayed_work_timer_fn expires=4295019174 [timeout=3750] cpu=2 idx=171 flags=D|I kworker/u16:0-7 [002] d...1.. 492.606860: timer_start: timer=00000000f1f3a9cb function=delayed_work_timer_fn expires=4295015455 [timeout=31] cpu=2 idx=32 flags=D|I ksoftirqd/2-26 [002] d..s.13 613.438671: timer_expire_entry: timer=000000003fe6a257 function=delayed_work_timer_fn now=4295045632 baseclk=4295045632 ksoftirqd/2-26 [002] d..s.13 613.438695: timer_expire_exit: timer=000000003fe6a257 ksoftirqd/2-26 [002] d..s.13 613.438703: timer_expire_entry: timer=00000000f1f3a9cb function=delayed_work_timer_fn now=4295045632 baseclk=4295015456 ksoftirqd/2-26 [002] d..s.13 613.438717: timer_expire_exit: timer=00000000f1f3a9cb ksoftirqd/2-26 [002] d..s.13 613.438723: timer_expire_entry: timer=000000000bbd8bb1 function=delayed_work_timer_fn now=4295045632 baseclk=4295019200 ksoftirqd/2-26 [002] d..s.13 613.438741: timer_expire_exit: timer=000000000bbd8bb1 ksoftirqd/2-26 [002] d..s.13 613.438745: timer_expire_entry: timer=0000000062a01a47 function=delayed_work_timer_fn now=4295045632 baseclk=4295019200 ksoftirqd/2-26 [002] d..s.13 613.438756: timer_expire_exit: timer=0000000062a01a47 ksoftirqd/2-26 [002] d..s.13 613.438762: timer_expire_entry: timer=000000003b28e768 function=delayed_work_timer_fn now=4295045632 baseclk=4295023104 > Any numbers on how often that is irq_work is scheduled? I add some prints in function run_timer_softirq(), and collect some worst cases (CPU 5 and CPU3) as below, looks some irq_work might have to wait for long time to get served in such case: [ 13.040774] [005]:run_timer_softirq():call irq_work_tick_soft(); [ 17.078788] [005]:run_timer_softirq():call irq_work_tick_soft(); [ 19.222878] [005]:run_timer_softirq():call irq_work_tick_soft(); [ 49.608798] [005]:run_timer_softirq():call irq_work_tick_soft(); [ 56.564773] [005]:run_timer_softirq():call irq_work_tick_soft(); [ 124.368772] [005]:run_timer_softirq():call irq_work_tick_soft(); [ 124.880821] [005]:run_timer_softirq():call irq_work_tick_soft(); [ 125.392842] [005]:run_timer_softirq():call irq_work_tick_soft(); [ 129.556828] [005]:run_timer_softirq():call irq_work_tick_soft(); [ 132.336839] [005]:run_timer_softirq():call irq_work_tick_soft(); [ 253.168839] [005]:run_timer_softirq():call irq_work_tick_soft(); [ 376.048841] [005]:run_timer_softirq():call irq_work_tick_soft(); [ 13.072774] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 13.232774] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 14.064773] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 15.056773] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 16.080773] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 17.072772] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 17.840774] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 18.352775] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 18.864774] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 19.208796] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 19.376772] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 19.888776] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 20.118792] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 82.992773] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 125.072801] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 126.064772] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 127.056796] [003]:run_timer_softirq():call irq_work_tick_soft(); [ 128.976773] [003]:run_timer_softirq():call irq_work_tick_soft(); Regards, Ran