On 2016-08-19 17:29:16 [+0300], Grygorii Strashko wrote: > I've collected trace before first occurrence of "NOHZ: local_softirq_pending 80" > > irq/354-4848400-85 [000] 90.639460: irq_handler_entry: irq=19 name=arch_timer > iperf-1284 [001] 90.639474: softirq_raise: vec=1 [action=TIMER] > iperf-1284 [001] 90.639486: irq_handler_exit: irq=19 ret=handled > irq/354-4848400-85 [000] 90.639488: softirq_raise: vec=7 [action=SCHED] here we raise the SCHED softirq > iperf-1284 [001] 90.639490: sched_waking: comm=ksoftirqd/1 pid=21 prio=120 target_cpu=001 > irq/354-4848400-85 [000] 90.639492: softirq_raise: vec=1 [action=TIMER] and a timer > iperf-1284 [001] 90.639499: sched_wakeup: ksoftirqd/1:21 [120] success=1 CPU:001 > iperf-1284 [001] 90.639504: sched_waking: comm=ktimersoftd/1 pid=20 prio=98 target_cpu=001 > irq/354-4848400-85 [000] 90.639505: irq_handler_exit: irq=19 ret=handled okay. so softirq here I come > iperf-1284 [001] 90.639512: sched_wakeup: ktimersoftd/1:20 [98] success=1 CPU:001 > iperf-1284 [001] 90.639526: sched_stat_runtime: comm=iperf pid=1284 runtime=50752 [ns] vruntime=2105322850 [ns] > iperf-1284 [001] 90.639537: sched_switch: iperf:1284 [120] R ==> irq/355-4848400:86 [49] > irq/355-4848400-86 [001] 90.639545: softirq_raise: vec=3 [action=NET_RX] > irq/355-4848400-86 [001] 90.639556: softirq_entry: vec=3 [action=NET_RX] > irq/355-4848400-86 [001] 90.639589: softirq_exit: vec=3 [action=NET_RX] > irq/355-4848400-86 [001] 90.639614: sched_switch: irq/355-4848400:86 [49] S ==> ktimersoftd/1:20 [98] > ktimersoftd/1-20 [001] 90.639625: softirq_entry: vec=1 [action=TIMER] > ktimersoftd/1-20 [001] 90.639637: sched_waking: comm=rcu_preempt pid=8 prio=98 target_cpu=001 > ktimersoftd/1-20 [001] 90.639646: sched_wakeup: rcu_preempt:8 [98] success=1 CPU:001 > ktimersoftd/1-20 [001] 90.639663: softirq_exit: vec=1 [action=TIMER] > ktimersoftd/1-20 [001] 90.639679: sched_switch: ktimersoftd/1:20 [98] S ==> rcu_preempt:8 [98] > rcu_preempt-8 [001] 90.639722: sched_switch: rcu_preempt:8 [98] S ==> ksoftirqd/1:21 [120] > ksoftirqd/1-21 [001] 90.639740: sched_stat_runtime: comm=ksoftirqd/1 pid=21 runtime=25539 [ns] vruntime=29960463828 [ns] > ksoftirqd/1-21 [001] 90.639750: sched_switch: ksoftirqd/1:21 [120] S ==> iperf:1284 [120] > irq/354-4848400-85 [000] 90.639878: irq_handler_entry: irq=355 name=48484000.ethernet wait, no [ lots NET_RX wake ups ] > irq/354-4848400-85 [000] 90.640560: softirq_exit: vec=3 [action=NET_RX] ach. NET_RX > irq/354-4848400-85 [000] 90.640568: softirq_raise: vec=3 [action=NET_RX] > irq/354-4848400-85 [000] 90.640579: softirq_entry: vec=3 [action=NET_RX] > irq/354-4848400-85 [000] 90.640606: irq_handler_entry: irq=355 name=48484000.ethernet > irq/354-4848400-85 [000] 90.640608: irq_handler_exit: irq=355 ret=handled [ more of those ] > irq/354-4848400-85 [000] 90.642393: softirq_exit: vec=3 [action=NET_RX] > irq/354-4848400-85 [000] 90.642419: sched_switch: irq/354-4848400:85 [49] S ==> rcuc/0:11 [98] We don't serve TIMER & SCHED because those two are pushed to the ksoftirq thread(s). So we keep mostly doing NET_RX and now we switch to the next best thing which is RCU. > rcuc/0-11 [000] 90.642430: irq_handler_entry: irq=354 name=48484000.ethernet but not for long. > rcuc/0-11 [000] 90.642432: irq_handler_exit: irq=354 ret=handled > rcuc/0-11 [000] 90.642435: sched_waking: comm=irq/354-4848400 pid=85 prio=49 target_cpu=000 > rcuc/0-11 [000] 90.642442: sched_migrate_task: comm=irq/354-4848400 pid=85 prio=49 orig_cpu=0 dest_cpu=1 > rcuc/0-11 [000] 90.642453: sched_wakeup: irq/354-4848400:85 [49] success=1 CPU:001 > iperf-1284 [001] 90.642462: sched_stat_runtime: comm=iperf pid=1284 runtime=113053 [ns] vruntime=2106997666 [ns] > rcuc/0-11 [000] 90.642464: irq_handler_entry: irq=355 name=48484000.ethernet > rcuc/0-11 [000] 90.642466: irq_handler_exit: irq=355 ret=handled > rcuc/0-11 [000] 90.642469: sched_waking: comm=irq/355-4848400 pid=86 prio=49 target_cpu=001 > iperf-1284 [001] 90.642473: sched_switch: iperf:1284 [120] R ==> irq/354-4848400:85 [49] > irq/354-4848400-85 [001] 90.642481: softirq_raise: vec=3 [action=NET_RX] > rcuc/0-11 [000] 90.642483: sched_wakeup: irq/355-4848400:86 [49] success=1 CPU:001 > irq/354-4848400-85 [001] 90.642493: softirq_entry: vec=3 [action=NET_RX] > rcuc/0-11 [000] 90.642497: sched_migrate_task: comm=irq/355-4848400 pid=86 prio=49 orig_cpu=1 dest_cpu=0 ach that IRQ thread no pinned. Good. We migrate. > rcuc/0-11 [000] 90.642515: irq_handler_entry: irq=354 name=48484000.ethernet > rcuc/0-11 [000] 90.642516: irq_handler_exit: irq=354 ret=handled > rcuc/0-11 [000] 90.642533: sched_switch: rcuc/0:11 [98] R ==> irq/355-4848400:86 [49] > irq/355-4848400-86 [000] 90.642541: softirq_raise: vec=3 [action=NET_RX] > irq/355-4848400-86 [000] 90.642551: softirq_entry: vec=3 [action=NET_RX] > irq/355-4848400-86 [000] 90.642562: sched_pi_setprio: comm=iperf pid=1284 oldprio=120 newprio=49 > irq/355-4848400-86 [000] 90.642593: sched_switch: irq/355-4848400:86 [49] D ==> rcuc/0:11 [98] > rcuc/0-11 [000] 90.642621: sched_switch: rcuc/0:11 [98] S ==> ktimersoftd/0:4 [98] > ktimersoftd/0-4 [000] 90.642631: softirq_entry: vec=1 [action=TIMER] we did wait quite some time for that one. > ktimersoftd/0-4 [000] 90.642647: softirq_exit: vec=1 [action=TIMER] > ktimersoftd/0-4 [000] 90.642665: sched_switch: ktimersoftd/0:4 [98] S ==> ksoftirqd/0:3 [120] switch back ksoftirqd where SCHED softirq might run. > ksoftirqd/0-3 [000] 90.642692: sched_stat_runtime: comm=ksoftirqd/0 pid=3 runtime=34486 [ns] vruntime=33313887612 [ns] > ksoftirqd/0-3 [000] 90.642703: sched_switch: ksoftirqd/0:3 [120] D ==> trace-cmd:1278 [120] but we blocked on something in softirq and go back to trace-cmd > trace-cmd-1278 [000] 90.642742: sched_waking: comm=trace-cmd pid=1279 prio=120 target_cpu=001 > trace-cmd-1278 [000] 90.642755: sched_wakeup: trace-cmd:1279 [120] success=1 CPU:001 > irq/354-4848400-85 [001] 90.643012: softirq_exit: vec=3 [action=NET_RX] > trace-cmd-1278 [000] 90.643020: sched_stat_runtime: comm=trace-cmd pid=1278 runtime=328098 [ns] vruntime=554552351 [ns] > irq/354-4848400-85 [001] 90.643021: softirq_raise: vec=3 [action=NET_RX] > irq/354-4848400-85 [001] 90.643031: softirq_entry: vec=3 [action=NET_RX] > trace-cmd-1278 [000] 90.643036: sched_switch: trace-cmd:1278 [120] D ==> iperf:1282 [120] trace-cmd is going, back to iperf. > iperf-1282 [000] 90.643055: sched_stat_runtime: comm=iperf pid=1282 runtime=34811 [ns] vruntime=560242796 [ns] > iperf-1282 [000] 90.643068: sched_switch: iperf:1282 [120] D ==> swapper/0:0 [120] iperf is blocked on something, too. > <idle>-0 [000] 90.643079: bputs: 0xc003e7d4s: 58f67c094e488 > <idle>-0 [000] 90.643090: bprint: softirq_check_pending_idle: NOHZ: 80 As you see ksoftirqd left the CPU with a D so I would assume it is blocked on a lock and waits. NET_RX is in progress but scheduled out due to RCUC which is also scheduled out. I assume we got to softirq because nothing else can run. It will see that NET_RX is pending and tries it but blocks on the lock (lock_softirq()). It schedules out. Nothing left -> idle. The idle code checks to see if a softirq is pending and in fact there is SCHED on the list and ksoftirq was about to handle it but due to ordering complication (NET_RX before SCHED) it can't. And we have the warning. This --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -105,6 +105,7 @@ void softirq_check_pending_idle(void) { static int rate_limit; struct softirq_runner *sr = this_cpu_ptr(&softirq_runners); + struct task_struct *ksoft_tsk = __this_cpu_read(ksoftirqd); u32 warnpending; int i; @@ -112,7 +113,7 @@ void softirq_check_pending_idle(void) return; warnpending = local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK; - for (i = 0; i < NR_SOFTIRQS; i++) { + for (i = 0; (i < NR_SOFTIRQS) && warnpending; i++) { struct task_struct *tsk = sr->runner[i]; /* @@ -132,6 +133,15 @@ void softirq_check_pending_idle(void) } } + if (warnpending && ksoft_tsk) { + raw_spin_lock(&ksoft_tsk->pi_lock); + if (ksoft_tsk->pi_blocked_on || ksoft_tsk->state == TASK_RUNNING) { + /* Clear all bits pending in that task */ + warnpending &= ~(ksoft_tsk->softirqs_raised); + } + raw_spin_unlock(&ksoft_tsk->pi_lock); + } + if (warnpending) { printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n", warnpending); should avoid the warning if the softirq thread is blocked. However: - RCU boosting with RT prio might help here or there. - having the hard-IRQ and IRQ-thread on the same CPU might help, too. It is not strictly required but saves a few cycles if you don't have to perform cross CPU wake ups and migrate task forth and back. The latter happens at prio 99. - I am not sure NAPI works as expected. I would assume so. There is IRQ 354 and 355 which fire after each other. One would be enough I guess. And they seem to be short living / fire often. If NAPI works then it should put an end to it and push it to the softirq thread. If you have IRQ-pacing support I suggest to use something like 10ms or so. That means your ping response will go from <= 1ms to 10ms in the worst case but since you process more packets at a time your throughput should increase. If I count this correct, it too you alsmost 4ms from "raise SCHED" to "try process SCHED" and most of the time was spent in 35[45] hard irq, raise NET_RX or cross wakeup the IRQ thread. Sebastian -- To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html