Hi All, > We are running into a bug on a ARM926 processor with 2.6.26.8-rt16 > where the system stops scheduling any thread while the hard-irq > handlers keep on running (and waking up tasks with wake_up_process()) > What happens is that the system is idle and is continuously polling on > need_resched() in the cpu_idle() routine (arch/arm/kernel/process.c) > But, need_resched() keeps on returning false, so the system stays in > the idle loop and __schedule is never called. > > While debugging this we noticed/tried several things already: > * default_idle() wakes up properly when an interrupt occurs. > * Interrupt handling keeps on running, but no interrupt threads are > scheduled either. > * Triggering a different interrupt source sometimes get the system out > of the lockup condition. > * removing the loop with need_resched() to force the system to call > __schedule() after waking up from default_idle() has no effect, the > scheduler does not schedule any threads that are clearly runnable. See > logging below. > * preempt_schedule_irq() is not called after an irq during the lockup. > Probably it is because the check on preempt_count!=0 in __irq_svc (in > arch/arm/kernel/entry-armv.S). > * We have excluded any relation with any user application by adding > root_delay=<1-month-in-seconds> to the kernel commandline. > * We stripped down the kernel as much as possible, excluding almost > all drivers, including removing the networking layer. (but with > networking and a ping flood it seems easier to reproduce) > * Problem exists with NO_HZ, HRT as well as without it. > > We call 'print_cpu()' when a lockup occurs. We call it directly from a > hard-irq handler. > (canary_check is a kernel thread that is part of our debugging code. > It is a thread that is directly scheduled via wake_up_process() from > the system timer irq. We check from the hard-irq handler if this > thread is being scheduled, if not the system is locked up and we dump > this logging) > > [ 8931.475755] > [ 8931.475769] cpu#0 > [ 8931.479186] .nr_running : 3 > [ 8931.483809] .load : 532566 > [ 8931.488862] .nr_switches : 37438443 > [ 8931.494089] .nr_load_updates : 860943 > [ 8931.499142] .nr_uninterruptible : 1 > [ 8931.503761] .jiffies : 830943 > [ 8931.508816] .next_balance : 0.000000 > [ 8931.514043] .curr->pid : 0 > [ 8931.518664] .clock : 8931475.737362 > [ 8931.524411] .cpu_load[0] : 532566 > [ 8931.529464] .cpu_load[1] : 532420 > [ 8931.534516] .cpu_load[2] : 516298 > [ 8931.539569] .cpu_load[3] : 433653 > [ 8931.544621] .cpu_load[4] : 310689 > [ 8931.549674] .rt.rt_nr_running : 3 > [ 8931.554293] .rt.rt_nr_uninterruptible : 4294967293 > [ 8931.559694] > [ 8931.559703] cfs_rq[0]: > [ 8931.563548] .exec_clock : 0.000000 > [ 8931.568778] .MIN_vruntime : 0.000001 > [ 8931.574007] .min_vruntime : 1254.953603 > [ 8931.579496] .max_vruntime : 0.000001 > [ 8931.584724] .spread : 0.000000 > [ 8931.589952] .spread0 : 0.000000 > [ 8931.595178] .nr_running : 0 > [ 8931.599796] .load : 0 > [ 8931.604414] .nr_spread_over : 0 > [ 8931.609048] > [ 8931.609056] runnable tasks: > [ 8931.609067] task PID tree-key switches prio > exec-runtime sum-exec sum-sleep > [ 8931.609094] ---------------------------------------------------------------------------------------------------------- > [ 8931.634702] sirq-timer/0 5 6.652438 1003780 59 > 0 0 0.000000 > 0.000000 0.000000 > [ 8931.649795] canary_check 33 78.141669 860433 0 > 0 0 0.000000 > 0.000000 0.000000 > [ 8931.664887] IRQ-57 243 832.211283 13874273 59 > 0 0 0.000000 > 0.000000 0.000000 > [ 8931.764591] > Well, we found the root cause of this problem. It turned out to be caused by sched_clock() that made disjunct time jumps. This caused this check to become true in kernel/sched_rt.c:370: if (rt_rq->rt_time > runtime) { rt_rq->rt_throttled = 1; if (rt_rq_throttled(rt_rq)) { sched_rt_rq_dequeue(rt_rq); return 1; } } The end results is that all realtime tasks got throttled for a long time, and that time got extended every time sched_clock() made such a jump. I would never have expected the scheduler would show this kind of behaviour while CONFIG_RT_GROUP_SCHED is _not_ set... The root-cause of the sched_clock being faulty was a synchronisation issue between 2 clock domains. The CPU clock and the clock domain of the peripheral (GPT) on which the sched_clock() implementation was based. The GPT made jumps backwards which triggered a false wraparound detection in the conversion of 32->64 bit timestamps, causing the time to jump about 356 seconds in the future... Thinking about this, we would have found it much earlier if print_cpu() would have printed the 'throttled' state of a task as well. Besides that, it would have helped a lot if the scheduler would print a message if such time jumps happen... If there is interest I can provide a patch for that. Kind regards, Remy -- To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html