I'm running a linux-2.6.26.1 kernel with the real-time patch-2.6.26-rt1 plus most of the patches discussed on the linux-rt-users list since rt1. (except ppc patches, and not Gregory Haskins experimental stuff) The above info is mostly just full exclosure, I'm sure that this bug exists in plain linux-2.6.26 + real-time patch-2.6.26-rt1 too. BUG: using smp_processor_id() in preemptible [00000000] code: firefox-bin/4091 caller is __qdisc_run+0x160/0x1e9 Pid: 4091, comm: firefox-bin Tainted: G W 2.6.26.1-rt1.jk #4 Call Trace: [<ffffffff803468ac>] debug_smp_processor_id+0xe4/0xf4 [<ffffffff8040673f>] __qdisc_run+0x160/0x1e9 [<ffffffff803f406c>] dev_queue_xmit+0x1b3/0x2ee [<ffffffff8041abec>] ip_finish_output+0x2a6/0x2ef [<ffffffff8041ad18>] ip_output+0xe3/0xec [<ffffffff80419c54>] ip_local_out+0x25/0x29 [<ffffffff8041a4c6>] ip_queue_xmit+0x2ce/0x35e [<ffffffff8042e274>] ? __tcp_push_pending_frames+0x74a/0x860 [<ffffffff8042e274>] ? __tcp_push_pending_frames+0x74a/0x860 [<ffffffff80286265>] ? trace_preempt_on+0x1f/0x105 [<ffffffff8042b6b2>] ? tcp_transmit_skb+0x72a/0x78f [<ffffffff8042e274>] ? __tcp_push_pending_frames+0x74a/0x860 [<ffffffff8042b6d8>] tcp_transmit_skb+0x750/0x78f [<ffffffff8042e274>] __tcp_push_pending_frames+0x74a/0x860 [<ffffffff802b53b3>] ? __kmalloc_node+0x48/0x4a [<ffffffff803ee44d>] ? __alloc_skb+0x70/0x136 [<ffffffff8042e518>] tcp_send_fin+0x18e/0x19a [<ffffffff80420358>] tcp_close+0x1cc/0x413 [<ffffffff8043d1a6>] inet_release+0x55/0x5c [<ffffffff803e78af>] sock_release+0x1f/0xb2 [<ffffffff803e797b>] sock_close+0x39/0x3f [<ffffffff802bb11b>] __fput+0xca/0x18d [<ffffffff802bb1f7>] fput+0x19/0x1b [<ffffffff802b8120>] filp_close+0x6b/0x76 [<ffffffff802b81d5>] sys_close+0xaa/0xe9 [<ffffffff80226c07>] sysenter_do_call+0x8c/0x149 [<ffffffff8046ca94>] ? trace_hardirqs_on_thunk+0x3a/0x3c --------------------------- | preempt count: 00000001 ] | 1-level deep critical section nesting: ---------------------------------------- .. [<ffffffff80346859>] .... debug_smp_processor_id+0x91/0xf4 .....[<ffffffff8040673f>] .. ( <= __qdisc_run+0x160/0x1e9) BUG: firefox-bin:4091 task might have lost a preemption check! Pid: 4091, comm: firefox-bin Tainted: G W 2.6.26.1-rt1.jk #4 Call Trace: [<ffffffff804708c8>] ? sub_preempt_count+0xd1/0xe6 [<ffffffff80235c41>] preempt_enable_no_resched+0x5c/0x5e [<ffffffff803468b1>] debug_smp_processor_id+0xe9/0xf4 [<ffffffff8040673f>] __qdisc_run+0x160/0x1e9 [<ffffffff803f406c>] dev_queue_xmit+0x1b3/0x2ee [<ffffffff8041abec>] ip_finish_output+0x2a6/0x2ef [<ffffffff8041ad18>] ip_output+0xe3/0xec [<ffffffff80419c54>] ip_local_out+0x25/0x29 [<ffffffff8041a4c6>] ip_queue_xmit+0x2ce/0x35e [<ffffffff8042e274>] ? __tcp_push_pending_frames+0x74a/0x860 [<ffffffff8042e274>] ? __tcp_push_pending_frames+0x74a/0x860 [<ffffffff80286265>] ? trace_preempt_on+0x1f/0x105 [<ffffffff8042b6b2>] ? tcp_transmit_skb+0x72a/0x78f [<ffffffff8042e274>] ? __tcp_push_pending_frames+0x74a/0x860 [<ffffffff8042b6d8>] tcp_transmit_skb+0x750/0x78f [<ffffffff8042e274>] __tcp_push_pending_frames+0x74a/0x860 [<ffffffff802b53b3>] ? __kmalloc_node+0x48/0x4a [<ffffffff803ee44d>] ? __alloc_skb+0x70/0x136 [<ffffffff8042e518>] tcp_send_fin+0x18e/0x19a [<ffffffff80420358>] tcp_close+0x1cc/0x413 [<ffffffff8043d1a6>] inet_release+0x55/0x5c [<ffffffff803e78af>] sock_release+0x1f/0xb2 [<ffffffff803e797b>] sock_close+0x39/0x3f [<ffffffff802bb11b>] __fput+0xca/0x18d [<ffffffff802bb1f7>] fput+0x19/0x1b [<ffffffff802b8120>] filp_close+0x6b/0x76 [<ffffffff802b81d5>] sys_close+0xaa/0xe9 [<ffffffff80226c07>] sysenter_do_call+0x8c/0x149 [<ffffffff8046ca94>] ? trace_hardirqs_on_thunk+0x3a/0x3c --------------------------- | preempt count: 00000000 ] | 0-level deep critical section nesting: __qdisc_run() calls qdisc_restart() which calls handle_dev_cpu_collision(skb, dev, q); and then the problem shows up here: __get_cpu_var(netdev_rx_stat).cpu_collision++; The solution is to disable interrupts around the above increment. Here is an attached patch to do so. (Thank's to Peter Zijlstra for help in the analysis and dropping the answer in my lap, so if I got it right it is due to his help, but if I messed it up, then I did that part all by myself.) Unless there are objections, please apply.
Subject: fix for BUG: using smp_processor_id() in preemptible code Fixes using smp_processor_id() in preemptible code as seen when __qdisc_run calls qdisc_restart which calls handle_dev_cpu_collision This is fixed by disabling irqs (and preemption) around cpu_collision++ in handle_dev_cpu_collision Signed-off-by: John Kacur <jkacur at gmail dot com> Index: linux-2.6.26.1-rt1.jk/net/sched/sch_generic.c =================================================================== --- linux-2.6.26.1-rt1.jk.orig/net/sched/sch_generic.c +++ linux-2.6.26.1-rt1.jk/net/sched/sch_generic.c @@ -94,6 +94,7 @@ static inline int handle_dev_cpu_collisi struct Qdisc *q) { int ret; + unsigned long flags; if (unlikely(dev->xmit_lock_owner == (void *)current)) { /* @@ -112,7 +113,9 @@ static inline int handle_dev_cpu_collisi * Another cpu is holding lock, requeue & delay xmits for * some time. */ + local_irq_save(flags); __get_cpu_var(netdev_rx_stat).cpu_collision++; + local_irq_restore(flags); ret = dev_requeue_skb(skb, dev, q); }