On Tue, Apr 06, 2010 at 12:38:16PM +0800, Cong Wang wrote:
> Cong Wang wrote:
>> Before I try to reproduce it, could you please try to replace the
>> 'read_lock()' in slaves_support_netpoll() with 'read_lock_bh()'?
>> (read_unlock() too) Try if this helps.
>>
>
> Confirmed. Please use the attached patch instead, for your testing.
>
> Thanks!
>

Moving those locks to bh-locks will not resolve this: I tried that
yesterday (the variant I tested is sketched at the end of this mail),
and I tried your new patch today, without success.  Note that the
warning is a WARN_ON_ONCE, so you need to reboot to see that it is
still a problem; simply unloading and reloading the new module is not
an accurate test.

Also, my system still hangs when removing the bonding module.  I do
not think you intended to fix that with this patch, but I wanted to
make it clear to everyone on the list.

You should also configure your kernel with some of the lock debugging
options enabled.  I've been using the following:

CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_LOCKDEP=y

Here is the output when I remove a slave from the bond.  My
xmit_roundrobin patch from earlier (replacing read_lock with
read_trylock; also sketched at the end of this mail) was applied.  It
might be helpful for you when debugging these issues.

------------[ cut here ]------------
WARNING: at kernel/softirq.c:143 local_bh_enable+0x43/0xba()
Hardware name: HP xw4400 Workstation
Modules linked in: netconsole bonding ipt_REJECT bridge stp autofs4 i2c_dev i2c_core hidp rfcomm l2cap crc16 bluetooth rfki]
Pid: 10, comm: events/1 Not tainted 2.6.34-rc3 #6
Call Trace:
 [<ffffffff81058754>] ? cpu_clock+0x2d/0x41
 [<ffffffff810404d9>] ? local_bh_enable+0x43/0xba
 [<ffffffff8103a350>] warn_slowpath_common+0x77/0x8f
 [<ffffffff812a4659>] ? dev_queue_xmit+0x408/0x467
 [<ffffffff8103a377>] warn_slowpath_null+0xf/0x11
 [<ffffffff810404d9>] local_bh_enable+0x43/0xba
 [<ffffffff812a4659>] dev_queue_xmit+0x408/0x467
 [<ffffffff812a435e>] ? dev_queue_xmit+0x10d/0x467
 [<ffffffffa04a383f>] bond_dev_queue_xmit+0x1cd/0x1f9 [bonding]
 [<ffffffffa04a41ee>] bond_start_xmit+0x139/0x3e9 [bonding]
 [<ffffffff812b0e9a>] queue_process+0xa8/0x160
 [<ffffffff812b0df2>] ? queue_process+0x0/0x160
 [<ffffffff81050bfb>] worker_thread+0x1af/0x2ae
 [<ffffffff81050ba2>] ? worker_thread+0x156/0x2ae
 [<ffffffff81053c34>] ? autoremove_wake_function+0x0/0x38
 [<ffffffff81050a4c>] ? worker_thread+0x0/0x2ae
 [<ffffffff81053901>] kthread+0x7d/0x85
 [<ffffffff81003794>] kernel_thread_helper+0x4/0x10
 [<ffffffff813362bc>] ? restore_args+0x0/0x30
 [<ffffffff81053884>] ? kthread+0x0/0x85
 [<ffffffff81003790>] ? kernel_thread_helper+0x0/0x10
---[ end trace 241f49bf65e0f4f0 ]---

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.34-rc3 #6
---------------------------------------------------------
events/1/10 just changed the state of lock:
 (&bonding_netdev_xmit_lock_key){+.+...}, at: [<ffffffff812b0e75>] queue_process+0x83/0x160
but this lock was taken by another, SOFTIRQ-safe lock in the past:
 (&(&dev->tx_global_lock)->rlock){+.-...}

and interrupts could create inverse lock ordering between them.
other info that might help us debug this:

4 locks held by events/1/10:
 #0:  (events){+.+.+.}, at: [<ffffffff81050ba2>] worker_thread+0x156/0x2ae
 #1:  ((&(&npinfo->tx_work)->work)){+.+...}, at: [<ffffffff81050ba2>] worker_thread+0x156/0x2ae
 #2:  (&bonding_netdev_xmit_lock_key){+.+...}, at: [<ffffffff812b0e75>] queue_process+0x83/0x160
 #3:  (&bond->lock){++.+..}, at: [<ffffffffa04a4107>] bond_start_xmit+0x52/0x3e9 [bonding]

the shortest dependencies between 2nd lock and 1st lock:
 -> (&(&dev->tx_global_lock)->rlock){+.-...} ops: 129 {
    HARDIRQ-ON-W at:
      [<ffffffff810651ef>] __lock_acquire+0x643/0x813
      [<ffffffff81065487>] lock_acquire+0xc8/0xed
      [<ffffffff81335742>] _raw_spin_lock+0x31/0x66
      [<ffffffff812b64bd>] dev_deactivate+0x6f/0x195
      [<ffffffff812ad7c4>] linkwatch_do_dev+0x9a/0xae
      [<ffffffff812ada6a>] __linkwatch_run_queue+0x106/0x14a
      [<ffffffff812adad8>] linkwatch_event+0x2a/0x31
      [<ffffffff81050bfb>] worker_thread+0x1af/0x2ae
      [<ffffffff81053901>] kthread+0x7d/0x85
      [<ffffffff81003794>] kernel_thread_helper+0x4/0x10
    IN-SOFTIRQ-W at:
      [<ffffffff810651a3>] __lock_acquire+0x5f7/0x813
      [<ffffffff81065487>] lock_acquire+0xc8/0xed
      [<ffffffff81335742>] _raw_spin_lock+0x31/0x66
      [<ffffffff812b6606>] dev_watchdog+0x23/0x1f2
      [<ffffffff8104701b>] run_timer_softirq+0x1d1/0x285
      [<ffffffff81040021>] __do_softirq+0xdb/0x1ab
      [<ffffffff8100388c>] call_softirq+0x1c/0x34
      [<ffffffff81004f9d>] do_softirq+0x38/0x83
      [<ffffffff8103ff44>] irq_exit+0x45/0x47
      [<ffffffff810193bc>] smp_apic_timer_interrupt+0x88/0x98
      [<ffffffff81003353>] apic_timer_interrupt+0x13/0x20
      [<ffffffff81001a21>] cpu_idle+0x4d/0x6b
      [<ffffffff8131da3a>] rest_init+0xbe/0xc2
      [<ffffffff81a00d4e>] start_kernel+0x38c/0x399
      [<ffffffff81a002a5>] x86_64_start_reservations+0xb5/0xb9
      [<ffffffff81a0038f>] x86_64_start_kernel+0xe6/0xed
    INITIAL USE at:
      [<ffffffff8106525c>] __lock_acquire+0x6b0/0x813
      [<ffffffff81065487>] lock_acquire+0xc8/0xed
      [<ffffffff81335742>] _raw_spin_lock+0x31/0x66
      [<ffffffff812b64bd>] dev_deactivate+0x6f/0x195
      [<ffffffff812ad7c4>] linkwatch_do_dev+0x9a/0xae
      [<ffffffff812ada6a>] __linkwatch_run_queue+0x106/0x14a
      [<ffffffff812adad8>] linkwatch_event+0x2a/0x31
      [<ffffffff81050bfb>] worker_thread+0x1af/0x2ae
      [<ffffffff81053901>] kthread+0x7d/0x85
      [<ffffffff81003794>] kernel_thread_helper+0x4/0x10
  }
  ... key      at: [<ffffffff8282ceb0>] __key.51521+0x0/0x8
  ...
acquired at:
   [<ffffffff810649f9>] validate_chain+0xb87/0xd3a
   [<ffffffff81065359>] __lock_acquire+0x7ad/0x813
   [<ffffffff81065487>] lock_acquire+0xc8/0xed
   [<ffffffff81335742>] _raw_spin_lock+0x31/0x66
   [<ffffffff812b64e4>] dev_deactivate+0x96/0x195
   [<ffffffff812a17fc>] __dev_close+0x69/0x86
   [<ffffffff8129f8ed>] __dev_change_flags+0xa8/0x12b
   [<ffffffff812a148c>] dev_change_flags+0x1c/0x51
   [<ffffffff812eee8a>] devinet_ioctl+0x26e/0x5d0
   [<ffffffff812ef978>] inet_ioctl+0x8a/0xa2
   [<ffffffff8128fc28>] sock_do_ioctl+0x26/0x45
   [<ffffffff8128fe5a>] sock_ioctl+0x213/0x226
   [<ffffffff810e5988>] vfs_ioctl+0x2a/0x9d
   [<ffffffff810e5f13>] do_vfs_ioctl+0x491/0x4e2
   [<ffffffff810e5fbb>] sys_ioctl+0x57/0x7a
   [<ffffffff8100296b>] system_call_fastpath+0x16/0x1b

 -> (&bonding_netdev_xmit_lock_key){+.+...} ops: 2 {
    HARDIRQ-ON-W at:
      [<ffffffff810651ef>] __lock_acquire+0x643/0x813
      [<ffffffff81065487>] lock_acquire+0xc8/0xed
      [<ffffffff81335742>] _raw_spin_lock+0x31/0x66
      [<ffffffff812b64e4>] dev_deactivate+0x96/0x195
      [<ffffffff812a17fc>] __dev_close+0x69/0x86
      [<ffffffff8129f8ed>] __dev_change_flags+0xa8/0x12b
      [<ffffffff812a148c>] dev_change_flags+0x1c/0x51
      [<ffffffff812eee8a>] devinet_ioctl+0x26e/0x5d0
      [<ffffffff812ef978>] inet_ioctl+0x8a/0xa2
      [<ffffffff8128fc28>] sock_do_ioctl+0x26/0x45
      [<ffffffff8128fe5a>] sock_ioctl+0x213/0x226
      [<ffffffff810e5988>] vfs_ioctl+0x2a/0x9d
      [<ffffffff810e5f13>] do_vfs_ioctl+0x491/0x4e2
      [<ffffffff810e5fbb>] sys_ioctl+0x57/0x7a
      [<ffffffff8100296b>] system_call_fastpath+0x16/0x1b
    SOFTIRQ-ON-W at:
      [<ffffffff81062006>] mark_held_locks+0x49/0x69
      [<ffffffff81062139>] trace_hardirqs_on_caller+0x113/0x13e
      [<ffffffff81062171>] trace_hardirqs_on+0xd/0xf
      [<ffffffff81040548>] local_bh_enable+0xb2/0xba
      [<ffffffff812a4659>] dev_queue_xmit+0x408/0x467
      [<ffffffffa04a383f>] bond_dev_queue_xmit+0x1cd/0x1f9 [bonding]
      [<ffffffffa04a41ee>] bond_start_xmit+0x139/0x3e9 [bonding]
      [<ffffffff812b0e9a>] queue_process+0xa8/0x160
      [<ffffffff81050bfb>] worker_thread+0x1af/0x2ae
      [<ffffffff81053901>] kthread+0x7d/0x85
      [<ffffffff81003794>] kernel_thread_helper+0x4/0x10
    INITIAL USE at:
      [<ffffffff8106525c>] __lock_acquire+0x6b0/0x813
      [<ffffffff81065487>] lock_acquire+0xc8/0xed
      [<ffffffff81335742>] _raw_spin_lock+0x31/0x66
      [<ffffffff812b64e4>] dev_deactivate+0x96/0x195
      [<ffffffff812a17fc>] __dev_close+0x69/0x86
      [<ffffffff8129f8ed>] __dev_change_flags+0xa8/0x12b
      [<ffffffff812a148c>] dev_change_flags+0x1c/0x51
      [<ffffffff812eee8a>] devinet_ioctl+0x26e/0x5d0
      [<ffffffff812ef978>] inet_ioctl+0x8a/0xa2
      [<ffffffff8128fc28>] sock_do_ioctl+0x26/0x45
      [<ffffffff8128fe5a>] sock_ioctl+0x213/0x226
      [<ffffffff810e5988>] vfs_ioctl+0x2a/0x9d
      [<ffffffff810e5f13>] do_vfs_ioctl+0x491/0x4e2
      [<ffffffff810e5fbb>] sys_ioctl+0x57/0x7a
      [<ffffffff8100296b>] system_call_fastpath+0x16/0x1b
  }
  ... key      at: [<ffffffffa04b1968>] bonding_netdev_xmit_lock_key+0x0/0xffffffffffffa78c [bonding]
  ...
acquired at:
   [<ffffffff8106386d>] check_usage_backwards+0xb8/0xc7
   [<ffffffff81061d81>] mark_lock+0x311/0x54d
   [<ffffffff81062006>] mark_held_locks+0x49/0x69
   [<ffffffff81062139>] trace_hardirqs_on_caller+0x113/0x13e
   [<ffffffff81062171>] trace_hardirqs_on+0xd/0xf
   [<ffffffff81040548>] local_bh_enable+0xb2/0xba
   [<ffffffff812a4659>] dev_queue_xmit+0x408/0x467
   [<ffffffffa04a383f>] bond_dev_queue_xmit+0x1cd/0x1f9 [bonding]
   [<ffffffffa04a41ee>] bond_start_xmit+0x139/0x3e9 [bonding]
   [<ffffffff812b0e9a>] queue_process+0xa8/0x160
   [<ffffffff81050bfb>] worker_thread+0x1af/0x2ae
   [<ffffffff81053901>] kthread+0x7d/0x85
   [<ffffffff81003794>] kernel_thread_helper+0x4/0x10

stack backtrace:
Pid: 10, comm: events/1 Tainted: G        W   2.6.34-rc3 #6
Call Trace:
 [<ffffffff8106189e>] print_irq_inversion_bug+0x121/0x130
 [<ffffffff8106386d>] check_usage_backwards+0xb8/0xc7
 [<ffffffff810637b5>] ? check_usage_backwards+0x0/0xc7
 [<ffffffff81061d81>] mark_lock+0x311/0x54d
 [<ffffffff81062006>] mark_held_locks+0x49/0x69
 [<ffffffff81040548>] ? local_bh_enable+0xb2/0xba
 [<ffffffff81062139>] trace_hardirqs_on_caller+0x113/0x13e
 [<ffffffff812a4659>] ? dev_queue_xmit+0x408/0x467
 [<ffffffff81062171>] trace_hardirqs_on+0xd/0xf
 [<ffffffff81040548>] local_bh_enable+0xb2/0xba
 [<ffffffff812a4659>] dev_queue_xmit+0x408/0x467
 [<ffffffff812a435e>] ? dev_queue_xmit+0x10d/0x467
 [<ffffffffa04a383f>] bond_dev_queue_xmit+0x1cd/0x1f9 [bonding]
 [<ffffffffa04a41ee>] bond_start_xmit+0x139/0x3e9 [bonding]
 [<ffffffff812b0e9a>] queue_process+0xa8/0x160
 [<ffffffff812b0df2>] ? queue_process+0x0/0x160
 [<ffffffff81050bfb>] worker_thread+0x1af/0x2ae
 [<ffffffff81050ba2>] ? worker_thread+0x156/0x2ae
 [<ffffffff81053c34>] ? autoremove_wake_function+0x0/0x38
 [<ffffffff81050a4c>] ? worker_thread+0x0/0x2ae
 [<ffffffff81053901>] kthread+0x7d/0x85
 [<ffffffff81003794>] kernel_thread_helper+0x4/0x10
 [<ffffffff813362bc>] ? restore_args+0x0/0x30
 [<ffffffff81053884>] ? kthread+0x0/0x85
 [<ffffffff81003790>] ? kernel_thread_helper+0x0/0x10

Dead loop on virtual device bond0, fix it urgently!
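
For reference, here is roughly what I tested for the bh-lock
suggestion quoted above.  This is reconstructed from my tree, so it
may not match your attached patch exactly; the only change from the
current code is the switch to the _bh lock variants:

static bool slaves_support_netpoll(struct net_device *bond_dev)
{
	struct bonding *bond = netdev_priv(bond_dev);
	struct slave *slave;
	int i = 0;
	bool ret = true;

	/* was: read_lock(&bond->lock); */
	read_lock_bh(&bond->lock);
	bond_for_each_slave(bond, slave, i) {
		if ((slave->dev->priv_flags & IFF_DISABLE_NETPOLL) ||
		    !slave->dev->netdev_ops->ndo_poll_controller)
			ret = false;
	}
	/* was: read_unlock(&bond->lock); */
	read_unlock_bh(&bond->lock);

	return i != 0 && ret;
}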
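
And the xmit_roundrobin change that was applied when I captured the
trace above is essentially the following hunk, reconstructed here with
line offsets omitted (on trylock failure I chose to return
NETDEV_TX_BUSY so the netpoll path requeues the skb; the rest of
bond_xmit_roundrobin() is unchanged):

--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ ... @@ static int bond_xmit_roundrobin(struct sk_buff *skb, struct net_device *bond_dev)
-	read_lock(&bond->lock);
+	/* do not spin on bond->lock in the netpoll xmit path; report
+	 * busy and let the caller requeue the skb instead */
+	if (!read_trylock(&bond->lock))
+		return NETDEV_TX_BUSY;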