On 12/3/2018 3:39 PM, Mark Bloch wrote: > On 12/3/18 1:28 PM, Steve Wise wrote: >> Hey, >> >> If I rmmod the net device that is being used by an rxe device, then the >> rmmod process hangs unregistering the ib device. Some sort of deadlock. >> >> If I change rxe_notify() to defer the rxe_remove() call to the kernel >> global workq thread, then this deadlock is avoided. But I'm not sure >> why and if this isn't some rdma core bug? >> >> There are 3 threads that seem to be stuck in >> rxe_query_port()/ib_get_ethspeed(), a few others stuck grabbing a mutex, >> and then the rmmod thread that is stuck in flush_workqueue() during >> ib_device unregistration. >> >> Anybody familiar with this issue? >> >> [ 739.922232] INFO: task kworker/9:1:113 blocked for more than 120 seconds. >> [ 739.929794] Not tainted 4.20.0-rc1-newlink+ #192 >> [ 739.935701] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >> disables this message. >> [ 739.944302] kworker/9:1 D 0 113 2 0x80000000 >> [ 739.950568] Workqueue: ipv6_addrconf addrconf_verify_work >> [ 739.956749] Call Trace: >> [ 739.959977] ? __schedule+0x322/0x9c0 >> [ 739.964415] ? __mutex_lock+0x343/0x7d0 >> [ 739.969028] schedule+0x32/0x80 >> [ 739.972933] schedule_preempt_disabled+0xa/0x10 >> [ 739.978223] __mutex_lock+0x348/0x7d0 >> [ 739.982639] ? __mutex_lock+0x130/0x7d0 >> [ 739.987231] ? addrconf_verify_work+0xa/0x20 >> [ 739.992264] addrconf_verify_work+0xa/0x20 >> [ 739.997108] process_one_work+0x203/0x4a0 >> [ 740.001857] ? process_one_work+0x17a/0x4a0 >> [ 740.006784] worker_thread+0x46/0x3e0 >> [ 740.011190] kthread+0xf8/0x130 >> [ 740.015053] ? process_one_work+0x4a0/0x4a0 >> [ 740.019951] ? kthread_stop+0x140/0x140 >> [ 740.024489] ret_from_fork+0x35/0x40 >> [ 740.028758] INFO: task kworker/13:1:117 blocked for more than 120 >> seconds. >> [ 740.036322] Not tainted 4.20.0-rc1-newlink+ #192 >> [ 740.042147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >> disables this message. >> [ 740.050663] kworker/13:1 D 0 117 2 0x80000000 >> [ 740.056846] Workqueue: infiniband ib_cache_task [ib_core] >> [ 740.062918] Call Trace: >> [ 740.066025] ? __schedule+0x322/0x9c0 >> [ 740.070339] ? __mutex_lock+0x343/0x7d0 >> [ 740.074812] schedule+0x32/0x80 >> [ 740.078576] schedule_preempt_disabled+0xa/0x10 >> [ 740.083724] __mutex_lock+0x348/0x7d0 >> [ 740.087992] ? __mutex_lock+0x130/0x7d0 >> [ 740.092424] ? rxe_query_port+0x83/0xf0 [rdma_rxe] >> [ 740.097805] rxe_query_port+0x83/0xf0 [rdma_rxe] >> [ 740.103006] ib_query_port+0xc9/0x1a0 [ib_core] >> [ 740.108106] ? _cond_resched+0x15/0x30 >> [ 740.112420] ? kmem_cache_alloc_trace+0x196/0x200 >> [ 740.117680] ib_cache_update.isra.13.part.14+0x5c/0x340 [ib_core] >> [ 740.124322] ? process_one_work+0x17a/0x4a0 >> [ 740.129053] ib_cache_task+0x16/0x20 [ib_core] >> [ 740.134027] process_one_work+0x203/0x4a0 >> [ 740.138567] ? process_one_work+0x17a/0x4a0 >> [ 740.143266] worker_thread+0x46/0x3e0 >> [ 740.147431] kthread+0xf8/0x130 >> [ 740.151060] ? process_one_work+0x4a0/0x4a0 >> [ 740.155734] ? kthread_stop+0x140/0x140 >> [ 740.160054] ret_from_fork+0x35/0x40 >> [ 740.164112] INFO: task kworker/7:1:120 blocked for more than 120 seconds. >> [ 740.171382] Not tainted 4.20.0-rc1-newlink+ #192 >> [ 740.177000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >> disables this message. >> [ 740.185317] kworker/7:1 D 0 120 2 0x80000000 >> [ 740.191286] Workqueue: infiniband ib_cache_task [ib_core] >> [ 740.197168] Call Trace: >> [ 740.200112] ? __schedule+0x322/0x9c0 >> [ 740.204260] ? __mutex_lock+0x343/0x7d0 >> [ 740.208591] schedule+0x32/0x80 >> [ 740.212214] schedule_preempt_disabled+0xa/0x10 >> [ 740.217227] __mutex_lock+0x348/0x7d0 >> [ 740.221378] ? __mutex_lock+0x130/0x7d0 >> [ 740.225707] ? ib_get_eth_speed+0x6e/0x150 [ib_core] >> [ 740.231174] ib_get_eth_speed+0x6e/0x150 [ib_core] >> [ 740.236459] ? update_curr+0x17e/0x250 >> [ 740.240694] ? find_held_lock+0x2d/0x90 >> [ 740.245021] rxe_query_port+0x98/0xf0 [rdma_rxe] >> [ 740.250141] ib_query_port+0xc9/0x1a0 [ib_core] >> [ 740.255170] ? _cond_resched+0x15/0x30 >> [ 740.259416] ? kmem_cache_alloc_trace+0x196/0x200 >> [ 740.264628] ib_cache_update.isra.13.part.14+0x5c/0x340 [ib_core] >> [ 740.271215] ? process_one_work+0x17a/0x4a0 >> [ 740.275893] ib_cache_task+0x16/0x20 [ib_core] >> [ 740.280820] process_one_work+0x203/0x4a0 >> [ 740.285300] ? process_one_work+0x17a/0x4a0 >> [ 740.289952] worker_thread+0x46/0x3e0 >> [ 740.294082] kthread+0xf8/0x130 >> [ 740.297684] ? process_one_work+0x4a0/0x4a0 >> [ 740.302334] ? kthread_stop+0x140/0x140 >> [ 740.306634] ret_from_fork+0x35/0x40 >> [ 740.310690] INFO: task kworker/7:2:15203 blocked for more than 120 >> seconds. >> [ 740.318125] Not tainted 4.20.0-rc1-newlink+ #192 >> [ 740.323750] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >> disables this message. >> [ 740.332078] kworker/7:2 D 0 15203 2 0x80000080 >> [ 740.338068] Workqueue: events linkwatch_event >> [ 740.342923] Call Trace: >> [ 740.345869] ? __schedule+0x322/0x9c0 >> [ 740.350029] ? __mutex_lock+0x343/0x7d0 >> [ 740.354360] schedule+0x32/0x80 >> [ 740.357991] schedule_preempt_disabled+0xa/0x10 >> [ 740.363012] __mutex_lock+0x348/0x7d0 >> [ 740.367160] ? __mutex_lock+0x130/0x7d0 >> [ 740.371478] ? linkwatch_event+0xa/0x30 >> [ 740.375796] linkwatch_event+0xa/0x30 >> [ 740.379946] process_one_work+0x203/0x4a0 >> [ 740.384434] ? process_one_work+0x17a/0x4a0 >> [ 740.389101] worker_thread+0x46/0x3e0 >> [ 740.393250] kthread+0xf8/0x130 >> [ 740.396865] ? process_one_work+0x4a0/0x4a0 >> [ 740.401522] ? kthread_stop+0x140/0x140 >> [ 740.405824] ret_from_fork+0x35/0x40 >> [ 740.409854] INFO: task rmmod:15311 blocked for more than 120 seconds. >> [ 740.416757] Not tainted 4.20.0-rc1-newlink+ #192 >> [ 740.422349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >> disables this message. >> [ 740.430649] rmmod D 0 15311 14253 0x00000080 >> [ 740.436613] Call Trace: >> [ 740.439538] ? __schedule+0x322/0x9c0 >> [ 740.443682] ? ep_poll_callback+0x2b5/0x370 >> [ 740.448340] schedule+0x32/0x80 >> [ 740.451966] schedule_timeout+0x221/0x3d0 >> [ 740.456450] ? find_held_lock+0x2d/0x90 >> [ 740.460755] ? wait_for_completion+0x108/0x170 >> [ 740.465670] wait_for_completion+0x110/0x170 >> [ 740.470411] ? wake_up_q+0x70/0x70 >> [ 740.474283] flush_workqueue+0x1a7/0x540 >> [ 740.478677] ? flush_workqueue+0x76/0x540 >> [ 740.483754] ib_cache_cleanup_one+0x41/0x180 [ib_core] >> [ 740.489360] ? __mutex_unlock_slowpath+0x4b/0x290 >> [ 740.494537] ib_unregister_device+0x126/0x1f0 [ib_core] >> [ 740.500227] rxe_remove+0xe/0x30 [rdma_rxe] >> [ 740.504887] rxe_notify+0x7f/0x130 [rdma_rxe] >> [ 740.509715] notifier_call_chain+0x47/0x70 >> [ 740.514283] rollback_registered_many+0x275/0x500 >> [ 740.519462] rollback_registered+0x56/0x90 >> [ 740.524032] unregister_netdevice_queue+0x7e/0x100 >> [ 740.529298] unregister_netdev+0x18/0x20 >> [ 740.533694] remove_one+0x1b9/0x1f0 [cxgb4] >> [ 740.538347] pci_device_remove+0x3b/0xc0 >> [ 740.542734] device_release_driver_internal+0x175/0x240 >> [ 740.548420] driver_detach+0x3a/0x70 >> [ 740.552446] bus_remove_driver+0x58/0xd0 >> [ 740.556805] pci_unregister_driver+0x26/0xa0 >> [ 740.561515] cxgb4_cleanup_module+0x28/0x61 [cxgb4] >> [ 740.566835] __x64_sys_delete_module+0x1e2/0x240 >> [ 740.571898] do_syscall_64+0x5b/0x190 >> [ 740.576008] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> [ 740.581507] RIP: 0033:0x7f02bc091397 >> [ 740.585529] Code: Bad RIP value. >> [ 740.589199] RSP: 002b:00007fffb8f9a458 EFLAGS: 00000202 ORIG_RAX: >> 00000000000000b0 >> [ 740.597231] RAX: ffffffffffffffda RBX: 000000000081e220 RCX: >> 00007f02bc091397 >> [ 740.604832] RDX: 00007f02bc105b20 RSI: 0000000000000800 RDI: >> 000000000081e288 >> [ 740.612438] RBP: 0000000000000000 R08: 00007f02bc35a060 R09: >> 00007f02bc105b20 >> [ 740.620052] R10: 00007fffb8f99ee0 R11: 0000000000000202 R12: >> 00007fffb8f9c751 >> [ 740.627666] R13: 0000000000000000 R14: 000000000081e220 R15: >> 000000000081e010 >> [ 740.635289] >> Showing all locks held in the system: >> [ 740.642434] 1 lock held by khungtaskd/97: >> [ 740.646928] #0: 00000000354ce682 (rcu_read_lock){....}, at: >> debug_show_all_locks+0xe/0x110 >> [ 740.655793] 3 locks held by kworker/9:1/113: >> [ 740.660561] #0: 000000001c09232d >> ((wq_completion)"%s"("ipv6_addrconf")){....}, at: >> process_one_work+0x17a/0x4a0 >> [ 740.671261] #1: 00000000614c2ae5 ((addr_chk_work).work){....}, at: >> process_one_work+0x17a/0x4a0 >> [ 740.680565] #2: 0000000067405b77 (rtnl_mutex){....}, at: >> addrconf_verify_work+0xa/0x20 >> [ 740.689093] 3 locks held by kworker/13:1/117: >> [ 740.693968] #0: 0000000093301381 >> ((wq_completion)"infiniband"){....}, at: process_one_work+0x17a/0x4a0 >> [ 740.703911] #1: 00000000fb2af2df >> ((work_completion)(&work->work)#2){....}, at: process_one_work+0x17a/0x4a0 >> [ 740.714307] #2: 00000000ff63eb3d (&rxe->usdev_lock){....}, at: >> rxe_query_port+0x83/0xf0 [rdma_rxe] >> [ 740.723924] 4 locks held by kworker/7:1/120: >> [ 740.728767] #0: 0000000093301381 >> ((wq_completion)"infiniband"){....}, at: process_one_work+0x17a/0x4a0 >> [ 740.738758] #1: 00000000fdfc413f >> ((work_completion)(&work->work)#2){....}, at: process_one_work+0x17a/0x4a0 >> [ 740.749187] #2: 00000000ff63eb3d (&rxe->usdev_lock){....}, at: >> rxe_query_port+0x83/0xf0 [rdma_rxe] >> [ 740.758851] #3: 0000000067405b77 (rtnl_mutex){....}, at: >> ib_get_eth_speed+0x6e/0x150 [ib_core] >> [ 740.768182] 2 locks held by agetty/7388: >> [ 740.772735] #0: 00000000a52f1433 (&tty->ldisc_sem){....}, at: >> tty_ldisc_ref_wait+0x24/0x50 >> [ 740.781728] #1: 00000000b21ff732 (&ldata->atomic_read_lock){....}, >> at: n_tty_read+0xd8/0x8f0 >> [ 740.790908] 3 locks held by kworker/7:2/15203: >> [ 740.795999] #0: 00000000a7e503ae ((wq_completion)"events"){....}, >> at: process_one_work+0x17a/0x4a0 >> [ 740.805714] #1: 00000000d09d5f76 ((linkwatch_work).work){....}, at: >> process_one_work+0x17a/0x4a0 >> [ 740.815266] #2: 0000000067405b77 (rtnl_mutex){....}, at: >> linkwatch_event+0xa/0x30 >> [ 740.823521] 2 locks held by rmmod/15311: >> [ 740.828133] #0: 00000000923b1412 (&dev->mutex){....}, at: >> device_release_driver_internal+0x39/0x240 >> [ 740.837966] #1: 0000000067405b77 (rtnl_mutex){....}, at: >> unregister_netdev+0xe/0x20 >> [ 740.846425] 2 locks held by bash/15323: >> [ 740.850971] #0: 00000000be5388b9 (&tty->ldisc_sem){....}, at: >> tty_ldisc_ref_wait+0x24/0x50 >> [ 740.860054] #1: 0000000098aaed68 (&ldata->atomic_read_lock){....}, >> at: n_tty_read+0xd8/0x8f0 >> >> [ 740.871555] ============================================= >> > ib_get_eth_speed() - tries to get rtnl lock > > addrconf_verify_work() - tries to get rtnl lock > > linkwatch_event() - tries to get rtnl lock > > When you try to remove the netdevice, I assume it calls: > unregister_netdev() - which gets the rtnl lock > > then it sends NETREG_UNREGISTERING event, which I assume rxe tries to > remove the ib device... and deadlock. > > RXE is really broken :) > > Mark indeed. :) In the deadlock above, the rmmod thread owns the rtnl lock. Yet its stuck waiting in ib_cache_cleanup_one() for the ib_wq work tasks to all get flushed. And the other processes that are stuck; the ones in ib_get_eth_speed() are tasks running on the ib_wq. And they want the rtnl lock. Thus we have the deadlock... I think the lesson here is one cannot call ib_unregister_device() with the rtnl lock held... I have a fix, I'll add it to the pile I'm accumulating to try and get the newlink/dellink feature added... Steve.