rxe hangs on netdev driver unload

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hey,

If I rmmod the net device that is being used by an rxe device, then the
rmmod process hangs unregistering the ib device.  Some sort of deadlock. 

If I change rxe_notify() to defer the rxe_remove() call to the kernel
global workq thread, then this deadlock is avoided.  But I'm not sure
why and if this isn't some rdma core bug? 

There are 3 threads that seem to be stuck in
rxe_query_port()/ib_get_ethspeed(), a few others stuck grabbing a mutex,
and then the rmmod thread that is stuck in flush_workqueue() during
ib_device unregistration.

Anybody familiar with this issue?  

[  739.922232] INFO: task kworker/9:1:113 blocked for more than 120 seconds.
[  739.929794]       Not tainted 4.20.0-rc1-newlink+ #192
[  739.935701] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  739.944302] kworker/9:1     D    0   113      2 0x80000000
[  739.950568] Workqueue: ipv6_addrconf addrconf_verify_work
[  739.956749] Call Trace:
[  739.959977]  ? __schedule+0x322/0x9c0
[  739.964415]  ? __mutex_lock+0x343/0x7d0
[  739.969028]  schedule+0x32/0x80
[  739.972933]  schedule_preempt_disabled+0xa/0x10
[  739.978223]  __mutex_lock+0x348/0x7d0
[  739.982639]  ? __mutex_lock+0x130/0x7d0
[  739.987231]  ? addrconf_verify_work+0xa/0x20
[  739.992264]  addrconf_verify_work+0xa/0x20
[  739.997108]  process_one_work+0x203/0x4a0
[  740.001857]  ? process_one_work+0x17a/0x4a0
[  740.006784]  worker_thread+0x46/0x3e0
[  740.011190]  kthread+0xf8/0x130
[  740.015053]  ? process_one_work+0x4a0/0x4a0
[  740.019951]  ? kthread_stop+0x140/0x140
[  740.024489]  ret_from_fork+0x35/0x40
[  740.028758] INFO: task kworker/13:1:117 blocked for more than 120
seconds.
[  740.036322]       Not tainted 4.20.0-rc1-newlink+ #192
[  740.042147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  740.050663] kworker/13:1    D    0   117      2 0x80000000
[  740.056846] Workqueue: infiniband ib_cache_task [ib_core]
[  740.062918] Call Trace:
[  740.066025]  ? __schedule+0x322/0x9c0
[  740.070339]  ? __mutex_lock+0x343/0x7d0
[  740.074812]  schedule+0x32/0x80
[  740.078576]  schedule_preempt_disabled+0xa/0x10
[  740.083724]  __mutex_lock+0x348/0x7d0
[  740.087992]  ? __mutex_lock+0x130/0x7d0
[  740.092424]  ? rxe_query_port+0x83/0xf0 [rdma_rxe]
[  740.097805]  rxe_query_port+0x83/0xf0 [rdma_rxe]
[  740.103006]  ib_query_port+0xc9/0x1a0 [ib_core]
[  740.108106]  ? _cond_resched+0x15/0x30
[  740.112420]  ? kmem_cache_alloc_trace+0x196/0x200
[  740.117680]  ib_cache_update.isra.13.part.14+0x5c/0x340 [ib_core]
[  740.124322]  ? process_one_work+0x17a/0x4a0
[  740.129053]  ib_cache_task+0x16/0x20 [ib_core]
[  740.134027]  process_one_work+0x203/0x4a0
[  740.138567]  ? process_one_work+0x17a/0x4a0
[  740.143266]  worker_thread+0x46/0x3e0
[  740.147431]  kthread+0xf8/0x130
[  740.151060]  ? process_one_work+0x4a0/0x4a0
[  740.155734]  ? kthread_stop+0x140/0x140
[  740.160054]  ret_from_fork+0x35/0x40
[  740.164112] INFO: task kworker/7:1:120 blocked for more than 120 seconds.
[  740.171382]       Not tainted 4.20.0-rc1-newlink+ #192
[  740.177000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  740.185317] kworker/7:1     D    0   120      2 0x80000000
[  740.191286] Workqueue: infiniband ib_cache_task [ib_core]
[  740.197168] Call Trace:
[  740.200112]  ? __schedule+0x322/0x9c0
[  740.204260]  ? __mutex_lock+0x343/0x7d0
[  740.208591]  schedule+0x32/0x80
[  740.212214]  schedule_preempt_disabled+0xa/0x10
[  740.217227]  __mutex_lock+0x348/0x7d0
[  740.221378]  ? __mutex_lock+0x130/0x7d0
[  740.225707]  ? ib_get_eth_speed+0x6e/0x150 [ib_core]
[  740.231174]  ib_get_eth_speed+0x6e/0x150 [ib_core]
[  740.236459]  ? update_curr+0x17e/0x250
[  740.240694]  ? find_held_lock+0x2d/0x90
[  740.245021]  rxe_query_port+0x98/0xf0 [rdma_rxe]
[  740.250141]  ib_query_port+0xc9/0x1a0 [ib_core]
[  740.255170]  ? _cond_resched+0x15/0x30
[  740.259416]  ? kmem_cache_alloc_trace+0x196/0x200
[  740.264628]  ib_cache_update.isra.13.part.14+0x5c/0x340 [ib_core]
[  740.271215]  ? process_one_work+0x17a/0x4a0
[  740.275893]  ib_cache_task+0x16/0x20 [ib_core]
[  740.280820]  process_one_work+0x203/0x4a0
[  740.285300]  ? process_one_work+0x17a/0x4a0
[  740.289952]  worker_thread+0x46/0x3e0
[  740.294082]  kthread+0xf8/0x130
[  740.297684]  ? process_one_work+0x4a0/0x4a0
[  740.302334]  ? kthread_stop+0x140/0x140
[  740.306634]  ret_from_fork+0x35/0x40
[  740.310690] INFO: task kworker/7:2:15203 blocked for more than 120
seconds.
[  740.318125]       Not tainted 4.20.0-rc1-newlink+ #192
[  740.323750] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  740.332078] kworker/7:2     D    0 15203      2 0x80000080
[  740.338068] Workqueue: events linkwatch_event
[  740.342923] Call Trace:
[  740.345869]  ? __schedule+0x322/0x9c0
[  740.350029]  ? __mutex_lock+0x343/0x7d0
[  740.354360]  schedule+0x32/0x80
[  740.357991]  schedule_preempt_disabled+0xa/0x10
[  740.363012]  __mutex_lock+0x348/0x7d0
[  740.367160]  ? __mutex_lock+0x130/0x7d0
[  740.371478]  ? linkwatch_event+0xa/0x30
[  740.375796]  linkwatch_event+0xa/0x30
[  740.379946]  process_one_work+0x203/0x4a0
[  740.384434]  ? process_one_work+0x17a/0x4a0
[  740.389101]  worker_thread+0x46/0x3e0
[  740.393250]  kthread+0xf8/0x130
[  740.396865]  ? process_one_work+0x4a0/0x4a0
[  740.401522]  ? kthread_stop+0x140/0x140
[  740.405824]  ret_from_fork+0x35/0x40
[  740.409854] INFO: task rmmod:15311 blocked for more than 120 seconds.
[  740.416757]       Not tainted 4.20.0-rc1-newlink+ #192
[  740.422349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  740.430649] rmmod           D    0 15311  14253 0x00000080
[  740.436613] Call Trace:
[  740.439538]  ? __schedule+0x322/0x9c0
[  740.443682]  ? ep_poll_callback+0x2b5/0x370
[  740.448340]  schedule+0x32/0x80
[  740.451966]  schedule_timeout+0x221/0x3d0
[  740.456450]  ? find_held_lock+0x2d/0x90
[  740.460755]  ? wait_for_completion+0x108/0x170
[  740.465670]  wait_for_completion+0x110/0x170
[  740.470411]  ? wake_up_q+0x70/0x70
[  740.474283]  flush_workqueue+0x1a7/0x540
[  740.478677]  ? flush_workqueue+0x76/0x540
[  740.483754]  ib_cache_cleanup_one+0x41/0x180 [ib_core]
[  740.489360]  ? __mutex_unlock_slowpath+0x4b/0x290
[  740.494537]  ib_unregister_device+0x126/0x1f0 [ib_core]
[  740.500227]  rxe_remove+0xe/0x30 [rdma_rxe]
[  740.504887]  rxe_notify+0x7f/0x130 [rdma_rxe]
[  740.509715]  notifier_call_chain+0x47/0x70
[  740.514283]  rollback_registered_many+0x275/0x500
[  740.519462]  rollback_registered+0x56/0x90
[  740.524032]  unregister_netdevice_queue+0x7e/0x100
[  740.529298]  unregister_netdev+0x18/0x20
[  740.533694]  remove_one+0x1b9/0x1f0 [cxgb4]
[  740.538347]  pci_device_remove+0x3b/0xc0
[  740.542734]  device_release_driver_internal+0x175/0x240
[  740.548420]  driver_detach+0x3a/0x70
[  740.552446]  bus_remove_driver+0x58/0xd0
[  740.556805]  pci_unregister_driver+0x26/0xa0
[  740.561515]  cxgb4_cleanup_module+0x28/0x61 [cxgb4]
[  740.566835]  __x64_sys_delete_module+0x1e2/0x240
[  740.571898]  do_syscall_64+0x5b/0x190
[  740.576008]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  740.581507] RIP: 0033:0x7f02bc091397
[  740.585529] Code: Bad RIP value.
[  740.589199] RSP: 002b:00007fffb8f9a458 EFLAGS: 00000202 ORIG_RAX:
00000000000000b0
[  740.597231] RAX: ffffffffffffffda RBX: 000000000081e220 RCX:
00007f02bc091397
[  740.604832] RDX: 00007f02bc105b20 RSI: 0000000000000800 RDI:
000000000081e288
[  740.612438] RBP: 0000000000000000 R08: 00007f02bc35a060 R09:
00007f02bc105b20
[  740.620052] R10: 00007fffb8f99ee0 R11: 0000000000000202 R12:
00007fffb8f9c751
[  740.627666] R13: 0000000000000000 R14: 000000000081e220 R15:
000000000081e010
[  740.635289]
Showing all locks held in the system:
[  740.642434] 1 lock held by khungtaskd/97:
[  740.646928]  #0: 00000000354ce682 (rcu_read_lock){....}, at:
debug_show_all_locks+0xe/0x110
[  740.655793] 3 locks held by kworker/9:1/113:
[  740.660561]  #0: 000000001c09232d
((wq_completion)"%s"("ipv6_addrconf")){....}, at:
process_one_work+0x17a/0x4a0
[  740.671261]  #1: 00000000614c2ae5 ((addr_chk_work).work){....}, at:
process_one_work+0x17a/0x4a0
[  740.680565]  #2: 0000000067405b77 (rtnl_mutex){....}, at:
addrconf_verify_work+0xa/0x20
[  740.689093] 3 locks held by kworker/13:1/117:
[  740.693968]  #0: 0000000093301381
((wq_completion)"infiniband"){....}, at: process_one_work+0x17a/0x4a0
[  740.703911]  #1: 00000000fb2af2df
((work_completion)(&work->work)#2){....}, at: process_one_work+0x17a/0x4a0
[  740.714307]  #2: 00000000ff63eb3d (&rxe->usdev_lock){....}, at:
rxe_query_port+0x83/0xf0 [rdma_rxe]
[  740.723924] 4 locks held by kworker/7:1/120:
[  740.728767]  #0: 0000000093301381
((wq_completion)"infiniband"){....}, at: process_one_work+0x17a/0x4a0
[  740.738758]  #1: 00000000fdfc413f
((work_completion)(&work->work)#2){....}, at: process_one_work+0x17a/0x4a0
[  740.749187]  #2: 00000000ff63eb3d (&rxe->usdev_lock){....}, at:
rxe_query_port+0x83/0xf0 [rdma_rxe]
[  740.758851]  #3: 0000000067405b77 (rtnl_mutex){....}, at:
ib_get_eth_speed+0x6e/0x150 [ib_core]
[  740.768182] 2 locks held by agetty/7388:
[  740.772735]  #0: 00000000a52f1433 (&tty->ldisc_sem){....}, at:
tty_ldisc_ref_wait+0x24/0x50
[  740.781728]  #1: 00000000b21ff732 (&ldata->atomic_read_lock){....},
at: n_tty_read+0xd8/0x8f0
[  740.790908] 3 locks held by kworker/7:2/15203:
[  740.795999]  #0: 00000000a7e503ae ((wq_completion)"events"){....},
at: process_one_work+0x17a/0x4a0
[  740.805714]  #1: 00000000d09d5f76 ((linkwatch_work).work){....}, at:
process_one_work+0x17a/0x4a0
[  740.815266]  #2: 0000000067405b77 (rtnl_mutex){....}, at:
linkwatch_event+0xa/0x30
[  740.823521] 2 locks held by rmmod/15311:
[  740.828133]  #0: 00000000923b1412 (&dev->mutex){....}, at:
device_release_driver_internal+0x39/0x240
[  740.837966]  #1: 0000000067405b77 (rtnl_mutex){....}, at:
unregister_netdev+0xe/0x20
[  740.846425] 2 locks held by bash/15323:
[  740.850971]  #0: 00000000be5388b9 (&tty->ldisc_sem){....}, at:
tty_ldisc_ref_wait+0x24/0x50
[  740.860054]  #1: 0000000098aaed68 (&ldata->atomic_read_lock){....},
at: n_tty_read+0xd8/0x8f0

[  740.871555] =============================================




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux