Hello Below WARNING triggered by blktests nvmeof-mp/002, it can be reproduced with the latest 5.16.0-rc3 and also exists on 5.12, pls check it.
iirc this was reported before, based on my analysis lockdep is giving a false alarm here. The reason is that the id_priv->handler_mutex cannot be the same for both cm_id that is handling the connect and the cm_id that is handling the rdma_destroy_id because rdma_destroy_id call is always called on a already disconnected cm_id, so this deadlock lockdep is complaining about cannot happen. I'm not sure how to settle this.
# use_siw=1 ./check nvmeof-mp/002 nvmeof-mp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed] runtime 41.767s ... 36.877s # ./check nvmeof-mp/002 nvmeof-mp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed] runtime ... 41.277s [ 609.166100] run blktests nvmeof-mp/002 at 2021-12-03 21:27:52 [ 609.977883] null_blk: module loaded [ 610.538697] rdma_rxe: loaded [ 610.602763] infiniband eno1_rxe: set active [ 610.602782] infiniband eno1_rxe: added eno1 [ 610.639520] eno2 speed is unknown, defaulting to 1000 [ 610.644946] eno2 speed is unknown, defaulting to 1000 [ 610.650318] eno2 speed is unknown, defaulting to 1000 [ 610.690546] infiniband eno2_rxe: set down [ 610.690553] infiniband eno2_rxe: added eno2 [ 610.691112] eno2 speed is unknown, defaulting to 1000 [ 610.714293] eno2 speed is unknown, defaulting to 1000 [ 610.729535] eno3 speed is unknown, defaulting to 1000 [ 610.734666] eno3 speed is unknown, defaulting to 1000 [ 610.740026] eno3 speed is unknown, defaulting to 1000 [ 610.781372] infiniband eno3_rxe: set down [ 610.781379] infiniband eno3_rxe: added eno3 [ 610.781906] eno3 speed is unknown, defaulting to 1000 [ 610.806211] eno2 speed is unknown, defaulting to 1000 [ 610.811508] eno3 speed is unknown, defaulting to 1000 [ 610.826655] eno4 speed is unknown, defaulting to 1000 [ 610.831810] eno4 speed is unknown, defaulting to 1000 [ 610.837176] eno4 speed is unknown, defaulting to 1000 [ 610.885844] infiniband eno4_rxe: set down [ 610.885850] infiniband eno4_rxe: added eno4 [ 610.886426] eno4 speed is unknown, defaulting to 1000 [ 611.266756] Loading iSCSI transport class v2.0-870. [ 611.335358] iscsi: registered transport (iser) [ 611.387969] nvmet: adding nsid 1 to subsystem nvme-test [ 611.403878] eno2 speed is unknown, defaulting to 1000 [ 611.409276] eno3 speed is unknown, defaulting to 1000 [ 611.414522] eno4 speed is unknown, defaulting to 1000 [ 611.468712] Rounding down aligned max_sectors from 4294967295 to 4294967288 [ 611.469203] db_root: cannot open: /etc/target [ 611.515816] nvmet_rdma: enabling port 1 (10.16.221.116:7777) [ 611.622158] eno2 speed is unknown, defaulting to 1000 [ 611.628243] eno3 speed is unknown, defaulting to 1000 [ 611.633422] eno4 speed is unknown, defaulting to 1000 [ 611.703648] eno2 speed is unknown, defaulting to 1000 [ 611.708949] eno3 speed is unknown, defaulting to 1000 [ 611.714267] eno4 speed is unknown, defaulting to 1000 [ 611.754520] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: address resolved (0): status 0 id 00000000d0d64a6d [ 611.765454] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: route resolved (2): status 0 id 00000000d0d64a6d [ 611.767778] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma: connect request (4): status 0 id 000000007ccc375a [ 611.767935] nvmet_rdma:nvmet_rdma_find_get_device: nvmet_rdma: added eno1_rxe. [ 611.777083] nvmet_rdma:nvmet_rdma_create_queue_ib: nvmet_rdma: nvmet_rdma_create_queue_ib: max_cqe= 8191 max_sge= 32 sq_size = 289 cm_id= 000000007ccc375a --snip-- [ 612.132317] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: established (9): status 0 id 00000000cede4b76 [ 612.132452] nvme nvme0: mapped 32/0/0 default/read/poll queues. [ 612.132537] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma: established (9): status 0 id 0000000070c2f22c [ 612.157849] nvmet:nvmet_execute_io_connect: nvmet: adding queue 1 to ctrl 1. [ 612.158194] nvmet:nvmet_execute_io_connect: nvmet: adding queue 2 to ctrl 1. [ 612.158548] nvmet:nvmet_execute_io_connect: nvmet: adding queue 3 to ctrl 1. [ 612.158854] nvmet:nvmet_execute_io_connect: nvmet: adding queue 4 to ctrl 1. [ 612.159173] nvmet:nvmet_execute_io_connect: nvmet: adding queue 5 to ctrl 1. [ 612.159510] nvmet:nvmet_execute_io_connect: nvmet: adding queue 6 to ctrl 1. [ 612.159853] nvmet:nvmet_execute_io_connect: nvmet: adding queue 7 to ctrl 1. [ 612.160225] nvmet:nvmet_execute_io_connect: nvmet: adding queue 8 to ctrl 1. [ 612.160572] nvmet:nvmet_execute_io_connect: nvmet: adding queue 9 to ctrl 1. [ 612.160885] nvmet:nvmet_execute_io_connect: nvmet: adding queue 10 to ctrl 1. [ 612.161216] nvmet:nvmet_execute_io_connect: nvmet: adding queue 11 to ctrl 1. [ 612.161759] nvmet:nvmet_execute_io_connect: nvmet: adding queue 12 to ctrl 1. [ 612.162123] nvmet:nvmet_execute_io_connect: nvmet: adding queue 13 to ctrl 1. [ 612.162473] nvmet:nvmet_execute_io_connect: nvmet: adding queue 14 to ctrl 1. [ 612.162793] nvmet:nvmet_execute_io_connect: nvmet: adding queue 15 to ctrl 1. [ 612.163135] nvmet:nvmet_execute_io_connect: nvmet: adding queue 16 to ctrl 1. [ 612.163458] nvmet:nvmet_execute_io_connect: nvmet: adding queue 17 to ctrl 1. [ 612.163776] nvmet:nvmet_execute_io_connect: nvmet: adding queue 18 to ctrl 1. [ 612.164103] nvmet:nvmet_execute_io_connect: nvmet: adding queue 19 to ctrl 1. [ 612.164417] nvmet:nvmet_execute_io_connect: nvmet: adding queue 20 to ctrl 1. [ 612.164700] nvmet:nvmet_execute_io_connect: nvmet: adding queue 21 to ctrl 1. [ 612.164971] nvmet:nvmet_execute_io_connect: nvmet: adding queue 22 to ctrl 1. [ 612.165275] nvmet:nvmet_execute_io_connect: nvmet: adding queue 23 to ctrl 1. [ 612.165689] nvmet:nvmet_execute_io_connect: nvmet: adding queue 24 to ctrl 1. [ 612.166082] nvmet:nvmet_execute_io_connect: nvmet: adding queue 25 to ctrl 1. [ 612.166383] nvmet:nvmet_execute_io_connect: nvmet: adding queue 26 to ctrl 1. [ 612.166663] nvmet:nvmet_execute_io_connect: nvmet: adding queue 27 to ctrl 1. [ 612.166963] nvmet:nvmet_execute_io_connect: nvmet: adding queue 28 to ctrl 1. [ 612.167273] nvmet:nvmet_execute_io_connect: nvmet: adding queue 29 to ctrl 1. [ 612.167529] nvmet:nvmet_execute_io_connect: nvmet: adding queue 30 to ctrl 1. [ 612.167804] nvmet:nvmet_execute_io_connect: nvmet: adding queue 31 to ctrl 1. [ 612.168114] nvmet:nvmet_execute_io_connect: nvmet: adding queue 32 to ctrl 1. [ 612.168523] nvme nvme0: new ctrl: NQN "nvme-test", addr 10.16.221.116:7777 [ 612.293734] device-mapper: multipath service-time: version 0.3.0 loaded [ 612.296099] device-mapper: table: 253:3: multipath: error getting device (-EBUSY) [ 612.303676] device-mapper: ioctl: error adding target to table [ 612.501014] device-mapper: table: 253:3: multipath: error getting device (-EBUSY) [ 612.508751] device-mapper: ioctl: error adding target to table [ 612.514717] device-mapper: table: 253:3: multipath: error getting device (-EBUSY) [ 612.522632] device-mapper: ioctl: error adding target to table [ 612.950887] device-mapper: table: 253:3: multipath: error getting device (-EBUSY) [ 612.958616] device-mapper: ioctl: error adding target to table [ 613.869348] EXT4-fs (dm-3): mounted filesystem without journal. Opts: (null). Quota mode: none. [ 613.869619] ext4 filesystem being mounted at /root/kernel-tests/storage/blktests/nvme/nvmeof-mp/blktests/results/tmpdir.nvmeof-mp.002.ZYO/mnt0 supports timestamps until 2038 (0x7fffffff) [ 617.144570] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule traffic based keep-alive timer [ 617.272414] nvmet:nvmet_execute_keep_alive: nvmet: ctrl 1 update keep-alive timer for 5 secs [ 622.775724] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule traffic based keep-alive timer [ 627.895810] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule traffic based keep-alive timer [ 633.015369] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule traffic based keep-alive timer [ 638.138209] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule traffic based keep-alive timer [ 643.254966] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule traffic based keep-alive timer [ 644.822178] eno2 speed is unknown, defaulting to 1000 [ 644.827494] eno3 speed is unknown, defaulting to 1000 [ 644.832797] eno4 speed is unknown, defaulting to 1000 [ 648.377790] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule traffic based keep-alive timer [ 651.364570] nvme nvme0: Removing ctrl: NQN "nvme-test" [ 651.430891] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma: disconnected (10): status 0 id 000000009ae8bc67 [ 651.430914] nvmet_rdma:__nvmet_rdma_queue_disconnect: nvmet_rdma: cm_id= 000000009ae8bc67 queue->state= 1 [ 651.430983] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma: disconnected (10): status 0 id 0000000078827089 [ 651.430994] nvmet_rdma:__nvmet_rdma_queue_disconnect: nvmet_rdma: cm_id= 0000000078827089 queue->state= 1 --snip-- [ 651.438145] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: disconnected (10): status 0 id 00000000cede4b76 [ 651.438150] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: disconnect received - connection closed [ 651.448883] ====================================================== [ 651.455061] WARNING: possible circular locking dependency detected [ 651.461242] 5.16.0-rc3+ #2 Tainted: G S I [ 651.466380] ------------------------------------------------------ [ 651.472559] kworker/14:4/4839 is trying to acquire lock: [ 651.477873] ffff8884c0e353e0 (&id_priv->handler_mutex){+.+.}-{3:3}, at: rdma_destroy_id+0x17/0x20 [rdma_cm] [ 651.487623] but task is already holding lock: [ 651.493453] ffff8883cc21fe00 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x888/0x1740 [ 651.504062] which lock already depends on the new lock. [ 651.512233] the existing dependency chain (in reverse order) is: [ 651.519715] -> #3 ((work_completion)(&queue->release_work)){+.+.}-{0:0}: [ 651.527894] lock_acquire+0x1d2/0x5a0 [ 651.532079] process_one_work+0x8da/0x1740 [ 651.536700] worker_thread+0x87/0xbf0 [ 651.540886] kthread+0x3aa/0x490 [ 651.544637] ret_from_fork+0x1f/0x30 [ 651.548737] -> #2 ((wq_completion)events){+.+.}-{0:0}: [ 651.555349] lock_acquire+0x1d2/0x5a0 [ 651.559535] flush_workqueue+0x109/0x15f0 [ 651.564067] nvmet_rdma_queue_connect+0x1c79/0x2610 [nvmet_rdma] [ 651.570594] cma_cm_event_handler+0xef/0x530 [rdma_cm] [ 651.576261] cma_ib_req_handler+0x1a76/0x4a50 [rdma_cm] [ 651.582016] cm_process_work+0x44/0x190 [ib_cm] [ 651.587077] cm_req_handler+0x2218/0x7040 [ib_cm] [ 651.592312] cm_work_handler+0x1acf/0x4a30 [ib_cm] [ 651.597633] process_one_work+0x978/0x1740 [ 651.602253] worker_thread+0x87/0xbf0 [ 651.606437] kthread+0x3aa/0x490 [ 651.610190] ret_from_fork+0x1f/0x30 [ 651.614290] -> #1 (&id_priv->handler_mutex/1){+.+.}-{3:3}: [ 651.621249] lock_acquire+0x1d2/0x5a0 [ 651.625435] __mutex_lock+0x155/0x15f0 [ 651.629708] cma_ib_req_handler+0x1709/0x4a50 [rdma_cm] [ 651.635462] cm_process_work+0x44/0x190 [ib_cm] [ 651.640521] cm_req_handler+0x2218/0x7040 [ib_cm] [ 651.645749] cm_work_handler+0x1acf/0x4a30 [ib_cm] [ 651.651071] process_one_work+0x978/0x1740 [ 651.655689] worker_thread+0x87/0xbf0 [ 651.659874] kthread+0x3aa/0x490 [ 651.663628] ret_from_fork+0x1f/0x30 [ 651.667726] -> #0 (&id_priv->handler_mutex){+.+.}-{3:3}: [ 651.674511] check_prevs_add+0x3fd/0x2480 [ 651.679045] __lock_acquire+0x23f7/0x2f80 [ 651.683577] lock_acquire+0x1d2/0x5a0 [ 651.687762] __mutex_lock+0x155/0x15f0 [ 651.692036] rdma_destroy_id+0x17/0x20 [rdma_cm] [ 651.697184] nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma] [ 651.703189] nvmet_rdma_release_queue_work+0x3b/0x90 [nvmet_rdma] [ 651.709802] process_one_work+0x978/0x1740 [ 651.714422] worker_thread+0x87/0xbf0 [ 651.718608] kthread+0x3aa/0x490 [ 651.722358] ret_from_fork+0x1f/0x30 [ 651.726459] other info that might help us debug this: [ 651.734457] Chain exists of: &id_priv->handler_mutex --> (wq_completion)events --> (work_completion)(&queue->release_work) [ 651.748531] Possible unsafe locking scenario: [ 651.754451] CPU0 CPU1 [ 651.758984] ---- ---- [ 651.763515] lock((work_completion)(&queue->release_work)); [ 651.769174] lock((wq_completion)events); [ 651.775786] lock((work_completion)(&queue->release_work)); [ 651.783959] lock(&id_priv->handler_mutex); [ 651.788233] *** DEADLOCK *** [ 651.794153] 2 locks held by kworker/14:4/4839: [ 651.798596] #0: ffff88810005b948 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x854/0x1740 [ 651.808078] #1: ffff8883cc21fe00 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x888/0x1740 [ 651.819121] stack backtrace: [ 651.823478] CPU: 14 PID: 4839 Comm: kworker/14:4 Tainted: G S I 5.16.0-rc3+ #2 [ 651.831824] Hardware name: Dell Inc. PowerEdge R640/06NR82, BIOS 2.11.2 004/21/2021 [ 651.839476] Workqueue: events nvmet_rdma_release_queue_work [nvmet_rdma] [ 651.846175] Call Trace: [ 651.848628] <TASK> [ 651.850735] dump_stack_lvl+0x44/0x57 [ 651.854401] check_noncircular+0x280/0x320 [ 651.858501] ? print_circular_bug.isra.47+0x430/0x430 [ 651.863553] ? lock_chain_count+0x20/0x20 [ 651.867566] ? wait_for_completion+0x1e0/0x270 [ 651.872010] ? mark_lock.part.52+0xf7/0x1050 [ 651.876283] ? sched_clock_cpu+0x15/0x200 [ 651.880295] ? find_held_lock+0x3a/0x1c0 [ 651.884223] check_prevs_add+0x3fd/0x2480 [ 651.888238] ? mark_held_locks+0xb7/0x120 [ 651.892248] ? check_irq_usage+0xb50/0xb50 [ 651.896346] ? lockdep_lock+0xcb/0x1c0 [ 651.900100] ? static_obj+0xc0/0xc0 [ 651.903591] ? sched_clock_cpu+0x15/0x200 [ 651.907606] __lock_acquire+0x23f7/0x2f80 [ 651.911620] ? rcu_read_lock_bh_held+0xc0/0xc0 [ 651.916063] lock_acquire+0x1d2/0x5a0 [ 651.919729] ? rdma_destroy_id+0x17/0x20 [rdma_cm] [ 651.924529] ? rcu_read_unlock+0x50/0x50 [ 651.928456] ? lock_is_held_type+0xd9/0x130 [ 651.932644] __mutex_lock+0x155/0x15f0 [ 651.936393] ? rdma_destroy_id+0x17/0x20 [rdma_cm] [ 651.941194] ? kasan_quarantine_put+0xa4/0x180 [ 651.945641] ? rdma_destroy_id+0x17/0x20 [rdma_cm] [ 651.950442] ? mutex_lock_io_nested+0x1460/0x1460 [ 651.955146] ? lock_is_held_type+0xd9/0x130 [ 651.959334] ? rcu_read_lock_sched_held+0xaf/0xe0 [ 651.964039] ? rcu_read_lock_bh_held+0xc0/0xc0 [ 651.968487] ? rdma_destroy_id+0x17/0x20 [rdma_cm] [ 651.973287] rdma_destroy_id+0x17/0x20 [rdma_cm] [ 651.977915] nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma] [ 651.983401] nvmet_rdma_release_queue_work+0x3b/0x90 [nvmet_rdma] [ 651.989493] process_one_work+0x978/0x1740 [ 651.993595] ? pwq_dec_nr_in_flight+0x270/0x270 [ 651.998126] worker_thread+0x87/0xbf0 [ 652.001790] ? __kthread_parkme+0xc3/0x1d0 [ 652.005890] ? process_one_work+0x1740/0x1740 [ 652.010248] kthread+0x3aa/0x490 [ 652.013480] ? set_kthread_struct+0x100/0x100 [ 652.017841] ret_from_fork+0x1f/0x30 [ 652.021424] </TASK> [ 652.050739] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma: disconnected (10): status 0 id 000000007ccc375a [ 652.060318] nvmet_rdma:__nvmet_rdma_queue_disconnect: nvmet_rdma: cm_id= 000000007ccc375a queue->state= 1 [ 652.069937] nvmet_rdma:nvmet_rdma_free_queue: nvmet_rdma: freeing queue 0 [ 652.082650] nvmet:nvmet_stop_keep_alive_timer: nvmet: ctrl 1 stop keep-alive [ 652.488476] eno2 speed is unknown, defaulting to 1000 [ 652.493657] eno3 speed is unknown, defaulting to 1000 [ 652.498847] eno4 speed is unknown, defaulting to 1000 [ 652.558751] rdma_rxe: rxe-ah pool destroyed with unfree'd elem [ 652.678415] rdma_rxe: unloaded