We ran this patch series through our regression and NVMe/FC boot from SAN
tests.  (I think 4/5 will also resolve some instability / race condition
with kickstart, since the connect is now synchronous, but we may need some
more time to fully validate that.  I was going to pursue a userspace
solution, but this may be good enough; a rough sketch of the synchronous
shape as I read it is at the bottom of this mail, below the quote.)

Notwithstanding the other comments re: error codes, etc., feel free to add:

Reviewed-by: Ewan D. Milne <emilne@xxxxxxxxxx>
Tested-by: Marco Patalano <mpatalan@xxxxxxxxxx>

On Tue, Jun 20, 2023 at 9:39 AM Daniel Wagner <dwagner@xxxxxxx> wrote:
>
> With the additional 3 patches and the blktests changes [1] I am able to
> run the tests in a loop for a while without failures. I expect there are
> more problems hidden, but with these changes the first milestone is
> reached.
>
> The first new patch is actually one from a previous attempt. It
> addresses the problem in the unloading path where we don't make any
> progress anymore [2].
>
> The next two patches change the existing initial connection attempt of
> FC to a synchronous one. Without this the auth tests fail [3].
>
> Daniel
>
>
> [1] https://lore.kernel.org/linux-nvme/20230620132703.20648-1-dwagner@xxxxxxx/
> [2] https://lore.kernel.org/linux-nvme/20230418130159.11075-1-dwagner@xxxxxxx/
> [3] https://lore.kernel.org/linux-nvme/j4w7724skjsapgtp6wtll5fgsb4adhpfdtrif52lhc7iql4inf@3mu2gbjrrtcs/
>
> changes:
>
> v2:
>  - added RBs
>  - added new patches
>
> v1:
>  - https://lore.kernel.org/linux-nvme/20230615094356.14878-1-dwagner@xxxxxxx/
>
>
> Initial cover letter:
>
> A couple more fixes to enable blktests for the fc transport.
>
> 1) Another module unloading hang which can be triggered with
>    the nvme/048 tests:
>
> run blktests nvme/048 at 2023-06-06 13:04:49
> nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
> (NULL device *): {0:0} Association created
> [478] nvmet: ctrl 1 start keep-alive timer for 1 secs
> [478] nvmet: check nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a
> [478] nvmet: nvmet_setup_dhgroup: ctrl 1 selecting dhgroup 0
> [478] nvmet: No authentication provided
> nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a.
> [478] nvmet: adding queue 1 to ctrl 1.
> [407] nvmet: adding queue 2 to ctrl 1.
> [6549] nvmet: adding queue 3 to ctrl 1.
> [1760] nvmet: adding queue 4 to ctrl 1.
> nvme nvme2: NVME-FC{0}: controller connect complete
> nvme nvme2: NVME-FC{0}: new ctrl: NQN "blktests-subsystem-1"
> [478] nvmet: ctrl 1 reschedule traffic based keep-alive timer
> [407] nvmet: ctrl 1 update keep-alive timer for 1 secs
> nvme nvme2: NVME-FC{0}: io failed due to lldd error 6
> nvme nvme2: NVME-FC{0}: transport association event: transport detected io error
> nvme nvme2: NVME-FC{0}: resetting controller
> [478] nvmet: ctrl 1 stop keep-alive
> (NULL device *): {0:0} Association deleted
> nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
> (NULL device *): {0:0} Association freed
> (NULL device *): {0:0} Association created
> (NULL device *): Disconnect LS failed: No Association
> [407] nvmet: ctrl 1 start keep-alive timer for 1 secs
> [407] nvmet: check nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a
> [407] nvmet: nvmet_setup_dhgroup: ctrl 1 selecting dhgroup 0
> [407] nvmet: No authentication provided
> nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a.
> nvme nvme2: reconnect: revising io queue count from 4 to 1
> [478] nvmet: adding queue 1 to ctrl 1.
> nvme nvme2: NVME-FC{0}: controller connect complete
> [478] nvmet: ctrl 1 reschedule traffic based keep-alive timer
> [478] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [478] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [407] nvmet: ctrl 1 update keep-alive timer for 1 secs
> nvme nvme2: NVME-FC{0}: io failed due to lldd error 6
> nvme nvme2: NVME-FC{0}: transport association event: transport detected io error
> nvme nvme2: NVME-FC{0}: resetting controller
> [478] nvmet: ctrl 1 stop keep-alive
> (NULL device *): {0:0} Association deleted
> (NULL device *): {0:0} Association freed
> (NULL device *): Disconnect LS failed: No Association
> nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
> (NULL device *): {0:0} Association created
> [1760] nvmet: ctrl 1 start keep-alive timer for 1 secs
> [1760] nvmet: check nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a
> [1760] nvmet: nvmet_setup_dhgroup: ctrl 1 selecting dhgroup 0
> [1760] nvmet: No authentication provided
> nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a.
> nvme nvme2: reconnect: revising io queue count from 1 to 4
> [478] nvmet: adding queue 1 to ctrl 1.
> [407] nvmet: adding queue 2 to ctrl 1.
> [6549] nvmet: adding queue 3 to ctrl 1.
> [1760] nvmet: adding queue 4 to ctrl 1.
> nvme nvme2: NVME-FC{0}: controller connect complete
> [1760] nvmet: ctrl 1 reschedule traffic based keep-alive timer
> [478] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [478] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [478] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [478] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
> nvme nvme2: Removing ctrl: NQN "blktests-subsystem-1"
> [407] nvmet: ctrl 1 stop keep-alive
> (NULL device *): {0:0} Association deleted
> (NULL device *): {0:0} Association freed
> (NULL device *): Disconnect LS failed: No Association
> nvme_fc: nvme_fc_exit_module: waiting for ctlr deletes
> BTRFS info (device vda2): qgroup scan completed (inconsistency flag cleared)
> INFO: task modprobe:11758 blocked for more than 491 seconds.
> Tainted: G W 6.4.0-rc2+ #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:modprobe state:D stack:0 pid:11758 ppid:11585 flags:0x00004002
> Call Trace:
>  <TASK>
>  __schedule+0x17b5/0x4870
>  ? finish_lock_switch+0x8f/0x120
>  ? finish_task_switch+0x17f/0x5c0
>  ? __sched_text_start+0x10/0x10
>  ? __schedule+0x17bd/0x4870
>  ? do_raw_spin_trylock+0xc9/0x1f0
>  schedule+0xe6/0x1e0
>  schedule_timeout+0x7e/0x1e0
>  ? __cfi_schedule_timeout+0x10/0x10
>  ? do_raw_spin_trylock+0xc9/0x1f0
>  ? __cfi_lock_release+0x10/0x10
>  ? do_raw_spin_unlock+0x116/0x8a0
>  wait_for_common+0x377/0x600
>  ? wait_for_completion+0x30/0x30
>  cleanup_module+0x222/0x240 [nvme_fc bda1ef6f83d00208ba257c5d0ac34dd6bdb58bf9]
>  __se_sys_delete_module+0x388/0x550
>  ? __x64_sys_delete_module+0x50/0x50
>  ? task_work_run+0x236/0x290
>  ? syscall_enter_from_user_mode+0x2e/0x210
>  do_syscall_64+0x6e/0xa0
>  ? syscall_exit_to_user_mode+0x5e/0x220
>  ? do_syscall_64+0x7d/0xa0
>  ? syscall_exit_to_user_mode+0x5e/0x220
>  ? do_syscall_64+0x7d/0xa0
>  ? syscall_exit_to_user_mode+0x5e/0x220
>  ? do_syscall_64+0x7d/0xa0
>  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> RIP: 0033:0x7fa80811aebb
> RSP: 002b:00007fff80ea0a88 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> RAX: ffffffffffffffda RBX: 000055a0b5acf1f0 RCX: 00007fa80811aebb
> RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055a0b5acf258
> RBP: 0000000000000000 R08: 1999999999999999 R09: 0000000000000000
> R10: 00007fa808193ac0 R11: 0000000000000206 R12: 00007fff80ea0ad0
> R13: 000055a0b5acf1f0 R14: 0000000000000000 R15: 0000000000000000
>  </TASK>
>
>
> 2) When executing blktests nvme/030 in a tight loop, I was able to
>    reproduce a different hang. In this case the ->done() function
>    was never executed on the host fc side. It turns out we didn't
>    enqueue work items correctly and thus would only process one
>    work item.
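That lost-kick pattern is easy to hit: queue_work() does nothing (and
returns false) if the work_struct is already pending, so a handler that
pops only a single entry per invocation strands whatever was queued in
the meantime.  A reduced illustration of the pattern as I understand it
(the demo_* names are mine, this is not the actual fcloop code):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct demo_req {
        struct list_head entry;
        void (*done)(struct demo_req *req);
};

static LIST_HEAD(demo_req_list);
static DEFINE_SPINLOCK(demo_req_lock);

static void demo_work_fn(struct work_struct *work)
{
        struct demo_req *req;

        /* Drain everything: entries added while this work was pending
         * got no extra kick, so popping a single one per invocation
         * (the buggy shape) leaves the rest stranded, ->done() unrun.
         */
        for (;;) {
                spin_lock_irq(&demo_req_lock);
                req = list_first_entry_or_null(&demo_req_list,
                                               struct demo_req, entry);
                if (req)
                        list_del(&req->entry);
                spin_unlock_irq(&demo_req_lock);
                if (!req)
                        break;
                req->done(req);
        }
}

static DECLARE_WORK(demo_work, demo_work_fn);

static void demo_queue_req(struct demo_req *req)
{
        spin_lock_irq(&demo_req_lock);
        list_add_tail(&req->entry, &demo_req_list);
        spin_unlock_irq(&demo_req_lock);

        /* No-op if demo_work is already pending. */
        queue_work(system_wq, &demo_work);
}

Draining the list in the handler (or giving each request its own work
item) closes that window; I'm assuming the fcloop fix takes one of
those two shapes.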
>
> run blktests nvme/030 at 2023-06-13 14:03:52
> nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
> (NULL device *): {0:0} Association created
> [23733] nvmet: ctrl 1 start keep-alive timer for 120 secs
> nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:242d4a24-2484-4a80-8234-d0169409c5e8.
> nvme nvme2: NVME-FC{0}: controller connect complete
> nvme nvme2: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
> nvme nvme2: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
> [23771] nvmet: ctrl 1 stop keep-alive
> nvmet: adding nsid 1 to subsystem blktests-subsystem-2
> (NULL device *): {0:0} Association deleted
> (NULL device *): {0:0} Association freed
> (NULL device *): Disconnect LS failed: No Association
> nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
> (NULL device *): {0:0} Association created
> [27320] nvmet: ctrl 1 start keep-alive timer for 120 secs
> nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:242d4a24-2484-4a80-8234-d0169409c5e8.
> nvme nvme2: NVME-FC{0}: controller connect complete
> nvme nvme2: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
> nvme nvme2: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
> [27320] nvmet: ctrl 1 stop keep-alive
> nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
> (NULL device *): {0:1} Association created
> (NULL device *): {0:0} Association deleted
> (NULL device *): {0:0} Association freed
> (NULL device *): Disconnect LS failed: No Association
> INFO: task kworker/u8:4:102 blocked for more than 491 seconds.
> Tainted: G W 6.4.0-rc2+ #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u8:4 state:D stack:0 pid:102 ppid:2 flags:0x00004000
> Workqueue: nvme-wq nvme_fc_connect_ctrl_work [nvme_fc]
> Call Trace:
>  <TASK>
>  __schedule+0x17b5/0x4870
>  ? try_to_wake_up+0xb5/0xfc0
>  ? __cfi_lock_release+0x10/0x10
>  ? __sched_text_start+0x10/0x10
>  ? _raw_spin_unlock_irqrestore+0x24/0x50
>  ? lock_release+0x2aa/0xd10
>  ? wq_worker_sleeping+0x1e/0x200
>  schedule+0xe6/0x1e0
>  schedule_timeout+0x7e/0x1e0
>  ? __cfi_schedule_timeout+0x10/0x10
>  ? mark_lock+0x94/0x340
>  ? lockdep_hardirqs_on_prepare+0x2aa/0x5e0
>  wait_for_common+0x377/0x600
>  ? queue_work_on+0x57/0xa0
>  ? wait_for_completion+0x30/0x30
>  nvme_fc_connect_ctrl_work+0x7dc/0x1a80 [nvme_fc 39e2bf78272df3a610fb1f9884513e99038af746]
>  process_one_work+0x80f/0xfb0
>  ? rescuer_thread+0x1130/0x1130
>  ? do_raw_spin_trylock+0xc9/0x1f0
>  ? lock_acquired+0x310/0x9a0
>  ? worker_thread+0xd5e/0x1260
>  worker_thread+0xbde/0x1260
>  kthread+0x25d/0x2f0
>  ? __cfi_worker_thread+0x10/0x10
>  ? __cfi_kthread+0x10/0x10
>  ret_from_fork+0x29/0x50
>  </TASK>
>
> Daniel Wagner (5):
>   nvme-fc: Do not wait in vain when unloading module
>   nvme-fcloop: queue work items correctly
>   nvmet-fcloop: Remove remote port from list when unlinking
>   nvme-fc: Make initial connect attempt synchronous
>   nvme-fc: do no free ctrl opts
>
>  drivers/nvme/host/fc.c       | 43 ++++++++++++++++++------------------
>  drivers/nvme/target/fcloop.c | 15 +++++--------
>  2 files changed, 27 insertions(+), 31 deletions(-)
>
> --
> 2.41.0
>
>
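To expand on my 4/5 comment at the top, here is the rough sketch I
mentioned, with the synchronous connect shape as I read it (the demo_*
names are mine, this is not the actual fc.c diff): the create path
waits for the outcome of the first connect attempt instead of
returning as soon as the work is scheduled, so "nvme connect" only
returns once the association is up or has definitively failed.

#include <linux/completion.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

struct demo_ctrl {
        struct work_struct connect_work;
        struct completion connect_done;
        int connect_err;
};

static void demo_connect_work_fn(struct work_struct *work)
{
        struct demo_ctrl *ctrl =
                container_of(work, struct demo_ctrl, connect_work);

        /* ... create the association, authenticate, set up queues ... */
        ctrl->connect_err = 0;
        complete(&ctrl->connect_done);
}

static int demo_create_ctrl(struct demo_ctrl *ctrl)
{
        INIT_WORK(&ctrl->connect_work, demo_connect_work_fn);
        init_completion(&ctrl->connect_done);
        queue_work(system_wq, &ctrl->connect_work);

        /* The old shape returned to the caller here; waiting instead
         * means the status of the first attempt reaches userspace.
         */
        wait_for_completion(&ctrl->connect_done);
        return ctrl->connect_err;
}

That ordering is also why I expect 4/5 to help the kickstart case: the
installer can no longer race ahead of a connect that is still in
flight.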
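Similarly for 1/5, my reading of the "nvme_fc_exit_module: waiting for
ctlr deletes" hang in the nvme/048 trace above is that module exit
blocked on the completion unconditionally, so with no controller left
nothing ever calls complete().  A minimal sketch of the "do not wait
in vain" shape, again with hypothetical names rather than the actual
patch:

#include <linux/completion.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/printk.h>
#include <linux/spinlock.h>

static LIST_HEAD(demo_ctrl_list);
static DEFINE_SPINLOCK(demo_ctrl_lock);
static DECLARE_COMPLETION(demo_ctrls_freed);

static void __exit demo_exit_module(void)
{
        bool wait;

        spin_lock(&demo_ctrl_lock);
        wait = !list_empty(&demo_ctrl_list);
        spin_unlock(&demo_ctrl_lock);

        /* Only block when a controller is actually left to delete;
         * otherwise nothing will ever signal demo_ctrls_freed and
         * modprobe -r hangs exactly as in the trace above.
         */
        if (wait) {
                pr_info("waiting for ctrl deletes\n");
                wait_for_completion(&demo_ctrls_freed);
        }
}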