Hi all,

(I added the nbd list to the To list since an nbd test case failure is reported.)

I ran the latest blktests (git hash: 607513e64e48) with the v6.9-rc1 kernel, and observed four failures. I also checked the CKI project blktests runs with the v6.9-rc1 kernel and found one more failure. In total, five failure symptoms were observed, as listed below.

One notable difference from the v6.8 kernel [1] is block/011: its failure with NVMe devices is no longer observed. However, exactly the same failure symptom is now observed with another test case, nvme/050. I guess some kernel change between v6.8 and v6.9-rc1 moved the symptom from one test case to the other.

[1] https://lore.kernel.org/linux-block/nnphuvhpwdqjwi3mdisom7iuj2qnxkf4ldzzzfzy63bfet6mas@2jcd4jzzzteu/

List of failures
================
#1: nvme/041,044 (fc transport)
#2: nvme/050
#3: srp/002, 011 (rdma_rxe driver)
#4: nbd/002
#5: zbd/010 (CKI failure)

Failure description
===================
#1: nvme/041,044 (fc transport)

With the trtype=fc configuration, nvme/041 and 044 fail with similar error messages:

nvme/041 (Create authenticated connections)                 [failed]
    runtime  2.677s  ...  4.823s
    --- tests/nvme/041.out       2023-11-29 12:57:17.206898664 +0900
    +++ /home/shin/Blktests/blktests/results/nodev/nvme/041.out.bad        2024-03-19 14:50:56.399101323 +0900
    @@ -2,5 +2,5 @@
     Test unauthenticated connection (should fail)
     disconnected 0 controller(s)
     Test authenticated connection
    -disconnected 1 controller(s)
    +disconnected 0 controller(s)
     Test complete

nvme/044 (Test bi-directional authentication)               [failed]
    runtime  4.740s  ...  7.482s
    --- tests/nvme/044.out       2023-11-29 12:57:17.212898647 +0900
    +++ /home/shin/Blktests/blktests/results/nodev/nvme/044.out.bad        2024-03-19 14:51:08.062067741 +0900
    @@ -4,7 +4,7 @@
     Test invalid ctrl authentication (should fail)
     disconnected 0 controller(s)
     Test valid ctrl authentication
    -disconnected 1 controller(s)
    +disconnected 0 controller(s)
     Test invalid ctrl key (should fail)
     disconnected 0 controller(s)
    ...
    (Run 'diff -u tests/nvme/044.out /home/shin/Blktests/blktests/results/nodev/nvme/044.out.bad' to see the entire diff)

A solution was suggested and discussed in Feb/2024 [2].

[2] https://lore.kernel.org/linux-nvme/20240221132404.6311-1-dwagner@xxxxxxx/

#2: nvme/050

The test case fails occasionally with a QEMU NVMe device. The failure cause is a lockdep WARN among ctrl->namespaces_rwsem, dev->shutdown_lock and a workqueue work completion [9] (a minimal sketch of the lock cycle follows the lockdep report below). This WARN is the same as the one observed with block/011 for the kernel versions from v6.0 to v6.8, which was reported in Sep/2022 [3].

[3] https://lore.kernel.org/linux-block/20220930001943.zdbvolc3gkekfmcv@shindev/

#3: srp/002, 011 (rdma_rxe driver)

A test process hang is observed occasionally. It was reported to the relevant mailing lists in Aug/2023 [4]. Blktests was modified to change the default driver from rdma_rxe to siw to avoid impact on blktests users. The root cause is not yet understood.

[4] https://lore.kernel.org/linux-rdma/18a3ae8c-145b-4c7f-a8f5-67840feeb98c@xxxxxxx/T/#mee9882c2cfd0cfff33caa04e75418576f4c7a789

#4: nbd/002

The test case fails occasionally. CKI also reported this failure for the v6.8 kernel in the past. A blktests patch reduces the failure ratio, but does not avoid the failure completely [5]. Further investigation is needed.

nbd/002 (tests on partition handling for an nbd device)     [failed]
    runtime    ...  0.414s
    --- tests/nbd/002.out        2024-02-19 19:25:07.453721466 +0900
    +++ /home/shin/kts/kernel-test-suite/sets/blktests/log/runlog/nodev/nbd/002.out.bad    2024-03-19 14:53:56.320218177 +0900
    @@ -1,4 +1,4 @@
     Running nbd/002
     Testing IOCTL path
     Testing the netlink path
    -Test complete
    +Didn't have partition on the netlink path

[5] https://lore.kernel.org/linux-block/lecgttcqtjsolotcowjvqwsyrphe7fucllkrzupc5guf2gbq5z@grmwusovo22b/

#5: zbd/010 (CKI failure)

CKI observed the failure [6], and Yi Zhang reported it to relevant mailing lists [7]. Chao Yu provided a fix patch in F2FS [8].

[6] https://datawarehouse.cki-project.org/issue/2508
[7] https://lore.kernel.org/linux-f2fs-devel/CAHj4cs-kfojYC9i0G73PRkYzcxCTex=-vugRFeP40g_URGvnfQ@xxxxxxxxxxxxxx/
[8] https://lore.kernel.org/linux-f2fs-devel/20240327074223.2216487-1-chao@xxxxxxxxxx/

[9]

[ 1907.056570] ======================================================
[ 1907.058993] WARNING: possible circular locking dependency detected
[ 1907.061446] 6.9.0-rc1 #158 Not tainted
[ 1907.063639] ------------------------------------------------------
[ 1907.066017] kworker/u16:3/2996 is trying to acquire lock:
[ 1907.068423] ffff888120ab39b8 ((work_completion)(&q->timeout_work)){+.+.}-{0:0}, at: __flush_work+0x48f/0xa40
[ 1907.071311] but task is already holding lock:
[ 1907.075601] ffff8881389f44d8 (&ctrl->namespaces_rwsem){++++}-{3:3}, at: nvme_sync_queues+0x26/0x100 [nvme_core]
[ 1907.079089] which lock already depends on the new lock.
[ 1907.085827] the existing dependency chain (in reverse order) is:
[ 1907.090475] -> #2 (&ctrl->namespaces_rwsem){++++}-{3:3}:
[ 1907.094962]        down_read+0x9d/0x480
[ 1907.097356]        nvme_start_freeze+0x3d/0xd0 [nvme_core]
[ 1907.099900]        nvme_dev_disable+0x57c/0x6b0 [nvme]
[ 1907.102365]        nvme_timeout+0x8a9/0x11f0 [nvme]
[ 1907.104854]        blk_mq_handle_expired+0x1a4/0x270
[ 1907.107065]        bt_iter+0x22a/0x310
[ 1907.109108]        blk_mq_queue_tag_busy_iter+0x809/0x1380
[ 1907.111374]        blk_mq_timeout_work+0x3b9/0x550
[ 1907.113582]        process_one_work+0x802/0x1720
[ 1907.115707]        worker_thread+0x720/0x1300
[ 1907.117805]        kthread+0x2ec/0x3d0
[ 1907.119851]        ret_from_fork+0x30/0x70
[ 1907.121981]        ret_from_fork_asm+0x1a/0x30
[ 1907.124167] -> #1 (&dev->shutdown_lock){+.+.}-{3:3}:
[ 1907.128058]        __mutex_lock+0x18d/0x1480
[ 1907.130117]        nvme_dev_disable+0x9b/0x6b0 [nvme]
[ 1907.132239]        nvme_timeout+0x8a9/0x11f0 [nvme]
[ 1907.134383]        blk_mq_handle_expired+0x1a4/0x270
[ 1907.136500]        bt_iter+0x22a/0x310
[ 1907.138393]        blk_mq_queue_tag_busy_iter+0x809/0x1380
[ 1907.140515]        blk_mq_timeout_work+0x3b9/0x550
[ 1907.142490]        process_one_work+0x802/0x1720
[ 1907.144459]        worker_thread+0x720/0x1300
[ 1907.146355]        kthread+0x2ec/0x3d0
[ 1907.148235]        ret_from_fork+0x30/0x70
[ 1907.150093]        ret_from_fork_asm+0x1a/0x30
[ 1907.151741] -> #0 ((work_completion)(&q->timeout_work)){+.+.}-{0:0}:
[ 1907.153987]        __lock_acquire+0x2e98/0x5ee0
[ 1907.155190]        lock_acquire+0x1ac/0x520
[ 1907.156396]        __flush_work+0x4ac/0xa40
[ 1907.157483]        __cancel_work_sync+0x2c5/0x3f0
[ 1907.158586]        nvme_sync_queues+0x71/0x100 [nvme_core]
[ 1907.159772]        nvme_reset_work+0xaa/0x4a0 [nvme]
[ 1907.160903]        process_one_work+0x802/0x1720
[ 1907.161994]        worker_thread+0x720/0x1300
[ 1907.163054]        kthread+0x2ec/0x3d0
[ 1907.164175]        ret_from_fork+0x30/0x70
[ 1907.165187]        ret_from_fork_asm+0x1a/0x30
[ 1907.166199] other info that might help us debug this:
[ 1907.168719] Chain exists of:
                 (work_completion)(&q->timeout_work) --> &dev->shutdown_lock --> &ctrl->namespaces_rwsem
[ 1907.171733] Possible unsafe locking scenario:
[ 1907.173507]        CPU0                    CPU1
[ 1907.174512]        ----                    ----
[ 1907.175509]   rlock(&ctrl->namespaces_rwsem);
[ 1907.176490]                                lock(&dev->shutdown_lock);
[ 1907.177643]                                lock(&ctrl->namespaces_rwsem);
[ 1907.178784]   lock((work_completion)(&q->timeout_work));
[ 1907.179817]  *** DEADLOCK ***
[ 1907.182067] 4 locks held by kworker/u16:3/2996:
[ 1907.183055]  #0: ffff88811a053148 ((wq_completion)nvme-reset-wq){+.+.}-{0:0}, at: process_one_work+0xf57/0x1720
[ 1907.184498]  #1: ffff88810fd97d90 ((work_completion)(&dev->ctrl.reset_work)){+.+.}-{0:0}, at: process_one_work+0x77f/0x1720
[ 1907.186027]  #2: ffff8881389f44d8 (&ctrl->namespaces_rwsem){++++}-{3:3}, at: nvme_sync_queues+0x26/0x100 [nvme_core]
[ 1907.187564]  #3: ffffffffb8e79a20 (rcu_read_lock){....}-{1:2}, at: __flush_work+0xf5/0xa40
[ 1907.188948] stack backtrace:
[ 1907.190707] CPU: 3 PID: 2996 Comm: kworker/u16:3 Not tainted 6.9.0-rc1 #158
[ 1907.191954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
[ 1907.193367] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[ 1907.194546] Call Trace:
[ 1907.195486]  <TASK>
[ 1907.196380]  dump_stack_lvl+0x73/0xb0
[ 1907.197398]  check_noncircular+0x306/0x3f0
[ 1907.198435]  ? __pfx_check_noncircular+0x10/0x10
[ 1907.199516]  ? __pfx_lock_release+0x10/0x10
[ 1907.200576]  ? lockdep_lock+0xca/0x1c0
[ 1907.201592]  ? __pfx_lockdep_lock+0x10/0x10
[ 1907.202643]  ? __lock_acquire+0xcba/0x5ee0
[ 1907.203707]  __lock_acquire+0x2e98/0x5ee0
[ 1907.204768]  ? __pfx___lock_acquire+0x10/0x10
[ 1907.205855]  ? __pfx_mark_lock+0x10/0x10
[ 1907.206883]  lock_acquire+0x1ac/0x520
[ 1907.207940]  ? __flush_work+0x48f/0xa40
[ 1907.209032]  ? __pfx_lock_acquire+0x10/0x10
[ 1907.210139]  ? __pfx_lock_release+0x10/0x10
[ 1907.211193]  ? lockdep_init_map_type+0x169/0x7c0
[ 1907.212324]  ? mark_held_locks+0x96/0xe0
[ 1907.213349]  ? __flush_work+0x48f/0xa40
[ 1907.214361]  __flush_work+0x4ac/0xa40
[ 1907.215351]  ? __flush_work+0x48f/0xa40
[ 1907.216397]  ? __pfx___flush_work+0x10/0x10
[ 1907.217438]  ? __pfx_mark_lock+0x10/0x10
[ 1907.218458]  ? __pfx_wq_barrier_func+0x10/0x10
[ 1907.219544]  ? mark_held_locks+0x96/0xe0
[ 1907.220569]  __cancel_work_sync+0x2c5/0x3f0
[ 1907.221628]  ? __pfx___cancel_work_sync+0x10/0x10
[ 1907.222703]  ? __timer_delete_sync+0x17f/0x2a0
[ 1907.223777]  ? __pfx_lock_acquire+0x10/0x10
[ 1907.224805]  nvme_sync_queues+0x71/0x100 [nvme_core]
[ 1907.225898]  nvme_reset_work+0xaa/0x4a0 [nvme]
[ 1907.226949]  process_one_work+0x802/0x1720
[ 1907.228003]  ? __pfx_lock_acquire+0x10/0x10
[ 1907.229043]  ? __pfx_process_one_work+0x10/0x10
[ 1907.230089]  ? assign_work+0x16c/0x240
[ 1907.231090]  worker_thread+0x720/0x1300
[ 1907.232145]  ? __pfx_worker_thread+0x10/0x10
[ 1907.233183]  kthread+0x2ec/0x3d0
[ 1907.234144]  ? _raw_spin_unlock_irq+0x24/0x50
[ 1907.235199]  ? __pfx_kthread+0x10/0x10
[ 1907.236265]  ret_from_fork+0x30/0x70
[ 1907.237223]  ? __pfx_kthread+0x10/0x10
[ 1907.238177]  ret_from_fork_asm+0x1a/0x30
[ 1907.239126]  </TASK>
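
To spell out why lockdep complains here: chain #0 above shows the reset path (nvme_reset_work -> nvme_sync_queues) flushing q->timeout_work while holding ctrl->namespaces_rwsem, and chains #1/#2 show the timeout work itself (nvme_timeout -> nvme_dev_disable) taking dev->shutdown_lock and then the same ctrl->namespaces_rwsem. Waiting for a work item to complete while holding a lock that the work item may also need closes a wait cycle. Below is a minimal userspace sketch of that pattern; it is not the actual nvme code, the names are hypothetical, and the kernel rwsem is simplified to a plain pthread mutex so that the cycle deterministically deadlocks when run.

/*
 * Minimal sketch (hypothetical names, not the nvme code paths) of the
 * wait cycle in the lockdep report above: a "reset" thread holds
 * namespaces_lock and waits for the "timeout work" to finish, while
 * the timeout work needs shutdown_lock and then namespaces_lock.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t namespaces_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for ctrl->namespaces_rwsem */
static pthread_mutex_t shutdown_lock   = PTHREAD_MUTEX_INITIALIZER; /* stand-in for dev->shutdown_lock */

/* plays the role of q->timeout_work: nvme_timeout() -> nvme_dev_disable() */
static void *timeout_work(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&shutdown_lock);
	pthread_mutex_lock(&namespaces_lock);	/* blocks: the reset thread holds it */
	pthread_mutex_unlock(&namespaces_lock);
	pthread_mutex_unlock(&shutdown_lock);
	return NULL;
}

int main(void)
{
	pthread_t work;

	/* plays the role of nvme_sync_queues(): take the lock ... */
	pthread_mutex_lock(&namespaces_lock);

	/* the timeout work has already been queued and starts running */
	pthread_create(&work, NULL, timeout_work, NULL);

	/* ... then wait for the work to complete (the __flush_work() step
	 * in chain #0).  The work needs namespaces_lock, which this thread
	 * holds, so the join never returns. */
	pthread_join(work, NULL);
	pthread_mutex_unlock(&namespaces_lock);

	puts("not reached");
	return 0;
}

Built with "gcc -pthread", this hangs at pthread_join(), the userspace analogue of __flush_work() in chain #0; the kernel report describes the same cycle expressed through workqueue completion tracking.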