Hi all,

I ran the latest blktests (git hash: c1da4a225273) with the v6.3 kernel and observed test case failures. I would like to ask for support in fixing them. (I myself am now working on some of the block/011 failure symptoms.)

List of failures
================

#1: block/011
#2: block/024
#3: nvme/003 (fabrics transport)
#4: nvme/030 or nvme/031 (rdma transport with siw)
#5: nvme/* (fc transport)

This list excludes failures already resolved: scsi/007 and nvme/{044,045}.

Failure description
===================

#1: block/011

This test case shows three failure symptoms.

Symptom A: The test case fails with NVMe devices due to the lockdep WARNING "possible circular locking dependency detected". Reported in Sep/2022 [1]; a solution was discussed there, and a fix is still awaited.

[1] https://lore.kernel.org/linux-block/20220930001943.zdbvolc3gkekfmcv@shindev/

Symptom B: Even when this test case passes for an NVMe device, the device is occasionally left with zero capacity, which makes the following test cases fail. The capacity change happens because the device returns ENODEV during the test. I'm preparing a fix patch for blktests [2] (see the BLKGETSIZE64 sketch after this item).

[2] https://github.com/kawasaki/blktests/commit/f78ceec0cb95b9b2a9dcbf44999e350b740722e8

Symptom C: When the system disk and the test device belong to the same PCI device, the system disk disappears during the test case run (e.g. on a system with two SSDs attached to the same AHCI controller or HBA). This results in a test system hang. I'm preparing patches for blktests to avoid this failure [3].

[3] https://github.com/kawasaki/blktests/commit/662c33048ba1e949348dc2fca8b4778f940b7cc5
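As a side note for Symptom B: the zero-capacity condition can be observed from userspace with the BLKGETSIZE64 ioctl. Below is a minimal C sketch of such a check; the device path and the program itself are illustrations only, and the actual blktests fix in [2] is implemented in shell.

/* Hypothetical illustration: report a block device's capacity via
 * BLKGETSIZE64. A capacity of 0 after block/011 indicates Symptom B.
 * Not the actual blktests patch. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/nvme0n1"; /* assumed path */
	uint64_t bytes;
	int fd = open(dev, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, BLKGETSIZE64, &bytes) < 0) {
		perror("ioctl(BLKGETSIZE64)");
		close(fd);
		return 1;
	}
	close(fd);
	printf("%s capacity: %llu bytes\n", dev, (unsigned long long)bytes);
	return bytes == 0; /* non-zero exit when the device lost its capacity */
}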
#2: block/024

This test case fails on slow machines. Reported in Dec/2022. An issue on the test case side is suspected; further investigation is still needed.

block/024 (do I/O faster than a jiffy and check iostats times) [failed]
    runtime  ...  4.347s
    --- tests/block/024.out	2022-12-06 20:51:41.525066605 +0900
    +++ /home/shin/kts/kernel-test-suite/sets/blktests/log/runlog/nodev/block/024.out.bad	2022-12-07 12:51:03.610924521 +0900
    @@ -6,5 +6,5 @@
     read 1 s
     write 1 s
     read 2 s
    -write 3 s
    +write 4 s
     Test complete
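For background, the "read N s" / "write N s" lines that block/024 checks come from the per-device I/O accounting exposed in /sys/block/<dev>/stat (read ticks and write ticks, in milliseconds, per Documentation/block/stat.rst). Below is a minimal C reader of those two fields, as an illustration of what the test compares; the device name is an assumption, and this is not the test itself.

/* Hypothetical illustration: print the read/write time counters from
 * /sys/block/<dev>/stat. Field 4 is read ticks (ms), field 8 is write
 * ticks (ms), per Documentation/block/stat.rst. */
#include <stdio.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "nvme0n1"; /* assumed device */
	char path[128];
	unsigned long long f[11];
	FILE *fp;

	snprintf(path, sizeof(path), "/sys/block/%s/stat", dev);
	fp = fopen(path, "r");
	if (!fp) {
		perror("fopen");
		return 1;
	}
	if (fscanf(fp, "%llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &f[0], &f[1], &f[2], &f[3], &f[4], &f[5], &f[6], &f[7],
		   &f[8], &f[9], &f[10]) != 11) {
		fprintf(stderr, "short read of %s\n", path);
		fclose(fp);
		return 1;
	}
	fclose(fp);
	/* Round down to whole seconds, as the test output does. */
	printf("read %llu s\n", f[3] / 1000);
	printf("write %llu s\n", f[7] / 1000);
	return 0;
}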
#3: nvme/003 (fabrics transport)

When the nvme test group is run with trtype=rdma, tcp, or fc, this test case fails due to the lockdep WARNING "possible circular locking dependency detected". The detailed kernel message is attached below [4].

#4: nvme/030 or nvme/031 (rdma transport with siw)

When the nvme test group is run with the trtype=rdma and use_siw=1 configurations, a failure is observed at nvme/030 or nvme/031 occasionally. The failure cause is "BUG: KASAN: slab-use-after-free in __mutex_lock". The detailed kernel message is attached below [5].

#5: nvme/* (fc transport)

With the trtype=fc configuration, many of the test cases in the nvme test group fail or hang. Daniel Wagner is driving the fix work [6].

[6] https://lore.kernel.org/linux-nvme/20230418130159.11075-1-dwagner@xxxxxxx/
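As a note on #3: the dependency chain in [4] below has the well-known shape of flushing a workqueue while holding a mutex that a work item on the same queue also takes (the cm handler flushes nvmet-wq under id_priv->handler_mutex, while release_work reaches rdma_destroy_id(), which takes handler_mutex). The following kernel-module sketch, with made-up demo_* names, reproduces that shape; it should produce the same kind of lockdep report without actually deadlocking, since the cycle is only recorded, never executed.

/* Hypothetical kernel-module sketch of the lock inversion in [4]:
 * a work item takes a mutex, and another path flushes the same
 * workqueue while holding that mutex. Lockdep then reports the
 * circular chain mutex --> (wq_completion) --> (work_completion)
 * --> mutex, just as in the nvmet-rdma splat. */
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(demo_mutex);         /* cf. id_priv->handler_mutex */
static struct workqueue_struct *demo_wq; /* cf. nvmet-wq */

static void demo_release_work(struct work_struct *w)
{
	/* cf. release_work -> rdma_destroy_id(): takes the mutex */
	mutex_lock(&demo_mutex);
	mutex_unlock(&demo_mutex);
}
static DECLARE_WORK(demo_work, demo_release_work);

static int __init demo_init(void)
{
	demo_wq = alloc_workqueue("demo-wq", 0, 0);
	if (!demo_wq)
		return -ENOMEM;

	/* Run the work once: lockdep records work_completion -> demo_mutex. */
	queue_work(demo_wq, &demo_work);
	flush_workqueue(demo_wq);

	/* Flush with the mutex held: lockdep records demo_mutex ->
	 * wq_completion, which closes the cycle and triggers the report.
	 * Nothing is queued at this point, so no real deadlock occurs. */
	mutex_lock(&demo_mutex);
	flush_workqueue(demo_wq);
	mutex_unlock(&demo_mutex);
	return 0;
}

static void __exit demo_exit(void)
{
	destroy_workqueue(demo_wq);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");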
[4]

[ 134.432456] ======================================================
[ 134.433289] WARNING: possible circular locking dependency detected
[ 134.433806] 6.3.0 #62 Not tainted
[ 134.434149] ------------------------------------------------------
[ 134.434709] kworker/1:3/455 is trying to acquire lock:
[ 134.435141] ffff88813bfd9420 (&id_priv->handler_mutex){+.+.}-{3:3}, at: rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 134.435986] but task is already holding lock:
[ 134.436554] ffff888131327db0 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x793/0x1350
[ 134.437459] which lock already depends on the new lock.
[ 134.438235] the existing dependency chain (in reverse order) is:
[ 134.438914] -> #3 ((work_completion)(&queue->release_work)){+.+.}-{0:0}:
[ 134.439688]        process_one_work+0x7dc/0x1350
[ 134.440084]        worker_thread+0xfc/0x1260
[ 134.440489]        kthread+0x29e/0x340
[ 134.440857]        ret_from_fork+0x2c/0x50
[ 134.441251] -> #2 ((wq_completion)nvmet-wq){+.+.}-{0:0}:
[ 134.441885]        __flush_workqueue+0x130/0x12f0
[ 134.442328]        nvmet_rdma_cm_handler+0x961/0x39c0 [nvmet_rdma]
[ 134.442900]        cma_cm_event_handler+0xb2/0x2f0 [rdma_cm]
[ 134.443412]        cma_ib_req_handler+0x1096/0x4a50 [rdma_cm]
[ 134.443925]        cm_process_work+0x46/0x3b0 [ib_cm]
[ 134.444399]        cm_req_handler+0x20e2/0x61d0 [ib_cm]
[ 134.444880]        cm_work_handler+0xc15/0x6ce0 [ib_cm]
[ 134.445365]        process_one_work+0x843/0x1350
[ 134.445794]        worker_thread+0xfc/0x1260
[ 134.446165]        kthread+0x29e/0x340
[ 134.447271]        ret_from_fork+0x2c/0x50
[ 134.448373] -> #1 (&id_priv->handler_mutex/1){+.+.}-{3:3}:
[ 134.450542]        __mutex_lock+0x186/0x18f0
[ 134.451682]        cma_ib_req_handler+0xc3c/0x4a50 [rdma_cm]
[ 134.452921]        cm_process_work+0x46/0x3b0 [ib_cm]
[ 134.454150]        cm_req_handler+0x20e2/0x61d0 [ib_cm]
[ 134.455423]        cm_work_handler+0xc15/0x6ce0 [ib_cm]
[ 134.456680]        process_one_work+0x843/0x1350
[ 134.457865]        worker_thread+0xfc/0x1260
[ 134.458985]        kthread+0x29e/0x340
[ 134.460097]        ret_from_fork+0x2c/0x50
[ 134.461294] -> #0 (&id_priv->handler_mutex){+.+.}-{3:3}:
[ 134.463349]        __lock_acquire+0x2fc0/0x60f0
[ 134.464430]        lock_acquire+0x1a7/0x4e0
[ 134.465448]        __mutex_lock+0x186/0x18f0
[ 134.466461]        rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 134.467569]        nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
[ 134.468735]        nvmet_rdma_release_queue_work+0x3e/0x90 [nvmet_rdma]
[ 134.469904]        process_one_work+0x843/0x1350
[ 134.470923]        worker_thread+0xfc/0x1260
[ 134.471918]        kthread+0x29e/0x340
[ 134.472902]        ret_from_fork+0x2c/0x50
[ 134.473884] other info that might help us debug this:
[ 134.476264] Chain exists of: &id_priv->handler_mutex --> (wq_completion)nvmet-wq --> (work_completion)(&queue->release_work)
[ 134.479201] Possible unsafe locking scenario:
[ 134.480949]        CPU0                    CPU1
[ 134.481941]        ----                    ----
[ 134.482871]   lock((work_completion)(&queue->release_work));
[ 134.483938]                                lock((wq_completion)nvmet-wq);
[ 134.485094]                                lock((work_completion)(&queue->release_work));
[ 134.486391]   lock(&id_priv->handler_mutex);
[ 134.487384] *** DEADLOCK ***
[ 134.489712] 2 locks held by kworker/1:3/455:
[ 134.490728] #0: ffff888127b28538 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x766/0x1350
[ 134.492084] #1: ffff888131327db0 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x793/0x1350
[ 134.493538] stack backtrace:
[ 134.495147] CPU: 1 PID: 455 Comm: kworker/1:3 Not tainted 6.3.0 #62
[ 134.496262] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[ 134.497573] Workqueue: nvmet-wq nvmet_rdma_release_queue_work [nvmet_rdma]
[ 134.498769] Call Trace:
[ 134.499649]  <TASK>
[ 134.500494]  dump_stack_lvl+0x57/0x90
[ 134.501455]  check_noncircular+0x27b/0x310
[ 134.502444]  ? __pfx_mark_lock+0x10/0x10
[ 134.503413]  ? __pfx_check_noncircular+0x10/0x10
[ 134.504431]  ? find_held_lock+0x2d/0x110
[ 134.505399]  ? lockdep_lock+0xca/0x1c0
[ 134.506351]  ? __pfx_lockdep_lock+0x10/0x10
[ 134.507333]  ? mark_held_locks+0x9e/0xe0
[ 134.508299]  __lock_acquire+0x2fc0/0x60f0
[ 134.509272]  ? __pfx___wait_for_common+0x10/0x10
[ 134.510292]  ? __pfx___lock_acquire+0x10/0x10
[ 134.511296]  ? rxe_sched_task+0x50/0x70 [rdma_rxe]
[ 134.512328]  ? rxe_post_recv+0x1d9/0x250 [rdma_rxe]
[ 134.513366]  lock_acquire+0x1a7/0x4e0
[ 134.514314]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 134.515353]  ? __pfx_lock_acquire+0x10/0x10
[ 134.516345]  ? __pfx_ib_drain_qp_done+0x10/0x10 [ib_core]
[ 134.517447]  ? __pfx___might_resched+0x10/0x10
[ 134.518455]  __mutex_lock+0x186/0x18f0
[ 134.519405]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 134.520436]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 134.521463]  ? __pfx___mutex_lock+0x10/0x10
[ 134.522447]  ? rcu_is_watching+0x11/0xb0
[ 134.523408]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 134.524434]  rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 134.525451]  nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
[ 134.526528]  nvmet_rdma_release_queue_work+0x3e/0x90 [nvmet_rdma]
[ 134.527643]  process_one_work+0x843/0x1350
[ 134.528600]  ? __pfx_lock_acquire+0x10/0x10
[ 134.529562]  ? __pfx_process_one_work+0x10/0x10
[ 134.530551]  ? __pfx_do_raw_spin_lock+0x10/0x10
[ 134.531537]  worker_thread+0xfc/0x1260
[ 134.532471]  ? __kthread_parkme+0xc1/0x1f0
[ 134.533428]  ? __pfx_worker_thread+0x10/0x10
[ 134.534385]  kthread+0x29e/0x340
[ 134.535264]  ? __pfx_kthread+0x10/0x10
[ 134.536164]  ret_from_fork+0x2c/0x50
[ 134.537058]  </TASK>
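As a note on #4: the KASAN report in [5] below shows cma_iw_handler() taking id_priv->handler_mutex after the host-side delete-controller path has already freed the rdma_cm_id, i.e. the iw_cm work still reaches a freed object. A usual remedy for this class of race is to make the asynchronous handler hold its own reference on the object; the sketch below (made-up demo_* names, not the actual rdma_cm code) shows the kref pattern. It only illustrates the lifetime rule; in real code the handler must also obtain its reference in a way that is itself race-free.

/* Hypothetical sketch of the kref pattern that prevents the class of
 * use-after-free seen in [5]: the event handler pins the object with
 * its own reference, so a concurrent delete cannot free it underneath. */
#include <linux/kref.h>
#include <linux/mutex.h>
#include <linux/slab.h>

struct demo_id {
	struct kref ref;
	struct mutex handler_mutex;
};

static void demo_id_release(struct kref *ref)
{
	kfree(container_of(ref, struct demo_id, ref));
}

static struct demo_id *demo_id_create(void)
{
	struct demo_id *id = kzalloc(sizeof(*id), GFP_KERNEL);

	if (!id)
		return NULL;
	kref_init(&id->ref);            /* creation reference */
	mutex_init(&id->handler_mutex);
	return id;
}

/* Event-handler path (cf. cma_iw_handler): pin @id while using it. */
static void demo_handle_event(struct demo_id *id)
{
	kref_get(&id->ref);
	mutex_lock(&id->handler_mutex);
	/* ... handle the connection event ... */
	mutex_unlock(&id->handler_mutex);
	kref_put(&id->ref, demo_id_release);
}

/* Teardown path (cf. the delete-controller path): drops only the
 * creation reference; the object stays alive until handlers finish. */
static void demo_id_destroy(struct demo_id *id)
{
	kref_put(&id->ref, demo_id_release);
}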
[5]

May 01 11:19:44 testnode1 kernel: ==================================================================
May 01 11:19:44 testnode1 kernel: BUG: KASAN: slab-use-after-free in __mutex_lock+0x1324/0x18f0
May 01 11:19:44 testnode1 kernel: Read of size 8 at addr ffff888197b37418 by task kworker/u8:0/9
May 01 11:19:44 testnode1 kernel:
May 01 11:19:44 testnode1 kernel: CPU: 0 PID: 9 Comm: kworker/u8:0 Not tainted 6.3.0 #62
May 01 11:19:44 testnode1 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
May 01 11:19:44 testnode1 kernel: Workqueue: iw_cm_wq cm_work_handler [iw_cm]
May 01 11:19:44 testnode1 kernel: Call Trace:
May 01 11:19:44 testnode1 kernel:  <TASK>
May 01 11:19:44 testnode1 kernel:  dump_stack_lvl+0x57/0x90
May 01 11:19:44 testnode1 kernel:  print_report+0xcf/0x660
May 01 11:19:44 testnode1 kernel:  ? __mutex_lock+0x1324/0x18f0
May 01 11:19:44 testnode1 kernel:  kasan_report+0xa4/0xe0
May 01 11:19:44 testnode1 kernel:  ? __mutex_lock+0x1324/0x18f0
May 01 11:19:44 testnode1 kernel:  __mutex_lock+0x1324/0x18f0
May 01 11:19:44 testnode1 kernel:  ? cma_iw_handler+0xac/0x4f0 [rdma_cm]
May 01 11:19:44 testnode1 kernel:  ? _raw_spin_unlock_irqrestore+0x30/0x60
May 01 11:19:44 testnode1 kernel:  ? rcu_is_watching+0x11/0xb0
May 01 11:19:44 testnode1 kernel:  ? _raw_spin_unlock_irqrestore+0x30/0x60
May 01 11:19:44 testnode1 kernel:  ? trace_hardirqs_on+0x12/0x100
May 01 11:19:44 testnode1 kernel:  ? __pfx___mutex_lock+0x10/0x10
May 01 11:19:44 testnode1 kernel:  ? __percpu_counter_sum+0x147/0x1e0
May 01 11:19:44 testnode1 kernel:  ? domain_dirty_limits+0x246/0x390
May 01 11:19:44 testnode1 kernel:  ? wb_over_bg_thresh+0x4d5/0x610
May 01 11:19:44 testnode1 kernel:  ? rcu_is_watching+0x11/0xb0
May 01 11:19:44 testnode1 kernel:  ? cma_iw_handler+0xac/0x4f0 [rdma_cm]
May 01 11:19:44 testnode1 kernel:  cma_iw_handler+0xac/0x4f0 [rdma_cm]
May 01 11:19:44 testnode1 kernel:  ? rcu_is_watching+0x11/0xb0
May 01 11:19:44 testnode1 kernel:  ? __pfx_cma_iw_handler+0x10/0x10 [rdma_cm]
May 01 11:19:44 testnode1 kernel:  ? attach_entity_load_avg+0x4e2/0x920
May 01 11:19:44 testnode1 kernel:  ? _raw_spin_unlock_irqrestore+0x30/0x60
May 01 11:19:44 testnode1 kernel:  ? rcu_is_watching+0x11/0xb0
May 01 11:19:44 testnode1 kernel:  cm_work_handler+0x139e/0x1c50 [iw_cm]
May 01 11:19:44 testnode1 kernel:  ? __pfx_cm_work_handler+0x10/0x10 [iw_cm]
May 01 11:19:44 testnode1 kernel:  ? rcu_is_watching+0x11/0xb0
May 01 11:19:44 testnode1 kernel:  ? __pfx_try_to_wake_up+0x10/0x10
May 01 11:19:44 testnode1 kernel:  ? __pfx_do_raw_spin_lock+0x10/0x10
May 01 11:19:44 testnode1 kernel:  ? __pfx___might_resched+0x10/0x10
May 01 11:19:44 testnode1 kernel:  ? _raw_spin_unlock_irq+0x24/0x50
May 01 11:19:44 testnode1 kernel:  process_one_work+0x843/0x1350
May 01 11:19:44 testnode1 kernel:  ? __pfx_lock_acquire+0x10/0x10
May 01 11:19:44 testnode1 kernel:  ? __pfx_process_one_work+0x10/0x10
May 01 11:19:44 testnode1 kernel:  ? __pfx_do_raw_spin_lock+0x10/0x10
May 01 11:19:44 testnode1 kernel:  worker_thread+0xfc/0x1260
May 01 11:19:44 testnode1 kernel:  ? __pfx_worker_thread+0x10/0x10
May 01 11:19:44 testnode1 kernel:  kthread+0x29e/0x340
May 01 11:19:44 testnode1 kernel:  ? __pfx_kthread+0x10/0x10
May 01 11:19:44 testnode1 kernel:  ret_from_fork+0x2c/0x50
May 01 11:19:44 testnode1 kernel:  </TASK>
May 01 11:19:44 testnode1 kernel:
May 01 11:19:44 testnode1 kernel: Allocated by task 4225:
May 01 11:19:44 testnode1 kernel:  kasan_save_stack+0x2f/0x50
May 01 11:19:44 testnode1 kernel:  kasan_set_track+0x21/0x30
May 01 11:19:44 testnode1 kernel:  __kasan_kmalloc+0xa6/0xb0
May 01 11:19:44 testnode1 kernel:  __rdma_create_id+0x5b/0x5d0 [rdma_cm]
May 01 11:19:44 testnode1 kernel:  __rdma_create_kernel_id+0x12/0x40 [rdma_cm]
May 01 11:19:44 testnode1 kernel:  nvme_rdma_alloc_queue+0x26a/0x5f0 [nvme_rdma]
May 01 11:19:44 testnode1 kernel:  nvme_rdma_setup_ctrl+0xb84/0x1d90 [nvme_rdma]
May 01 11:19:44 testnode1 kernel:  nvme_rdma_create_ctrl+0x7b5/0xd20 [nvme_rdma]
May 01 11:19:44 testnode1 kernel:  nvmf_dev_write+0xddd/0x22b0 [nvme_fabrics]
May 01 11:19:44 testnode1 kernel:  vfs_write+0x211/0xd50
May 01 11:19:44 testnode1 kernel:  ksys_write+0x100/0x1e0
May 01 11:19:44 testnode1 kernel:  do_syscall_64+0x5b/0x80
May 01 11:19:44 testnode1 kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
May 01 11:19:44 testnode1 kernel:
May 01 11:19:44 testnode1 kernel: Freed by task 4227:
May 01 11:19:44 testnode1 kernel:  kasan_save_stack+0x2f/0x50
May 01 11:19:44 testnode1 kernel:  kasan_set_track+0x21/0x30
May 01 11:19:44 testnode1 kernel:  kasan_save_free_info+0x2a/0x50
May 01 11:19:44 testnode1 kernel:  ____kasan_slab_free+0x169/0x1c0
May 01 11:19:44 testnode1 kernel:  slab_free_freelist_hook+0xdb/0x1b0
May 01 11:19:44 testnode1 kernel:  __kmem_cache_free+0xb8/0x2e0
May 01 11:19:44 testnode1 kernel:  nvme_rdma_free_queue+0x4a/0x70 [nvme_rdma]
May 01 11:19:44 testnode1 kernel:  nvme_rdma_teardown_io_queues.part.0+0x14a/0x1e0 [nvme_rdma]
May 01 11:19:44 testnode1 kernel:  nvme_rdma_delete_ctrl+0x4f/0x100 [nvme_rdma]
May 01 11:19:44 testnode1 kernel:  nvme_do_delete_ctrl+0x14e/0x240 [nvme_core]
May 01 11:19:44 testnode1 kernel:  nvme_sysfs_delete+0xcb/0x100 [nvme_core]
May 01 11:19:44 testnode1 kernel:  kernfs_fop_write_iter+0x359/0x530
May 01 11:19:44 testnode1 kernel:  vfs_write+0x58f/0xd50
May 01 11:19:44 testnode1 kernel:  ksys_write+0x100/0x1e0
May 01 11:19:44 testnode1 kernel:  do_syscall_64+0x5b/0x80
May 01 11:19:44 testnode1 kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
May 01 11:19:44 testnode1 kernel:
May 01 11:19:44 testnode1 kernel: The buggy address belongs to the object at ffff888197b37000 which belongs to the cache kmalloc-2k of size 2048
May 01 11:19:44 testnode1 kernel: The buggy address is located 1048 bytes inside of freed 2048-byte region [ffff888197b37000, ffff888197b37800)
May 01 11:19:44 testnode1 kernel:
May 01 11:19:44 testnode1 kernel: The buggy address belongs to the physical page:
May 01 11:19:44 testnode1 kernel: page:00000000fbe33a6e refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x197b30
May 01 11:19:44 testnode1 kernel: head:00000000fbe33a6e order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
May 01 11:19:44 testnode1 kernel: anon flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
May 01 11:19:44 testnode1 kernel: raw: 0017ffffc0010200 ffff888100042f00 0000000000000000 dead000000000001
May 01 11:19:44 testnode1 kernel: raw: 0000000000000000 0000000000080008 00000001ffffffff 0000000000000000
May 01 11:19:44 testnode1 kernel: page dumped because: kasan: bad access detected
May 01 11:19:44 testnode1 kernel:
May 01 11:19:44 testnode1 kernel: Memory state around the buggy address:
May 01 11:19:44 testnode1 kernel:  ffff888197b37300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
May 01 11:19:44 testnode1 kernel:  ffff888197b37380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
May 01 11:19:44 testnode1 kernel: >ffff888197b37400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
May 01 11:19:44 testnode1 kernel:                             ^
May 01 11:19:44 testnode1 kernel:  ffff888197b37480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
May 01 11:19:44 testnode1 kernel:  ffff888197b37500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
May 01 11:19:44 testnode1 kernel: ==================================================================