On 27/12/2024 19:25, Zhu Yanjun wrote:
>
>
> On 27.12.24 10:23, Zhu Yanjun wrote:
>> On 27.12.24 06:37, Zhijian Li (Fujitsu) wrote:
>>> Hi, Shin'ichiro,
>>>
>>> Your attached kconfig+this rnbd test triggered another BUG.
>>>
>>> Cced: RDMA
>>>
>>> Is this a known issue in RDMA/RXE communities?
>>
>> From my side, it should not be a known issue. It seems to be related to this rnbd test.
>>
>> In previous tests, this problem did not appear.
>>
>> I am not sure whether others have also hit this problem. For me, this is the first time I have seen it.
>
> FYI, Zhijian
>
> If I remember correctly, this problem does not occur in previous kernel versions. Thus, it is very possible that this problem was introduced in recent kernel versions.
>
> Because you have a test scenario that can reproduce this problem, "git bisect" is a powerful tool that can help you find the root cause.

Thank you for your advice, I will take a look later.

Thanks
Zhijian

>
> Good luck.
>
> Zhu Yanjun
>
>>
>> Zhu Yanjun
>>
>>>
>>>
>>> On 26/12/2024 21:17, Shinichiro Kawasaki wrote:
>>>> On Dec 25, 2024 / 17:37, Li Zhijian wrote:
>>>>> Hi, Shin'ichiro
>>>>>
>>>>> All your comments have been addressed except the success ratio one. Could
>>>>> you help to check this patch ([NOT-FOR-MERGE] just for testing) that can tell
>>>>> where it fails in your environment.
>>>>>
>>>>> I tested it today in my QEMU environment; it succeeds almost 100% of the time.
>>>>
>>>> Thanks for this effort. I ran rnbd/001 with this series in my QEMU environment.
>>>> It looks like it is still failing. Please find the 001.out.bad file generated [X]. The
>>>> kernel was v6.13-rc4 with the fix patch "RDMA/ulp: Add missing deinit() call".
>>>>
>>>> I wonder what the difference is between your environment and mine. FYI, my QEMU
>>>> environment has 4 CPUs and 16GB DRAM. It runs Fedora 40. I also attach the
>>>> kernel config I used just in case you are interested.
>>>
>>>
>>> Due to this bug, I cannot finish rnbd/001 at all.
>>>
>>> However, I can reproduce your log by adding `_start_rnbd_client` before the iteration.
>>> And it can be fixed by calling `_stop_rnbd_client` regardless of whether `_start_rnbd_client`
>>> succeeds or not (Please feel free to give it a try when you have the opportunity).
>>>
>>> diff --git a/tests/rnbd/001 b/tests/rnbd/001
>>> index 9c6d56e3ee98..321c4c010e78 100755
>>> --- a/tests/rnbd/001
>>> +++ b/tests/rnbd/001
>>> @@ -26,6 +26,7 @@ test_start_stop()
>>>          local loop_dev i j=0
>>>          loop_dev="$(losetup -f)"
>>> +        _start_rnbd_client # this makes the _start_rnbd_client in below iteration fails
>>>          for ((i=0;i<100;i++))
>>>          do
>>>                  if _start_rnbd_client "${loop_dev}" &>/dev/null; then
>>> @@ -33,6 +34,7 @@ test_start_stop()
>>>                          _stop_rnbd_client &>/dev/null && echo 'disconnect ok' || echo 'disconnect not ok'
>>>                          ((j++))
>>>                  else
>>> +                        _stop_rnbd_client # always stop rnbd so that we can connect again.
>>>                          echo 'connect not ok'
>>>                  fi
>>>          done
>>>
>>> ===========================
>>>
>>> [ 27.864420] run blktests rnbd/001 at 2024-12-27 13:21:37
>>> [ 27.888742] infiniband eth0_rxe: set active
>>> [ 27.889497] infiniband eth0_rxe: added eth0
>>> [ 27.910304] rnbd_client L599: Mapping device /dev/loop0 on session blktest, (access_mode: rw, nr_poll_queues: 0)
>>> [ 27.924065] rnbd_client L1190: [session=blktest] mapped 4/4 default/read queues.
>>> [ 27.925825] rnbd_server L782: </dev/loop0@blktest>: Opened device 'loop0'
>>> [ 27.927554] rnbd_client L1612: </dev/loop0@blktest> map_device: Device mapped as rnbd0 (nsectors: 0, logical_block_size: 512, physical_block_size: 512, max_write_zeroes_sectors: 0, max_discard_sectors: 0, discard_granularity: 512, discard_alignment: 0, secure_discard: 0, max_segments: 128, max_hw_sectors: 248, wc: 0, fua: 0)
>>> [ 27.938295] rnbd_client L323: </dev/loop0@blktest> Unmapping device, option: normal.
>>> [ 27.962570] rnbd_server L238: </dev/loop0@blktest>: Device closed >>> [ 27.967500] BUG: kernel NULL pointer dereference, address: 0000000000000000 >>> [ 27.967500] BUG: kernel NULL pointer dereference, address: 0000000000000000 13:21:38 [11/9189] >>> [ 27.976554] #PF: supervisor read access in kernel mode >>> [ 27.984926] #PF: error_code(0x0000) - not-present page >>> [ 27.989126] PGD 0 P4D 0 >>> [ 27.991067] Oops: Oops: 0000 [#1] PREEMPT SMP PTI >>> [ 27.993226] CPU: 3 UID: 0 PID: 304 Comm: kworker/u20:2 Not tainted 6.13.0-rc3+ #1 >>> [ 27.996697] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 >>> [ 27.999333] Workqueue: rxe_wq do_work [rdma_rxe] >>> [ 28.000309] RIP: 0010:memcpy_orig+0xd5/0x140 >>> [ 28.001304] Code: 16 f8 4c 89 07 4c 89 4f 08 4c 89 54 17 f0 4c 89 5c 17 f8 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 83 fa 08 72 1b <4c> 8b 06 4c 8b 4c 16 f8 4c 89 07 4c 89 4c 17 f8 c3 cc cc cc cc 66 >>> [ 28.004932] RSP: 0018:ffffb934c0643cc0 EFLAGS: 00010246 >>> [ 28.005845] RAX: ffff976bc1e12d5a RBX: 0000000000000000 RCX: 0000000000000000 >>> [ 28.007090] RDX: 0000000000000008 RSI: 0000000000000000 RDI: ffff976bc1e12d5a >>> [ 28.008380] RBP: ffff976bc1e12d5a R08: 0000000000000001 R09: 0000000000000001 >>> [ 28.009639] R10: 0000000000000005 R11: 0000000000000000 R12: 0000000080000000 >>> [ 28.010836] R13: 0000000000000008 R14: 0000000000000008 R15: 0000000000000008 >>> [ 28.011948] FS: 0000000000000000(0000) GS:ffff976f2fd80000(0000) knlGS:0000000000000000 >>> [ 28.013335] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [ 28.014275] CR2: 0000000000000000 CR3: 00000001837da002 CR4: 00000000001706f0 >>> [ 28.015424] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>> [ 28.016598] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>> [ 28.017728] Call Trace: >>> [ 28.018114] <TASK> >>> [ 28.018453] ? __die_body.cold+0x19/0x27 >>> [ 28.019167] ? 
page_fault_oops+0x15a/0x2d0 >>> [ 28.019861] ? search_module_extables+0x19/0x60 >>> [ 28.020617] ? search_bpf_extables+0x5f/0x80 >>> [ 28.021611] ? exc_page_fault+0x7e/0x180 >>> [ 28.022488] ? asm_exc_page_fault+0x26/0x30 >>> [ 28.023547] ? memcpy_orig+0xd5/0x140 >>> [ 28.024396] rxe_mr_copy+0x1c3/0x200 [rdma_rxe] >>> [ 28.025476] ? rxe_pool_get_index+0x4b/0x80 [rdma_rxe] >>> [ 28.026612] copy_data+0xa5/0x230 [rdma_rxe] >>> [ 28.027611] rxe_requester+0xd9b/0xf70 [rdma_rxe] >>> [ 28.028727] ? finish_task_switch.isra.0+0x99/0x2e0 >>> [ 28.029878] rxe_sender+0x13/0x40 [rdma_rxe] >>> [ 28.030920] do_task+0x68/0x1e0 [rdma_rxe] >>> [ 28.031893] process_one_work+0x177/0x330 >>> [ 28.032854] worker_thread+0x252/0x390 >>> [ 28.033748] ? __pfx_worker_thread+0x10/0x10 >>> [ 28.034665] kthread+0xd2/0x100 >>> [ 28.035382] ? __pfx_kthread+0x10/0x10 >>> [ 28.036252] ret_from_fork+0x34/0x50 >>> [ 28.037220] ? __pfx_kthread+0x10/0x10 >>> [ 28.038072] ret_from_fork_asm+0x1a/0x30 >>> [ 28.038991] </TASK> >>> [ 28.039543] Modules linked in: loop rnbd_client rtrs_client rnbd_server rtrs_server rtrs_core rdma_cm iw_cm ib_cm rdma_rxe ib_uverbs ib_core ip6_udp_tunnel udp_tunnel rfkill intel_rapl_msr intel_rapl_common kmem rapl cxl_mem iTC >>> O_wdt intel_pmc_bxt cxl_pmem dax_hmem iTCO_vendor_support device_dax cxl_acpi cxl_pci cxl_port joydev qxl cxl_core pcspkr drm_ttm_helper lpc_ich ttm i2c_i801 virtio_balloon i2c_smbus nd_pmem nd_btt dax_pmem einj ip_tables crct10dif >>> _pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtiofs fuse virtio_net nfit virtio_console net_failover libnvdimm serio_raw virtio_blk failover qemu_fw_cf >>> g dm_multipath sunrpc >>> [ 28.051034] CR2: 0000000000000000 >>> [ 28.052072] ---[ end trace 0000000000000000 ]--- >>> [ 28.053099] RIP: 0010:memcpy_orig+0xd5/0x140 >>> [ 28.054188] Code: 16 f8 4c 89 07 4c 89 4f 08 4c 89 54 17 f0 4c 89 5c 17 f8 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 
00 00 66 90 83 fa 08 72 1b <4c> 8b 06 4c 8b 4c 16 f8 4c 89 07 4c 89 4c 17 f8 c3 cc cc cc cc 66 >>> [ 28.058290] RSP: 0018:ffffb934c0643cc0 EFLAGS: 00010246 >>> [ 28.059514] RAX: ffff976bc1e12d5a RBX: 0000000000000000 RCX: 0000000000000000 >>> [ 28.061194] RDX: 0000000000000008 RSI: 0000000000000000 RDI: ffff976bc1e12d5a >>> [ 28.062588] RBP: ffff976bc1e12d5a R08: 0000000000000001 R09: 0000000000000001 >>> >>> >>> >>> >>>> >>>> >>>> [X] >>>> >>>> 001.out.bad >>>> ---------------------------------------------------------------------------- >>>> Running rnbd/001 >>>> connect ok >>>> disconnect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect 
not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> connect not ok >>>> Failed: 1/100 >>>> Test complete >> >
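
The workaround proposed in this thread (calling `_stop_rnbd_client` even when `_start_rnbd_client` fails) can be illustrated with a minimal shell sketch. It does not touch rnbd or RXE at all: `fake_start`, `fake_stop`, `run_loop`, and the `STALE` flag are hypothetical stand-ins, and the only behaviour modelled is the assumption that a leftover session makes every later connect fail until it is torn down.

```shell
#!/bin/sh
# Hypothetical stand-ins for the blktests helpers. The real
# _start_rnbd_client/_stop_rnbd_client are not used or modelled in detail.
STALE=0   # 1 means a previous session was left behind

fake_start() {
    # Fails while a stale session exists; otherwise creates a session.
    [ "$STALE" -eq 0 ] || return 1
    STALE=1
}

fake_stop() {
    # Tears down whatever session exists; harmless when there is none.
    STALE=0
}

run_loop() {
    # $1 = "cleanup" to also stop the client on the failure path.
    ok=0
    i=0
    while [ "$i" -lt 5 ]; do
        if fake_start; then
            fake_stop
            ok=$((ok + 1))
        elif [ "$1" = "cleanup" ]; then
            fake_stop
        fi
        i=$((i + 1))
    done
    echo "$ok"
}

# Start both runs with a stale session, as the extra _start_rnbd_client
# call before the loop does in the diff quoted in this thread.
STALE=1; without_cleanup=$(run_loop none)
STALE=1; with_cleanup=$(run_loop cleanup)
echo "without cleanup: $without_cleanup/5, with cleanup: $with_cleanup/5"
```

In this toy model the loop without cleanup never recovers, while the loop with cleanup loses only the first iteration. Whether the real failure has the same cause is not established by the sketch.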