Hi,

I am running fio over an RDMA block device. The server side of this mapping is an md-raid0 device created over 3 md-raid5 devices. The md-raid5 devices are each created over 8 block devices. Below is how the raid configuration looks (md400, md300, md301 and md302 are the ones relevant to this discussion).

$ cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md400 : active raid0 md300[0] md302[2] md301[1]
      19688371968 blocks super 1.2 128k chunks

md302 : active raid5 sds[0] sdz[7] sdy[6] sdx[5] sdw[4] sdv[3] sdu[2] sdt[1]
      6562922800 blocks super 1.2 level 5, 16k chunk, algorithm 2 [8/8] [UUUUUUUU]
      bitmap: 0/1 pages [0KB], 524288KB chunk

md301 : active raid5 sdk[0] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
      6562922800 blocks super 1.2 level 5, 16k chunk, algorithm 2 [8/8] [UUUUUUUU]
      bitmap: 0/1 pages [0KB], 524288KB chunk

md300 : active raid5 sda[0] sdh[7] sdg[6] sdf[5] sde[4] sdd[3] sdc[2] sdb[1]
      6562922800 blocks super 1.2 level 5, 16k chunk, algorithm 2 [8/8] [UUUUUUUU]
      bitmap: 0/1 pages [0KB], 524288KB chunk

md126 : active raid1 sdi3[0] sdj3[1]
      117096448 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md127 : active raid1 sdi2[0] sdj2[1]
      117096448 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

The RDMA mapping is done through the RNBD/RTRS kernel modules: RTRS provides the RDMA transport, and RNBD provides the block device layer on top of it.
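
For reference, here is a rough sketch of the stripe arithmetic implied by the mdstat output above. It is only an illustration of the geometry, not an analysis of the hang; the variable names are made up for this example, and the block sizes are copied from the bssplit line of the fio profile in this mail.

```python
KIB = 1024


def raid5_full_stripe(n_devices, chunk):
    """Data bytes in one full RAID5 stripe: one chunk per device, minus one for parity."""
    return (n_devices - 1) * chunk


# Geometry taken from /proc/mdstat (md300/md301/md302 and md400).
raid5_devices = 8
raid5_chunk = 16 * KIB
raid0_chunk = 128 * KIB

full_stripe = raid5_full_stripe(raid5_devices, raid5_chunk)

# Block sizes from the bssplit line of the fio profile.
fio_block_sizes = [512, 1 * KIB, 2 * KIB, 4 * KIB, 8 * KIB, 16 * KIB, 32 * KIB, 64 * KIB]

# A request smaller than the full stripe cannot cover a whole stripe.
partial = [bs for bs in fio_block_sizes if bs < full_stripe]

print(f"raid5 full stripe (data): {full_stripe // KIB}k")
print(f"raid0 chunk is a multiple of it: {raid0_chunk % full_stripe == 0}")
print(f"fio sizes below a full stripe: {len(partial)} of {len(fio_block_sizes)}")
```

With these numbers the raid5 full data stripe is 112k, the 128k raid0 chunk is not a multiple of it, and every block size in the profile is smaller than a full stripe, so none of the writes in this workload can be a full-stripe write on the raid5 layer.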
The md400 device is mapped on the client side, and the fio profile I run is as follows:

$ cat fio_single.ini
[global]
description=Emulation of Storage Server Access Pattern
bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4
fadvise_hint=0
rw=randrw:2
direct=1
random_distribution=zipf:1.2
time_based=1
runtime=60
ramp_time=1
ioengine=libaio
iodepth=128
iodepth_batch_submit=128
iodepth_batch_complete_min=1
iodepth_batch_complete_max=128
numjobs=10
group_reporting

[job1]
filename=/dev/rnbd0
do_verify=1

The hang is easily reproducible; I hit it almost every time within the first 30 seconds of the fio run.

We see two different types of stack traces and lockups, both in dmesg and when we dump the stack for every CPU. I have shared both of them in separate files.

(PS: We have also done some tests on the v6.1 kernel, and we experience the same hang there. We have not tested any other kernel version.)

Regards
-Haris
Attachment: sysrq_l_6_11_2 (binary data)
Attachment: sysrq_l_6_11_1 (binary data)