[bisected] RAID1 direct IO redirecting sector loop since 6.0

Dmitrii Tcvetkov <me@xxxxxxxxx> · Tue, 1 Nov 2022 00:15:58 +0300

Hello,

I have LVM on mdraid RAID1 array of 2 HDDs, LVs are raw disk images for                        
QEMU 7.1.0 guests. Complete storage stack is: lvm(mdadm(luks(gpt(SATA HDD)))). LUKS2 is        
deliberately with 4096 sectors despite underlying drive having 512 byte sectors. Problem      
reproduces on kernels 6.0+ including 6.1-rc3.                                                  

Host kernel:                                                                                   
Linux version 6.1-rc3 (portage@xxxxxxxxxxxxxx) (gcc (Gentoo 11.3.0 p4)                         
11.3.0, GNU ld (Gentoo 2.38 p4) 2.38) #1 SMP                                                   

#lsblk -t -a                                                                                   
NAME                 ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME                                                                                                      
...                                                                                            
sdd                          0    512      0     512     512    1 bfq        64 128    0B
└─sdd2                       0    512      0     512     512    1 bfq        64 128    0B
  └─raid1                    0   4096      0    4096    4096    1           128 128    0B
    └─md0                    0   4096      0    4096    4096    1           128 128    0B
      ├─lvmraid-zfs          0   4096      0    4096    4096    1           128 128    0B
      ├─lvmraid-obsd         0   4096      0    4096    4096    1           128 128    0B
      └─lvmraid-wrk          0   4096      0    4096    4096    1           128 128    0B
sde                          0    512      0     512     512    1 bfq        64 128    0B
└─sde2                       0    512      0     512     512    1 bfq        64 128    0B
  └─raid2                    0   4096      0    4096    4096    1           128 128    0B
    └─md0                    0   4096      0    4096    4096    1           128 128    0B
      ├─lvmraid-zfs          0   4096      0    4096    4096    1           128 128    0B
      ├─lvmraid-obsd         0   4096      0    4096    4096    1           128 128    0B
      └─lvmraid-wrk          0   4096      0    4096    4096    1           128 128    0B

sdd2 and sde2 are GPT partitions, raid1 and raid2 are LUKS2 devices, command:
# cryptsetup luksFormat --sector-size 4096 /dev/sdd2
# cryptsetup luksFormat --sector-size 4096 /dev/sde2
md0 command:
# mdadm --create --level=1 --metadata=1.2 --raid-devices=2 /dev/md0 /dev/mapper/raid1 /dev/mapper/raid2
LV command:
# pvcreate /dev/md0
# vgcreate lvmraid /de/vmd0
# lvcreate -n wrk -L 20G lvmraid

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md0 : active raid1 dm-6[1] dm-5[2]
      685644544 blocks super 1.2 [2/2] [UU]
      bitmap: 0/6 pages [0KB], 65536KB chunk

All guests are started like this:
# cat run.sh
DISK="/dev/lvmraid/wrk"

qemu-system-x86_64 -enable-kvm -smp 4 -m 8G -cpu host -machine q35                \
-drive file=$DISK,format=raw,if=none,id=hdd,cache=none,aio=io_uring               \
-device virtio-scsi-pci,id=scsi                                                   \
-device scsi-hd,drive=hdd,logical_block_size=4096,physical_block_size=4096        \
-nodefaults                                                                       \
-object rng-random,filename=/dev/urandom,id=rng0                                  \
-device virtio-rng-pci,rng=rng0                                                   \
-vga qxl -spice port=5090,disable-ticketing=on -machine usb=on -device usb-tablet \
-netdev bridge,br=br0,id=pub -device virtio-net-pci,netdev=pub

Each guest uses it's own LV in lvmraid and own spice port.

One guest at a time works fine on all kernels. After upgrading KVM host kernel
to 6.0 starting second guest using LV from the same VG results in second
guest sleeping in D state seemingly indefinetely.

These messages appear in dmesg in a loop:
[  852.721444] md/raid1:md0: redirecting sector 62916608 to other
mirror: dm-5
[  852.721519] md/raid1:md0: dm-5: rescheduling sector 62916608 
[  852.721608] md/raid1:md0: redirecting sector 62916608 to other
mirror: dm-5
[  852.721682] md/raid1:md0: dm-5: rescheduling sector 62916608
[  852.721769] md/raid1:md0: redirecting sector 62916608 to other
mirror: dm-5
[  852.721846] md/raid1:md0: dm-5: rescheduling sector 62916608  

The only workaround I found to resolve this at runtime is:
# mdadm --fail /dev/md0 /dev/dm-5
# mdadm --re-add /dev/md0 /dev/dm-5

dmesg after that is as it should:
[ 6094.171989] md/raid1:md0: redirecting sector 62916608 to other mirror: dm-5
[ 6094.172058] md/raid1:md0: dm-5: rescheduling sector 62916608
[ 6094.172139] md/raid1:md0: redirecting sector 62916608 to other mirror: dm-5
[ 6094.172207] md/raid1:md0: dm-5: rescheduling sector 62916608
[ 6095.357995] md/raid1:md0: Disk failure on dm-5, disabling device.
               md/raid1:md0: Operation continuing on 1 devices.
[ 6101.321556] md: recovery of RAID array md0
[ 6101.321732] md: md0: recovery done.

The reported sector and sectors around it are always readable on both devices:
# dd if=/dev/dm-5 iflag=direct bs=4096 skip=62916606 count=4 | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00004000
4+0 records in
4+0 records out
16384 bytes (16 kB, 16 KiB) copied, 0.00080541 s, 20.3 MB/s

# dd if=/dev/dm-6 iflag=direct bs=4096 skip=62916606 count=4 | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00004000
4+0 records in
4+0 records out
16384 bytes (16 kB, 16 KiB) copied, 0.0118922 s, 1.4 MB/s

Kernel doesn't kick any members of md0 on it's own.

Failing dm-5 or dm-6 doesn't change the result, as far as I understand
removing redundancy allows kernel to break from the loop.

After that guest works fine. However after guest does 
graceful shutdown then the problem reproduces again and can be
remedied with mdadm --fail again. In other words it happens during guest
creation and guest teardown.

fragment of strace, the last pread64 call hangs:
[pid  7906] openat(AT_FDCWD, "/dev/lvmraid/wrk", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 13
[pid  7906] newfstatat(13, "", {st_mode=S_IFBLK|0660, st_rdev=makedev(0xfd, 0x9), ...}, AT_EMPTY_PATH) = 0
[pid  7906] ioctl(13, CDROM_DRIVE_STATUS, 0x7fffffff) = -1 ENOTTY (Inappropriate ioctl for device)
[pid  7906] close(13)                   = 0
[pid  7906] newfstatat(AT_FDCWD, "/dev/lvmraid/wrk", {st_mode=S_IFBLK|0660, st_rdev=makedev(0xfd, 0x9), ...}, 0) = 0
[pid  7906] openat(AT_FDCWD, "/dev/lvmraid/wrk", O_RDONLY|O_DIRECT|O_CLOEXEC) = 13
[pid  7906] io_uring_setup(128, {flags=0, sq_thread_cpu=0, sq_thread_idle=0, sq_entries=128, cq_entries=256, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORI
NG_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS|I
ORING_FEAT_CQE_SKIP|IORING_FEAT_LINKED_FILE, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=4416}, cq_off={head=128, tail=192, ring_mask=260, ring_ent
ries=268, overflow=284, cqes=320, flags=280}}) = 14
[pid  7906] mmap(NULL, 4928, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 14, 0) = 0x7958eca02000
[pid  7906] mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 14, 0x10000000) = 0x7958ec9fa000
[pid  7906] io_uring_register(14, IORING_REGISTER_RING_FDS, [{offset=-1, data=14}], 1) = 1
[pid  7906] write(6, "\1\0\0\0\0\0\0\0", 8) = 8 
[pid  7906] newfstatat(13, "", {st_mode=S_IFBLK|0660, st_rdev=makedev(0xfd, 0x9), ...}, AT_EMPTY_PATH) = 0
[pid  7906] ioctl(13, BLKDISCARDZEROES, [0]) = 0
[pid  7906] fstatfs(13, {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=2560, f_bfree=2560, f_bavail=2560, f_files=4016118, f_ffree=4015380, f_fsid={val=[0x73a66ace, 0xdefebdad]}, f_namelen=255, 
f_frsize=4096, f_flags=ST_VALID|ST_NOSUID|ST_RELATIME}) = 0
[pid  7906] newfstatat(AT_FDCWD, "/dev/lvmraid/wrk", {st_mode=S_IFBLK|0660, st_rdev=makedev(0xfd, 0x9), ...}, 0) = 0
[pid  7906] lseek(13, 0, SEEK_END)      = 21474836480
[pid  7906] fstatfs(13, {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=2560, f_bfree=2560, f_bavail=2560, f_files=4016118, f_ffree=4015380, f_fsid={val=[0x73a66ace, 0xdefebdad]}, f_namelen=255, 
f_frsize=4096, f_flags=ST_VALID|ST_NOSUID|ST_RELATIME}) = 0
[pid  7906] ioctl(13, BLKSSZGET, [4096]) = 0
[pid  7906] ioctl(13, _IOC(_IOC_READ, 0x58, 0x1e, 0xc), 0x7fff464e32c8) = -1 ENOTTY (Inappropriate ioctl for device)
[pid  7906] pread64(13, 0x627c34c8d001, 4096, 0) = -1 EINVAL (Invalid argument)
[pid  7906] pread64(13,

# cat /proc/7906/stack
[<0>] submit_bio_wait+0xdb/0x140
[<0>] blkdev_direct_IO+0x62f/0x770
[<0>] blkdev_read_iter+0xc1/0x140
[<0>] vfs_read+0x34e/0x3c0
[<0>] __x64_sys_pread64+0x74/0xc0
[<0>] do_syscall_64+0x6a/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0xb5

After "mdadm --fail" invocation the last line becomes:
[pid  7906] pread64(13, 0x627c34c8d200, 4096, 0) = -1 EIO (Input/output error)

fragment of guest teardown strace:
[pid  8986] openat(AT_FDCWD, "/dev/lvmraid/wrk", O_RDONLY|O_DIRECT|O_CLOEXEC) = 11
[pid  8986] ioctl(11, BLKSSZGET, [4096]) = 0
[pid  8986] ioctl(11, _IOC(_IOC_READ, 0x58, 0x1e, 0xc), 0x7ffd776f4da8) = -1 ENOTTY (Inappropriate ioctl for device)
[pid  8986] pread64(11, 0x62f8d5600001, 4096, 0) = -1 EINVAL (Invalid argument)
[pid  8986] pread64(11,

After "mdadm --fail" last line becomes:
[pid  8986] pread64(11, 0x62f8d5600200, 4096, 0) = -1 EIO (Input/output error)

Stack trace in /proc/8986/stack is identical to stack trace from above.
As I understand EINVAL from pread64 here is expected as 13 fd is opened with O_DIRECT.
However it's strange that kernel loops seemingly indefinetely on second pread64 until
"mdadm --fail".

Kernels 5.19 and earlier start multiple guests successfully
without the problem. On 6.0+ only first guest starts without problem,
subsequent guests end up in D state until "mdadm --fail".

What other info I can provide? I can apply debug or fixing patches.

Bisect led me to commit b1a000d3b8ec5  ("block: relax direct io memory
alignment").

Bisect log:
git bisect start
# status: waiting for both good and bad commits 
# bad: [4fe89d07dcc2804c8b562f6c7896a45643d34b2f] Linux 6.0
git bisect bad 4fe89d07dcc2804c8b562f6c7896a45643d34b2f
# status: waiting for good commit(s), bad commit known
# good: [3d7cb6b04c3f3115719235cc6866b10326de34cd] Linux 5.19
git bisect good 3d7cb6b04c3f3115719235cc6866b10326de34cd
# bad: [78acd4ca433425e6dd4032cfc2156c60e34931f2] usb: cdns3: Don't use priv_dev uninitialized in cdns3_gadget_ep_enable()
git bisect bad 78acd4ca433425e6dd4032cfc2156c60e34931f2
# bad: [526942b8134cc34d25d27f95dfff98b8ce2f6fcd] Merge tag 'ata-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
git bisect bad 526942b8134cc34d25d27f95dfff98b8ce2f6fcd
# good: [2e7a95156d64667a8ded606829d57c6fc92e41df] Merge tag 'regmap-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
git bisect good 2e7a95156d64667a8ded606829d57c6fc92e41df
# bad: [c013d0af81f60cc7dbe357c4e2a925fb6738dbfe] Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block
git bisect bad c013d0af81f60cc7dbe357c4e2a925fb6738dbfe
# good: [efb2883060afc79638bb1eb19e2c30e7f6c5a178] Merge branch 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
git bisect good efb2883060afc79638bb1eb19e2c30e7f6c5a178
# good: [cb309ae49da7a7c28f0051deea13970291134fac] io_uring/net: improve io_get_notif_slot types
git bisect good cb309ae49da7a7c28f0051deea13970291134fac
# bad: [22c80aac882f712897b88b7ea8f5a74ea19019df] blktrace: Trace remapped requests correctly
git bisect bad 22c80aac882f712897b88b7ea8f5a74ea19019df
# bad: [22d0c4080fe49299640d9d6c43154c49794c2825] block: simplify disk_set_independent_access_ranges
git bisect bad 22d0c4080fe49299640d9d6c43154c49794c2825
# bad: [3c8f9da41ed90294d8ca42b3ad8a13c5379bd549] blk-mq: Don't disable preemption around __blk_mq_run_hw_queue().
git bisect bad 3c8f9da41ed90294d8ca42b3ad8a13c5379bd549
# bad: [798f2a6f734de87633351c3ab13b17b07397cf68] block: Directly use ida_alloc()/free()
git bisect bad 798f2a6f734de87633351c3ab13b17b07397cf68
# good: [67927d22015060967122facc8cfeaad8012e8808] block/merge: count bytes instead of sectors
git bisect good 67927d22015060967122facc8cfeaad8012e8808
# good: [5debd9691c3ac64c3acd6867c264ad38bbe48cdc] block: introduce bdev_iter_is_aligned helper 
git bisect good 5debd9691c3ac64c3acd6867c264ad38bbe48cdc
# bad: [bf8d08532bc19a14cfb54ae61099dccadefca446] iomap: add support for dma aligned direct-io
git bisect bad bf8d08532bc19a14cfb54ae61099dccadefca446
# bad: [b1a000d3b8ec582da64bb644be633e5a0beffcbf] block: relax direct io memory alignment
git bisect bad b1a000d3b8ec582da64bb644be633e5a0beffcbf
# first bad commit: [b1a000d3b8ec582da64bb644be633e5a0beffcbf] block: relax direct io memory alignment