On 7/11/23 00:50, Shinichiro Kawasaki wrote:
With kernel version v6.5-rc1, I observed an I/O error during a fio run on zoned
block devices. I bisected it and found that commit 0effb390c4ba ("block:
mq-deadline: Handle requeued requests correctly") is the trigger. When I revert
this commit from v6.5-rc1, the error disappears.
At first, the error was observed as a failure of test case #34 of the fio test
script for zoned block devices (t/zbd/test-zbd-support), using a QEMU ZNS
emulation device with a 4MB zone size. The failure was also observed with a
zoned null_blk device with a 4MB zone size and the memory backing option. The
error was observed with real ZNS drives with a 2GB zone size as well.
I simplified the fio test script and confirmed that the short script below [1]
reproduces the error using a null_blk device with a 4MB zone size and memory
backing.
The trigger commit modifies the order in which write requests are dispatched to
zones. To check the write requests dispatched to the null_blk device, I took a
blktrace [2]. It shows that a 1MB write to the first zone (sector 0) is split
into requests of 255 sectors each. One of the split write requests was
dispatched to the zone, but it was not a write at the zone start, and it caused
the I/O error. I think this I/O error is caused by an unaligned write command
error on the device. Later on, another write request at the zone start was
dispatched. So the write requests do not look well ordered.
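For reference, such a trace can be captured with a standard blktrace/blkparse
pipeline like the one below (the exact options are illustrative, not the ones
from my run). In the decoded output, Q is a queued request, X a split, G a
request allocation, I an insertion into the scheduler, D a dispatch to the
driver, and C a completion:

# Trace all block layer events on the null_blk device and decode them
# inline; run this in parallel with the fio script in [1].
blktrace -d /dev/nullb0 -o - | blkparse -i -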
I would appreciate help resolving this issue. If any actions on my test systems
would help, please let me know.
[1]
#!/bin/bash
dev=$1
realdev=$(readlink -f "$dev")
basename=$(basename "$realdev")
echo mq-deadline >"/sys/block/$basename/queue/scheduler"
blkzone reset "$dev"
fio --name=job --filename="${dev}" --ioengine=libaio --iodepth=256 \
	--rw=randwrite --bs=1M --offset=0 --size=16M \
	--zonemode=zbd --direct=1 --zonesize=4M
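For completeness, the zoned null_blk device can be created through configfs; a
minimal sketch is below (the device name nullb0 and the 1GB capacity are
arbitrary choices, while the zone size and memory backing match the test
conditions above):

#!/bin/bash
# Create a zoned, memory-backed null_blk device via configfs.
modprobe null_blk nr_devices=0
mkdir /sys/kernel/config/nullb/nullb0
cd /sys/kernel/config/nullb/nullb0
echo 1024 > size		# capacity in MB (arbitrary)
echo 1 > memory_backed
echo 1 > zoned
echo 4 > zone_size		# zone size in MB
echo 1 > power			# instantiate /dev/nullb0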
[2]
...
251,0 1 136 0.871020525 1300 Q WS 0 + 2048 [fio]
251,0 1 137 0.871025680 1300 X WS 0 / 255 [fio]
251,0 1 138 0.871027679 1300 G WS 0 + 255 [fio]
251,0 1 139 0.871028675 1300 I WS 0 + 255 [fio]
251,0 1 140 0.871038432 1300 X WS 255 / 510 [fio]
251,0 1 141 0.871040086 1300 G WS 255 + 255 [fio]
251,0 1 142 0.871040949 1300 I WS 255 + 255 [fio]
251,0 1 143 0.871050035 1300 X WS 510 / 765 [fio]
251,0 1 144 0.871051688 1300 G WS 510 + 255 [fio]
251,0 1 145 0.871052551 1300 I WS 510 + 255 [fio]
251,0 3 8 0.871054865 1115 C WS 24576 + 765 [0]
251,0 1 146 0.871061570 1300 X WS 765 / 1020 [fio]
251,0 1 147 0.871063327 1300 G WS 765 + 255 [fio]
251,0 1 148 0.871064204 1300 I WS 765 + 255 [fio]
251,0 1 149 0.871073358 1300 X WS 1020 / 1275 [fio]
251,0 1 150 0.871075004 1300 G WS 1020 + 255 [fio]
251,0 3 9 0.871075262 1115 D WS 510 + 255 [kworker/3:2H] ... Write not at zone start
251,0 1 151 0.871075921 1300 I WS 1020 + 255 [fio]
251,0 3 10 0.871077227 1115 C WS 0 + 765 [65531] ... I/O error
251,0 1 152 0.871085051 1300 X WS 1275 / 1530 [fio]
...
251,0 3 281 0.904191667 1115 D WS 0 + 255 [kworker/3:2H] ... Write at zone start comes after
251,0 3 282 0.904445591 1115 C WS 0 + 255 [0]
...
Thank you for the detailed report. Does this patch help?
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 6aa5daf7ae32..02a916ba62ee 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -176,7 +176,7 @@ static inline struct request *deadline_from_pos(struct dd_per_prio *per_prio,
 	 * zoned writes, start searching from the start of a zone.
 	 */
 	if (blk_rq_is_seq_zoned_write(rq))
-		pos -= round_down(pos, rq->q->limits.chunk_sectors);
+		pos = round_down(pos, rq->q->limits.chunk_sectors);
 
 	while (node) {
 		rq = rb_entry_rq(node);
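The one-character change matters because pos -= round_down(pos, chunk_sectors)
reduces pos to its offset within the zone, whereas the comment above this code
asks for the zone start, which is what pos = round_down(pos, chunk_sectors)
yields. A quick shell sketch of the arithmetic, assuming a 4MB zone size
(chunk_sectors == 8192) and a position 510 sectors into the zone that starts at
sector 24576:

#!/bin/bash
# round_down(x, y) for a power-of-two y is x & ~(y - 1), as in the kernel macro.
chunk=8192	# 4MB zone size in 512-byte sectors
pos=25086	# 510 sectors into the zone starting at sector 24576
echo "pos -= round_down(pos, chunk) -> $(( pos - (pos & ~(chunk - 1)) ))"	# 510 (zone offset)
echo "pos  = round_down(pos, chunk) -> $(( pos & ~(chunk - 1) ))"		# 24576 (zone start)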