On 2019/08/02 12:57, Jens Axboe wrote: > On 8/1/19 7:05 PM, Damien Le Moal wrote: >> On 2019/08/02 4:51, Jens Axboe wrote: >>> On 8/1/19 4:21 AM, Damien Le Moal wrote: >>>> The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO >>>> (patch 6a43074e2f46) introduced two problems with BIO fragment handling >>>> for direct IOs: >>>> 1) The dio size processed is claculated by incrementing the ret variable >>>> by the size of the bio fragment issued for the dio. However, this size >>>> is obtained directly from bio->bi_iter.bi_size AFTER the bio submission >>>> which may result in referencing the bi_size value after the bio >>>> completed, resulting in an incorrect value use. >>>> 2) The ret variable is not incremented by the size of the last bio >>>> fragment issued for the bio, leading to an invalid IO size being >>>> returned to the user. >>>> >>>> Fix both problem by using dio->size (which is incremented before the bio >>>> submission) to update the value of ret after bio submissions, including >>>> for the last bio fragment issued. >>> >>> Thanks, applied. Do you have a test case? I ran this through the usual >>> xfstests and block bits, but didn't catch anything. >>> >> >> The problem was detected with our weekly RC test runs for zoned block >> devices. RC1 last week was OK, but failures happen on RC2 Monday. We >> never hit a oops for the BIO reference after submission but we were >> getting a lot of unaligned write errors for all types of zoned drive >> tested (real SMR disks, tcmu-runner ZBC handler disks and nullblk in >> zoned mode) using various applications (fio, dd, ...) >> >> Masato isolated the problem to very large direct writes and could >> reliably recreate the problem with a dd doing a single 8MB write to a >> sequential zone. With this case, blktrace showed that the 8MB write >> was split into multiple BIOs (expected) and the whole 8MB being >> cleanly written sequentially. But this was followed by a stream of >> small 4K writes starting around the end of the 8MB chunk, but within >> it, causing unaligned write errors (overwrite in sequential zones not >> being possible). >> >> dd if=/dev/zero of=/dev/nullb0 bs=8M oflag=direct count=1 >> >> On a nullblk disk in zoned mode should recreate the problem, and >> blktrace revealing that dd sees a short write for the 8M IO and issues >> remaining as 4K requests. >> >> Using a regular disk, this however does not generate any error at all, >> which may explain why you did not see any problem. I think we can >> create a blktest case for this using nullblk in zoned mode. Would you >> like us to send one ? > > Thanks for the detailed explanation. I think we should absolutely have a > blktests case for this, even if O_DIRECT isn't strictly block level, > it's definitely in a grey zone that I think we should cover. > OK. We will work on a test case and post it. -- Damien Le Moal Western Digital Research