On 8/1/19 7:05 PM, Damien Le Moal wrote:
> On 2019/08/02 4:51, Jens Axboe wrote:
>> On 8/1/19 4:21 AM, Damien Le Moal wrote:
>>> The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
>>> (patch 6a43074e2f46) introduced two problems with BIO fragment
>>> handling for direct IOs:
>>> 1) The dio size processed is calculated by incrementing the ret
>>> variable by the size of the bio fragment issued for the dio.
>>> However, this size is obtained directly from bio->bi_iter.bi_size
>>> AFTER the bio submission, which may result in referencing bi_size
>>> after the bio has completed, so an incorrect value may be used.
>>> 2) The ret variable is not incremented by the size of the last bio
>>> fragment issued for the dio, leading to an invalid IO size being
>>> returned to the user.
>>>
>>> Fix both problems by using dio->size (which is incremented before
>>> the bio submission) to update the value of ret after bio
>>> submissions, including for the last bio fragment issued.
>>
>> Thanks, applied. Do you have a test case? I ran this through the
>> usual xfstests and block bits, but didn't catch anything.
>>
>
> The problem was detected by our weekly RC test runs for zoned block
> devices. RC1 last week was fine, but failures appeared with RC2 on
> Monday. We never hit an oops from the BIO reference after submission,
> but we were getting a lot of unaligned write errors on all types of
> zoned drives tested (real SMR disks, tcmu-runner ZBC handler disks
> and nullblk in zoned mode) with various applications (fio, dd, ...).
>
> Masato isolated the problem to very large direct writes and could
> reliably recreate it with a dd doing a single 8MB write to a
> sequential zone. In this case, blktrace showed the 8MB write being
> split into multiple BIOs (expected) and the whole 8MB being cleanly
> written sequentially. But this was followed by a stream of small 4K
> writes starting near the end of the 8MB chunk, but within it, causing
> unaligned write errors (overwrites in sequential zones are not
> possible).
>
> dd if=/dev/zero of=/dev/nullb0 bs=8M oflag=direct count=1
>
> on a nullblk disk in zoned mode should recreate the problem, with
> blktrace revealing that dd sees a short write for the 8M IO and
> issues the remainder as 4K requests.
>
> Using a regular disk, however, this does not generate any error at
> all, which may explain why you did not see any problem. I think we
> can create a blktests case for this using nullblk in zoned mode.
> Would you like us to send one?

Thanks for the detailed explanation. I think we should absolutely have
a blktests case for this. Even if O_DIRECT isn't strictly block level,
it's definitely in a grey zone that I think we should cover.

-- 
Jens Axboe
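
For reference, the reproducer Damien describes can be scripted end to
end. This is a minimal sketch, not the actual test that was sent: it
assumes the null_blk module's zoned mode parameters (zoned, and
zone_size in MB, available since kernel 4.20) plus blktrace/blkparse;
the 64MB zone size is an arbitrary choice, and /dev/nullb0 follows
from nr_devices=1:

    #!/bin/sh
    # Reproduce the short O_DIRECT write followed by unaligned 4K
    # retries on a zoned nullblk device. Run as root.

    modprobe -r null_blk
    modprobe null_blk nr_devices=1 zoned=1 zone_size=64

    # Trace the device while issuing a single 8MB direct write to
    # the first (sequential) zone.
    blktrace -d /dev/nullb0 -o repro &
    btpid=$!
    dd if=/dev/zero of=/dev/nullb0 bs=8M oflag=direct count=1
    sleep 1
    kill -INT "$btpid"
    wait "$btpid"

    # On a buggy kernel, the trace shows the 8MB written cleanly as
    # large sequential BIOs, then a stream of 4K writes near the end
    # of the 8MB range failing as unaligned.
    blkparse -i repro
    modprobe -r null_blk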
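
And a sketch of how this could look as the blktests case the thread
converges on. The test group/number is a placeholder, and the helper
names (_have_null_blk, _have_module_param, _init_null_blk,
_exit_null_blk) and layout are assumptions based on blktests
conventions, not the test that was actually merged:

    #!/bin/bash
    # tests/block/NNN (placeholder number)
    # Large O_DIRECT write to a zoned null_blk device: the write
    # must complete in full, with no trailing 4K retries.

    . tests/block/rc
    . common/null_blk

    DESCRIPTION="8MB O_DIRECT write to a zoned null_blk device"
    QUICK=1

    requires() {
        _have_null_blk
        _have_module_param null_blk zoned
    }

    test() {
        echo "Running ${TEST_NAME}"

        if ! _init_null_blk nr_devices=1 zoned=1 zone_size=64; then
            return 1
        fi

        # On a correct kernel this completes as a single full 8MB
        # write; with the bug, dd sees a short write and issues the
        # remainder as 4K writes, which fail as unaligned on the
        # sequential zone and show up as I/O errors in $FULL.
        dd if=/dev/zero of=/dev/nullb0 bs=8M oflag=direct count=1 \
            >> "$FULL" 2>&1

        _exit_null_blk

        echo "Test complete"
    }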