Re: [PATCH] block: Fix __blkdev_direct_IO()

Damien Le Moal <Damien.LeMoal@xxxxxxx> · Fri, 2 Aug 2019 01:05:19 +0000

On 2019/08/02 4:51, Jens Axboe wrote:
> On 8/1/19 4:21 AM, Damien Le Moal wrote:
>> The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
>> (patch 6a43074e2f46) introduced two problems with BIO fragment handling
>> for direct IOs:
>> 1) The dio size processed is claculated by incrementing the ret variable
>> by the size of the bio fragment issued for the dio. However, this size
>> is obtained directly from bio->bi_iter.bi_size AFTER the bio submission
>> which may result in referencing the bi_size value after the bio
>> completed, resulting in an incorrect value use.
>> 2) The ret variable is not incremented by the size of the last bio
>> fragment issued for the bio, leading to an invalid IO size being
>> returned to the user.
>>
>> Fix both problem by using dio->size (which is incremented before the bio
>> submission) to update the value of ret after bio submissions, including
>> for the last bio fragment issued.
> 
> Thanks, applied. Do you have a test case? I ran this through the usual
> xfstests and block bits, but didn't catch anything.
> 

The problem was detected with our weekly RC test runs for zoned block devices.
RC1 last week was OK, but failures happen on RC2 Monday. We never hit a oops for
the BIO reference after submission but we were getting a lot of unaligned write
errors for all types of zoned drive tested (real SMR disks, tcmu-runner ZBC
handler disks and nullblk in zoned mode) using various applications (fio, dd, ...)

Masato isolated the problem to very large direct writes and could reliably
recreate the problem with a dd doing a single 8MB write to a sequential zone.
With this case, blktrace showed that the 8MB write was split into multiple BIOs
(expected) and the whole 8MB being cleanly written sequentially. But this was
followed by a stream of small 4K writes starting around the end of the 8MB
chunk, but within it, causing unaligned write errors (overwrite in sequential
zones not being possible).

dd if=/dev/zero of=/dev/nullb0 bs=8M oflag=direct count=1

On a nullblk disk in zoned mode should recreate the problem, and blktrace
revealing that dd sees a short write for the 8M IO and issues remaining as 4K
requests.

Using a regular disk, this however does not generate any error at all, which may
explain why you did not see any problem. I think we can create a blktest case
for this using nullblk in zoned mode. Would you like us to send one ?

Best regards.

-- 
Damien Le Moal
Western Digital Research