Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO

JeffleXu <jefflexu@xxxxxxxxxxxxxxxxx> · Thu, 10 Dec 2020 10:33:59 +0800

On 12/10/20 5:15 AM, Dave Chinner wrote:
> On Mon, Dec 07, 2020 at 04:40:38PM -0700, Jens Axboe wrote:
>> On 12/6/20 7:21 PM, Dave Chinner wrote:
>>> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
>>>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
>>>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
>>>> IOCB_NOWAIT is set.
>>>>
>>>> Suggested-by: Jeffle Xu <jefflexu@xxxxxxxxxxxxxxxxx>
>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx>
>>>> Signed-off-by: Hao Xu <haoxu@xxxxxxxxxxxxxxxxx>
>>>> ---
>>>>
>>>> Hi all,
>>>> I tested fio io_uring direct read for a file on ext4 filesystem on a
>>>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which
>>>> means REQ_NOWAIT is not set in bio->bi_opf.
>>>
>>> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the
>>> filesystem behaviour, not the block device.
>>>
>>> REQ_NOWAIT can result in partial IO failures because the error is
>>> only reported to the iomap layer via IO completions. Hence we can
>>> split a DIO into multiple bios and have random bios in that IO fail
>>> with EAGAIN because REQ_NOWAIT is set. This error will
>>> get reported to the submitter via completion, and it will override
>>> any of the partial IOs that actually completed.
>>>
>>> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug
>>> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow
>>> NOWAIT DIO across extent boundaries") we'll get silent partial
>>> writes occurring because the second submitted bio in an IO can
>>> trigger EAGAIN errors with partial IO completion having already
>>> occurred.
>>>
>>> Further, we don't allow partial IO completion for DIO on XFS at all.
>>> DIO must be completely submitted and completed or return an error
>>> without having issued any IO at all.  Hence using REQ_NOWAIT for
>>> DIO bios is incorrect and not desirable.
>>
>> What you say makes total sense for a user using RWF_NOWAIT, but it
>> doesn't make a lot of sense for io_uring where we really want
>> IOCB_NOWAIT to be what it suggests it is - don't wait for other IO to
>> complete, if avoidable. One of the things that really suck with
>> aio/libai is the "yeah it's probably async, but lol, might not be"
>> aspect of it.
> 
> Sure, but we have no way of telling what semantics the IO issuer
> actually requires from above. And because IOCB_NOWAIT behaviour is
> directly exposed to userspace by RWF_NOWAIT, that's the behaviour we
> have to implement.
> 
>> For io_uring, if we do get -EAGAIN, we'll retry without NOWAIT set. So
>> the concern about fractured/short writes doesn't bubble up to the
>> application. Hence we really want an IOCB_NOWAIT_REALLY on that side,
>> instead of the poor mans IOCB_MAYBE_NOWAIT semantics.
> 
> Yup, perhaps what we really want is a true IOCB_NONBLOCK flag as an
> internal kernel implementation. i.e. don't block anywhere in the
> stack, and the caller must handle retrying/completing the entire IO
> regardless of where the -EAGAIN comes from during the IO, including
> from a partial completion that the caller is waiting for...
> 
> i.e. rather than hacking around this specific instance of "it blocks
> and we don't want it to", define the semantics and behaviour of a
> fully non-blocking IO through all layers from the VFS down to the
> hardware and let's implement that. Then we can stop playing
> whack-a-mole with all the "but it blocks when I do this, doctor!"
> issues that we seem to keep having.... :)
> 

>From my opinion, the semantics of NOWAIT is clear, just like Jens said,
"don't wait, if**avoidable**". Also it should be the responsibility of
the callers who set this flag to resubmit the failed IO once a -EAGAIN
returned. e.g. if the user app sets RWF_NOWAIT, then it's the
responsibility of the user app to resubmit the failed IO once -EAGAIN
returned. If io_uring layer newly set IOCB_NONBLOCK flag, then it's the
responsibility of io_uring to resubmit.

The limitation here is that once a big DIO/bio got split, then the
atomicity of whole DIO/bio can not be guaranteed. This seems a
limitation of direct IO routine and block layer, and the work of above
layers e.g. filesystem, is needed to guarantee the atomicity.

-- 
Thanks,
Jeffle