On 12/10/20 5:15 AM, Dave Chinner wrote: > On Mon, Dec 07, 2020 at 04:40:38PM -0700, Jens Axboe wrote: >> On 12/6/20 7:21 PM, Dave Chinner wrote: >>> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote: >>>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set >>>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when >>>> IOCB_NOWAIT is set. >>>> >>>> Suggested-by: Jeffle Xu <jefflexu@xxxxxxxxxxxxxxxxx> >>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> >>>> Signed-off-by: Hao Xu <haoxu@xxxxxxxxxxxxxxxxx> >>>> --- >>>> >>>> Hi all, >>>> I tested fio io_uring direct read for a file on ext4 filesystem on a >>>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which >>>> means REQ_NOWAIT is not set in bio->bi_opf. >>> >>> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the >>> filesystem behaviour, not the block device. >>> >>> REQ_NOWAIT can result in partial IO failures because the error is >>> only reported to the iomap layer via IO completions. Hence we can >>> split a DIO into multiple bios and have random bios in that IO fail >>> with EAGAIN because REQ_NOWAIT is set. This error will >>> get reported to the submitter via completion, and it will override >>> any of the partial IOs that actually completed. >>> >>> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug >>> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow >>> NOWAIT DIO across extent boundaries") we'll get silent partial >>> writes occurring because the second submitted bio in an IO can >>> trigger EAGAIN errors with partial IO completion having already >>> occurred. >>> >>> Further, we don't allow partial IO completion for DIO on XFS at all. >>> DIO must be completely submitted and completed or return an error >>> without having issued any IO at all. Hence using REQ_NOWAIT for >>> DIO bios is incorrect and not desirable. >> >> What you say makes total sense for a user using RWF_NOWAIT, but it >> doesn't make a lot of sense for io_uring where we really want >> IOCB_NOWAIT to be what it suggests it is - don't wait for other IO to >> complete, if avoidable. One of the things that really suck with >> aio/libai is the "yeah it's probably async, but lol, might not be" >> aspect of it. > > Sure, but we have no way of telling what semantics the IO issuer > actually requires from above. And because IOCB_NOWAIT behaviour is > directly exposed to userspace by RWF_NOWAIT, that's the behaviour we > have to implement. > >> For io_uring, if we do get -EAGAIN, we'll retry without NOWAIT set. So >> the concern about fractured/short writes doesn't bubble up to the >> application. Hence we really want an IOCB_NOWAIT_REALLY on that side, >> instead of the poor mans IOCB_MAYBE_NOWAIT semantics. > > Yup, perhaps what we really want is a true IOCB_NONBLOCK flag as an > internal kernel implementation. i.e. don't block anywhere in the > stack, and the caller must handle retrying/completing the entire IO > regardless of where the -EAGAIN comes from during the IO, including > from a partial completion that the caller is waiting for... > > i.e. rather than hacking around this specific instance of "it blocks > and we don't want it to", define the semantics and behaviour of a > fully non-blocking IO through all layers from the VFS down to the > hardware and let's implement that. Then we can stop playing > whack-a-mole with all the "but it blocks when I do this, doctor!" > issues that we seem to keep having.... :) > >From my opinion, the semantics of NOWAIT is clear, just like Jens said, "don't wait, if**avoidable**". Also it should be the responsibility of the callers who set this flag to resubmit the failed IO once a -EAGAIN returned. e.g. if the user app sets RWF_NOWAIT, then it's the responsibility of the user app to resubmit the failed IO once -EAGAIN returned. If io_uring layer newly set IOCB_NONBLOCK flag, then it's the responsibility of io_uring to resubmit. The limitation here is that once a big DIO/bio got split, then the atomicity of whole DIO/bio can not be guaranteed. This seems a limitation of direct IO routine and block layer, and the work of above layers e.g. filesystem, is needed to guarantee the atomicity. -- Thanks, Jeffle