Re: [PATCH 1/1] iomap: propagate nowait to block layer

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 26 Feb 2025 15:52:34 +1100

On Wed, Feb 26, 2025 at 01:33:58AM +0000, Pavel Begunkov wrote:
> There are reports of high io_uring submission latency for ext4 and xfs,
> which is due to iomap not propagating nowait flag to the block layer
> resulting in waiting for IO during tag allocation.
> 
> Cc: stable@xxxxxxxxxxxxxxx
> Link: https://github.com/axboe/liburing/issues/826#issuecomment-2674131870
> Reported-by: wu lei <uwydoc@xxxxxxxxx>
> Signed-off-by: Pavel Begunkov <asml.silence@xxxxxxxxx>
> ---
>  fs/iomap/direct-io.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index b521eb15759e..25c5e87dbd94 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -81,6 +81,9 @@ static void iomap_dio_submit_bio(const struct iomap_iter *iter,
>  		WRITE_ONCE(iocb->private, bio);
>  	}
>  
> +	if (iocb->ki_flags & IOCB_NOWAIT)
> +		bio->bi_opf |= REQ_NOWAIT;

ISTR that this was omitted on purpose because REQ_NOWAIT doesn't
work in the way iomap filesystems expect IO to behave.

I think it has to do with large direct IOs that require multiple
calls to submit_bio(). Each bio that is allocated and submitted
takes a reference to the iomap_dio object, and the iomap_dio is not
completed until that reference count goes to zero.

hence if we have submitted a series of bios in a IOCB_NOWAIT DIO
and then the next bio submission in the DIO triggers a REQ_NOWAIT
condition, that bio is marked with a BLK_STS_AGAIN and completed.
This error is then caught by the iomap dio bio completion function,
recorded in the iomap_dio structure, but because there is still
bios in flight, the iomap_dio ref count does not fall to zero and so
the DIO itself is not completed.

Then submission loops again, sees dio->error is set and aborts
submission. Because this is AIO, and the iomap_dio refcount is
non-zero at this point, __iomap_dio_rw() returns -EIOCBQUEUED.
It does not return the -EAGAIN state that was reported to bio
completion because the overall DIO has not yet been completed
and all the IO completion status gathered.

Hence when the in flight async bios actually complete, they drop the
iomap dio reference count to zero, iomap_dio_complete() is called,
and the BLK_STS_AGAIN error is gathered from the previous submission
failure. This then calls AIO completion, and reports a -EAGAIN error
to the AIO/io_uring completion code.

IOWs, -EAGAIN is *not reported to the IO submitter* that needs
this information to defer and resubmit the IO - it is reported to IO
completion where it is completely useless and, most likely, not in a
context that can resubmit the IO.

Put simply: any code that submits multiple bios (either individually
or as a bio chain) for a single high level IO can not use REQ_NOWAIT
reliably for async IO submission.

We have similar limitations on IO polling (IOCB_HIPRI) in iomap, but
I'm not sure if REQ_NOWAIT can be handled the same way. i.e. only
setting REQ_NOWAIT on the first bio means that the second+ bio can
still block and cause latency issues.

So, yeah, fixing this source of latency is not as simple as just
setting REQ_NOWAIT. I don't know if there is a better solution that
what we currently have, but causing large AIO DIOs to
randomly fail with EAGAIN reported at IO completion (with the likely
result of unexpected data corruption) is far worse behaviour that
occasionally having to deal with a long IO submission latency.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx