Re: [PATCH v2 1/1] iomap: propagate nowait to block layer

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Wed, 5 Mar 2025 06:10:59 -0800

On Wed, Mar 05, 2025 at 12:19:46PM +1100, Dave Chinner wrote:
> I really don't care about what io_uring thinks or does. If the block
> layer REQ_NOWAIT semantics are unusable for non-blocking IO
> submission, then that's the problem that needs fixing. This isn't a
> problem we can (or should) try to work around in the iomap layer.

Agreed.  The problem are the block layer semantics.  iomap/xfs really
just is the messenger here.

> For example: we have RAID5 witha 64kB chunk size, so max REQ_NOWAIT
> io size is 64kB according to the queue limits. However, if we do a
> 64kB IO at a 60kB chunk offset, that bio is going to be split into a
> 4kB bio and a 60kB bio because they are issued to different physical
> devices.....
> 
> There is no way the bio submitter can know that this behaviour will
> occur, nor should they even be attempting to predict when/if such
> splitting may occur.

And for something that has a real block allocator it could also be
entirely dynamic.  But I'm not sure if dm-thinp or bcache do anything
like that at the moment.

> > Are you only concerned about the size being too restrictive or do you
> > see any other problems?
> 
> I'm concerned abou the fact that REQ_NOWAIT is not usable as it
> stands. We've identified bio chaining as an issue, now bio splitting
> is an issue, and I'm sure if we look further there will be other
> cases that are issues (e.g. bounce buffers).
> 
> The underlying problem here is that bio submission errors are
> reported through bio completion mechanisms, not directly back to the
> submitting context. Fix that problem in the block layer API, and
> then iomap can use REQ_NOWAIT without having to care about what the
> block layer is doing under the covers.

Exactly.  Either they need to be reported synchronously, or maybe we
need a block layer hook in bio_endio that retries the given bio on a
workqueue without ever bubbling up to the caller.  But allowing delayed
BLK_STS_AGAIN is going to mess up any non-trivial caller.  But even
for the plain block device is will cause duplicate I/O where some
blocks have already been read/written and then will get resubmitted.

I'm not sure that breaks any atomicity assumptions as we don't really
give explicit ones for block devices (except maybe for the new
RWF_ATOMIC flag?), but it certainly is unexpected and suboptimal.