On Wed, Mar 05, 2025 at 12:19:46PM +1100, Dave Chinner wrote:
> I really don't care about what io_uring thinks or does. If the block
> layer REQ_NOWAIT semantics are unusable for non-blocking IO
> submission, then that's the problem that needs fixing. This isn't a
> problem we can (or should) try to work around in the iomap layer.

Agreed.  The problem is the block layer semantics.  iomap/xfs really
is just the messenger here.

> For example: we have RAID5 with a 64kB chunk size, so max REQ_NOWAIT
> io size is 64kB according to the queue limits. However, if we do a
> 64kB IO at a 60kB chunk offset, that bio is going to be split into a
> 4kB bio and a 60kB bio because they are issued to different physical
> devices.....
>
> There is no way the bio submitter can know that this behaviour will
> occur, nor should they even be attempting to predict when/if such
> splitting may occur.

And for something that has a real block allocator it could also be
entirely dynamic.  But I'm not sure if dm-thinp or bcache do anything
like that at the moment.

> > Are you only concerned about the size being too restrictive or do you
> > see any other problems?
>
> I'm concerned about the fact that REQ_NOWAIT is not usable as it
> stands. We've identified bio chaining as an issue, now bio splitting
> is an issue, and I'm sure if we look further there will be other
> cases that are issues (e.g. bounce buffers).
>
> The underlying problem here is that bio submission errors are
> reported through bio completion mechanisms, not directly back to the
> submitting context. Fix that problem in the block layer API, and
> then iomap can use REQ_NOWAIT without having to care about what the
> block layer is doing under the covers.

Exactly.  Either these errors need to be reported synchronously, or
maybe we need a block layer hook in bio_endio that retries the given
bio on a workqueue without ever bubbling up to the caller.  But
allowing delayed BLK_STS_AGAIN is going to mess up any non-trivial
caller.
But even for the plain block device it will cause duplicate I/O, where
some blocks have already been read/written and then get resubmitted.
I'm not sure that breaks any atomicity assumptions, as we don't really
give explicit ones for block devices (except maybe for the new
RWF_ATOMIC flag?), but it certainly is unexpected and suboptimal.
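
To make the RAID5 example quoted above concrete, here is a minimal
userspace sketch of just the arithmetic.  It is not the md or block
layer splitting code; struct io_piece and split_at_chunks() are made-up
names for illustration, and it assumes splits happen only at fixed
chunk boundaries:

```c
#include <stddef.h>
#include <stdint.h>

/* One piece of an I/O after splitting at a chunk boundary (hypothetical). */
struct io_piece {
	uint64_t offset;	/* byte offset of this piece */
	uint64_t len;		/* length of this piece in bytes */
};

/*
 * Split the byte range [offset, offset + len) at every chunk_size
 * boundary.  Fills in at most max_pieces entries of out and returns the
 * number of pieces the range needs, which may be larger than max_pieces.
 * Returns 0 for a zero chunk_size, which this sketch treats as invalid.
 */
static size_t split_at_chunks(uint64_t offset, uint64_t len,
			      uint64_t chunk_size,
			      struct io_piece *out, size_t max_pieces)
{
	size_t n = 0;

	if (chunk_size == 0)
		return 0;

	while (len > 0) {
		/* bytes left before the next chunk boundary */
		uint64_t room = chunk_size - (offset % chunk_size);
		uint64_t piece = len < room ? len : room;

		if (n < max_pieces) {
			out[n].offset = offset;
			out[n].len = piece;
		}
		n++;
		offset += piece;
		len -= piece;
	}
	return n;
}
```

Calling split_at_chunks(60 * 1024, 64 * 1024, 64 * 1024, ...) yields two
pieces of 4kB and 60kB, even though the I/O size itself is within the
64kB limit.  The submitter only sees the limit, not the offset-dependent
split.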