On Tue, Mar 04, 2025 at 10:47:48PM +0000, Pavel Begunkov wrote: > On 3/4/25 21:11, Dave Chinner wrote: > > > @@ -374,6 +379,23 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, > > > if (!(dio->flags & (IOMAP_DIO_INLINE_COMP|IOMAP_DIO_CALLER_COMP))) > > > dio->iocb->ki_flags &= ~IOCB_HIPRI; > > > + bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic); > > > + > > > + if (!is_sync_kiocb(dio->iocb) && (dio->iocb->ki_flags & IOCB_NOWAIT)) { > > > + /* > > > + * This is nonblocking IO, and we might need to allocate > > > + * multiple bios. In this case, as we cannot guarantee that > > > + * one of the sub bios will not fail getting issued FOR NOWAIT > > > + * and as error results are coalesced across all of them, ask > > > + * for a retry of this from blocking context. > > > + */ > > > + if (bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS + 1) > > > > + BIO_MAX_VECS) > > > + return -EAGAIN; > > > + > > > + bio_opf |= REQ_NOWAIT; > > > + } > > > > Ok, so this allows a max sized bio to be used. So, what, 1MB on 4kB > > page size is the maximum IO size for IOCB_NOWAIT now? I bet that's > > not documented anywhere.... > > > > Ah. This doesn't fix the problem at all. > > > > Say, for exmaple, I have a hardware storage device with a max > > hardware IO size of 128kB. This is from the workstation I'm typing > > this email on: > > > > $ cat /sys/block/nvme0n1/queue/max_hw_sectors_kb > > 128 > > $ cat /sys/block/nvme0n1/queue/max_segments > > 33 > > $ > > > > We build a 1MB bio above, set REQ_NOWAIT, then: > > > > submit_bio > > .... > > blk_mq_submit_bio > > __bio_split_to_limits(bio, &q->limits, &nr_segs); > > bio_split_rw() > > ..... > > split: > > ..... > > /* > > * We can't sanely support splitting for a REQ_NOWAIT bio. End it > > * with EAGAIN if splitting is required and return an error pointer. > > */ > > if (bio->bi_opf & REQ_NOWAIT) > > return -EAGAIN; > > > > > > So, REQ_NOWAIT effectively limits bio submission to the maximum > > single IO size of the underlying storage. So, we can't use > > REQ_NOWAIT without actually looking at request queue limits before > > we start building the IO - yuk. > > > > REQ_NOWAIT still feels completely unusable to me.... > > Not great but better than not using the async path at all (i.e. executing > by a kernel worker) as far as io_uring concerned. I really don't care about what io_uring thinks or does. If the block layer REQ_NOWAIT semantics are unusable for non-blocking IO submission, then that's the problem that needs fixing. This isn't a problem we can (or should) try to work around in the iomap layer. > It should cover a good > share of users, and io_uring has a fallback path, so userspace won't even > know about the error. The overhead per byte is less biting for 128K as well. That 128kb limit I demonstrated is not a fixed number. Stacking block devices (e.g. software RAID devices) can split bios at any sector offset within the submitted bio. For example: we have RAID5 witha 64kB chunk size, so max REQ_NOWAIT io size is 64kB according to the queue limits. However, if we do a 64kB IO at a 60kB chunk offset, that bio is going to be split into a 4kB bio and a 60kB bio because they are issued to different physical devices..... There is no way the bio submitter can know that this behaviour will occur, nor should they even be attempting to predict when/if such splitting may occur. > The patch already limits the change to async users only, but if you're > concerned about non-io_uring users, I can try to narrow it to io_uring > requests only, even though I don't think it's a good way. I'm concerned about any calling context that sets IOCB_NOWAIT. I don't care if it's io_uring, AIO or a synchronous preadv2(RWF_NOWAIT) call: we still have the same issue that REQ_NOWAIT submission side -EAGAIN is only reported through IO completion callbacks. > Are you only concerned about the size being too restrictive or do you > see any other problems? I'm concerned abou the fact that REQ_NOWAIT is not usable as it stands. We've identified bio chaining as an issue, now bio splitting is an issue, and I'm sure if we look further there will be other cases that are issues (e.g. bounce buffers). The underlying problem here is that bio submission errors are reported through bio completion mechanisms, not directly back to the submitting context. Fix that problem in the block layer API, and then iomap can use REQ_NOWAIT without having to care about what the block layer is doing under the covers. -Dave. -- Dave Chinner david@xxxxxxxxxxxxx