On Fri, Mar 24, 2023 at 08:35:38AM -0600, Jens Axboe wrote:
> It's very common to have applications that use vectored reads or writes,
> even if they only pass in a single segment. Obviously they should be
> using read/write at that point, but...

Yeah, it is like fixing an application issue on the kernel side, :-)

>
> Vectored IO comes with the downside of needing to retain iovec state,
> and hence they require an allocation and state copy if they end up
> getting deferred. Additionally, they also require extra cleanup when
> completed, as the allocated state memory has to be freed.
>
> Automatically transform single segment IORING_OP_{READV,WRITEV} into
> IORING_OP_{READ,WRITE}, and hence into an ITER_UBUF. Outside of being
> more efficient if needing deferral, ITER_UBUF is also more efficient
> for normal processing compared to ITER_IOVEC, as they don't require
> iteration. The latter is apparent when running peak testing, where
> using IORING_OP_READV to randomly read 24 drives previously scored:
>
> IOPS=72.54M, BW=35.42GiB/s, IOS/call=32/31
> IOPS=73.35M, BW=35.81GiB/s, IOS/call=32/31
> IOPS=72.71M, BW=35.50GiB/s, IOS/call=32/31
> IOPS=73.29M, BW=35.78GiB/s, IOS/call=32/32
> IOPS=73.45M, BW=35.86GiB/s, IOS/call=32/32
> IOPS=73.19M, BW=35.74GiB/s, IOS/call=31/32
> IOPS=72.89M, BW=35.59GiB/s, IOS/call=32/31
> IOPS=73.07M, BW=35.68GiB/s, IOS/call=32/32
>
> and after the change we get:
>
> IOPS=77.31M, BW=37.75GiB/s, IOS/call=32/31
> IOPS=77.32M, BW=37.75GiB/s, IOS/call=32/32
> IOPS=77.45M, BW=37.81GiB/s, IOS/call=31/31
> IOPS=77.47M, BW=37.83GiB/s, IOS/call=32/32
> IOPS=77.14M, BW=37.67GiB/s, IOS/call=32/32
> IOPS=77.14M, BW=37.66GiB/s, IOS/call=31/31
> IOPS=77.37M, BW=37.78GiB/s, IOS/call=32/32
> IOPS=77.25M, BW=37.72GiB/s, IOS/call=32/32
>
> which is a nice win as well.
>
> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>
> ---
>
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 4c233910e200..5c998754cb17 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -402,7 +402,22 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
>  			      req->ctx->compat);
>  	if (unlikely(ret < 0))
>  		return ERR_PTR(ret);
> -	return iovec;
> +	if (iter->nr_segs != 1)
> +		return iovec;
> +	/*
> +	 * Convert to non-vectored request if we have a single segment. If we
> +	 * need to defer the request, then we no longer have to allocate and
> +	 * maintain a struct io_async_rw. Additionally, we won't have cleanup
> +	 * to do at completion time
> +	 */
> +	rw->addr = (unsigned long) iter->iov[0].iov_base;
> +	rw->len = iter->iov[0].iov_len;
> +	iov_iter_ubuf(iter, ddir, iter->iov[0].iov_base, rw->len);
> +	/* readv -> read distance is the same as writev -> write */
> +	BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
> +		     (IORING_OP_WRITE - IORING_OP_WRITEV));
> +	req->opcode += (IORING_OP_READ - IORING_OP_READV);

It is a bit fragile to change ->opcode here, which may need matched
callbacks for the two OPs, and it also causes inconsistent opcodes in
traces.

I am wondering why not play this magic in io_prep_rw() from the
beginning?

Thanks,
Ming
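
P.S. For illustration only, a rough sketch of the io_prep_rw() idea
above. The helper name is made up, compat tasks and REQ_F_BUFFER_SELECT
are ignored, and the length/overflow validation normally done by
import_iovec() is not shown:

/*
 * Sketch: rewrite a single-segment READV/WRITEV at prep time, before
 * any per-opcode or deferral state has been set up.
 */
static int io_prep_rw_single_vec(struct io_kiocb *req,
				 const struct io_uring_sqe *sqe)
{
	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
	struct iovec __user *uvec;
	struct iovec iov;

	/* only a plain single-segment vectored request qualifies */
	if (READ_ONCE(sqe->len) != 1 || req->ctx->compat)
		return 0;

	/* sqe->addr points at the iovec array for READV/WRITEV */
	uvec = u64_to_user_ptr(READ_ONCE(sqe->addr));
	if (copy_from_user(&iov, uvec, sizeof(iov)))
		return -EFAULT;

	/* point addr/len at the user buffer, as a non-vectored op would */
	rw->addr = (unsigned long) iov.iov_base;
	rw->len = iov.iov_len;

	/*
	 * Done before the request has been dispatched anywhere; an
	 * alternative would be a flag telling the issue path to build
	 * an ITER_UBUF instead of bumping ->opcode.
	 */
	req->opcode += (IORING_OP_READ - IORING_OP_READV);
	return 0;
}

Whether to bump ->opcode this early or to carry a flag so the issue
path builds an ITER_UBUF directly is the open design choice; the sketch
only shows that the conversion can happen before any deferral state
exists.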