On Fri, Mar 24, 2023 at 08:35:38AM -0600, Jens Axboe wrote:
> It's very common to have applications that use vectored reads or writes,
> even if they only pass in a single segment. Obviously they should be
> using read/write at that point, but...

Yeah, it is like fixing an application issue on the kernel side, :-)

>
> Vectored IO comes with the downside of needing to retain iovec state,
> and hence they require an allocation and state copy if they end up
> getting deferred. Additionally, they also require extra cleanup when
> completed, as the allocated state memory has to be freed.
>
> Automatically transform single segment IORING_OP_{READV,WRITEV} into
> IORING_OP_{READ,WRITE}, and hence into an ITER_UBUF. Outside of being
> more efficient if needing deferral, ITER_UBUF is also more efficient
> for normal processing compared to ITER_IOVEC, as they don't require
> iteration. The latter is apparent when running peak testing, where
> using IORING_OP_READV to randomly read 24 drives previously scored:
>
> IOPS=72.54M, BW=35.42GiB/s, IOS/call=32/31
> IOPS=73.35M, BW=35.81GiB/s, IOS/call=32/31
> IOPS=72.71M, BW=35.50GiB/s, IOS/call=32/31
> IOPS=73.29M, BW=35.78GiB/s, IOS/call=32/32
> IOPS=73.45M, BW=35.86GiB/s, IOS/call=32/32
> IOPS=73.19M, BW=35.74GiB/s, IOS/call=31/32
> IOPS=72.89M, BW=35.59GiB/s, IOS/call=32/31
> IOPS=73.07M, BW=35.68GiB/s, IOS/call=32/32
>
> and after the change we get:
>
> IOPS=77.31M, BW=37.75GiB/s, IOS/call=32/31
> IOPS=77.32M, BW=37.75GiB/s, IOS/call=32/32
> IOPS=77.45M, BW=37.81GiB/s, IOS/call=31/31
> IOPS=77.47M, BW=37.83GiB/s, IOS/call=32/32
> IOPS=77.14M, BW=37.67GiB/s, IOS/call=32/32
> IOPS=77.14M, BW=37.66GiB/s, IOS/call=31/31
> IOPS=77.37M, BW=37.78GiB/s, IOS/call=32/32
> IOPS=77.25M, BW=37.72GiB/s, IOS/call=32/32
>
> which is a nice win as well.
>
> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>
> ---
>
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 4c233910e200..5c998754cb17 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -402,7 +402,22 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
>  			      req->ctx->compat);
>  	if (unlikely(ret < 0))
>  		return ERR_PTR(ret);
> -	return iovec;
> +	if (iter->nr_segs != 1)
> +		return iovec;
> +	/*
> +	 * Convert to non-vectored request if we have a single segment. If we
> +	 * need to defer the request, then we no longer have to allocate and
> +	 * maintain a struct io_async_rw. Additionally, we won't have cleanup
> +	 * to do at completion time
> +	 */
> +	rw->addr = (unsigned long) iter->iov[0].iov_base;
> +	rw->len = iter->iov[0].iov_len;
> +	iov_iter_ubuf(iter, ddir, iter->iov[0].iov_base, rw->len);
> +	/* readv -> read distance is the same as writev -> write */
> +	BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
> +		     (IORING_OP_WRITE - IORING_OP_WRITEV));
> +	req->opcode += (IORING_OP_READ - IORING_OP_READV);

It is a bit fragile to change ->opcode here, which may need matched
callbacks for the two OPs, and it also causes inconsistent opcodes in
traces.

I am wondering why not play this magic in io_prep_rw() from the
beginning?

Thanks,
Ming
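
P.S. For illustration only, a rough sketch of the io_prep_rw() idea
above. The helper name is made up, compat tasks and REQ_F_BUFFER_SELECT
are ignored, and the length/overflow validation normally done by
import_iovec() is not shown:

/*
 * Sketch: rewrite a single-segment READV/WRITEV at prep time, before
 * any per-opcode or deferral state has been set up.
 */
static int io_prep_rw_single_vec(struct io_kiocb *req,
				 const struct io_uring_sqe *sqe)
{
	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
	struct iovec __user *uvec;
	struct iovec iov;

	/* only a plain single-segment vectored request qualifies */
	if (READ_ONCE(sqe->len) != 1 || req->ctx->compat)
		return 0;

	/* sqe->addr points at the iovec array for READV/WRITEV */
	uvec = u64_to_user_ptr(READ_ONCE(sqe->addr));
	if (copy_from_user(&iov, uvec, sizeof(iov)))
		return -EFAULT;

	/* point addr/len at the user buffer, as a non-vectored op would */
	rw->addr = (unsigned long) iov.iov_base;
	rw->len = iov.iov_len;

	/*
	 * Done before the request has been dispatched anywhere; an
	 * alternative would be a flag telling the issue path to build
	 * an ITER_UBUF instead of bumping ->opcode.
	 */
	req->opcode += (IORING_OP_READ - IORING_OP_READV);
	return 0;
}

Whether to bump ->opcode this early or to carry a flag so the issue
path builds an ITER_UBUF directly is the open design choice; the sketch
only shows that the conversion can happen before any deferral state
exists.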