On Thu, Mar 21, 2024 at 4:28 AM Jens Axboe <axboe@xxxxxxxxx> wrote:
>
> The previous commit turned on async data for uring_cmd, and did the
> basic conversion of setting everything up on the prep side. However, for
> a lot of use cases, we'll get -EIOCBQUEUED on issue, which means we do
> not need a persistent big SQE copied.
>
> Unless we're going async immediately, defer copying the double SQE until
> we know we have to.
>
> This greatly reduces the overhead of such commands, as evidenced by
> a perf diff from before and after this change:
>
>     10.60%   -8.58%  [kernel.vmlinux]  [k] io_uring_cmd_prep
>
> where the prep side drops from 10.60% to ~2%, which is more expected.
> Performance also rises from ~113M IOPS to ~122M IOPS, bringing us back
> to where it was before the async command prep.
>
> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> ---
>  io_uring/uring_cmd.c | 25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index 9bd0ba87553f..92346b5d9f5b 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -182,12 +182,18 @@ static int io_uring_cmd_prep_setup(struct io_kiocb *req,
>  	struct uring_cache *cache;
>
>  	cache = io_uring_async_get(req);
> -	if (cache) {
> -		memcpy(cache->sqes, sqe, uring_sqe_size(req->ctx));
> -		ioucmd->sqe = req->async_data;
> +	if (unlikely(!cache))
> +		return -ENOMEM;
> +
> +	if (!(req->flags & REQ_F_FORCE_ASYNC)) {
> +		/* defer memcpy until we need it */
> +		ioucmd->sqe = sqe;
>  		return 0;
>  	}
> -	return -ENOMEM;
> +
> +	memcpy(req->async_data, sqe, uring_sqe_size(req->ctx));
> +	ioucmd->sqe = req->async_data;
> +	return 0;
>  }
>
>  int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> @@ -245,8 +251,15 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
>  	}
>
>  	ret = file->f_op->uring_cmd(ioucmd, issue_flags);
> -	if (ret == -EAGAIN || ret == -EIOCBQUEUED)
> -		return ret;
> +	if (ret == -EAGAIN) {
> +		struct uring_cache *cache = req->async_data;
> +
> +		if (ioucmd->sqe != (void *) cache)
> +			memcpy(cache, ioucmd->sqe, uring_sqe_size(req->ctx));
> +		return -EAGAIN;
> +	} else if (ret == -EIOCBQUEUED) {
> +		return -EIOCBQUEUED;
> +	}
>
>  	if (ret < 0)
>  		req_set_fail(req);
> --
> 2.43.0
>

The io_uring_cmd plumbing part of this series looks good to me. I tested
it with io_uring nvme-passthrough on my setup with two Optanes and there
is no drop in performance either [1].

For this and the previous patch,

Tested-by: Anuj Gupta <anuj20.g@xxxxxxxxxxx>
Reviewed-by: Anuj Gupta <anuj20.g@xxxxxxxxxxx>

[1]
# taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -O0 -F1 -B1 -u1 -n2 -r4 /dev/ng0n1 /dev/ng2n1
submitter=1, tid=7166, file=/dev/ng2n1, nfiles=1, node=-1
submitter=0, tid=7165, file=/dev/ng0n1, nfiles=1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=1, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
Exiting on timeout
Maximum IOPS=10.04M

--
Anuj Gupta
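
[Editor's note] The core idea of the patch quoted above is: allocate the
async backing store at prep time, but leave ioucmd->sqe pointing at the
submission-queue memory, and only pay for the memcpy when the request is
forced async up front or comes back from issue with -EAGAIN. Below is a
minimal, self-contained user-space sketch of that deferred-copy pattern.
All names here (fake_sqe, fake_req, prep, issue) are invented for
illustration, and EIOCBQUEUED is defined locally because it is a
kernel-internal value; this is not the io_uring code itself.

/* deferred_copy_sketch.c - illustrative only, not kernel code */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define EIOCBQUEUED 529		/* kernel-internal value, redefined for this sketch */
#define SQE_SIZE 128

struct fake_sqe {
	unsigned char data[SQE_SIZE];
};

struct fake_req {
	const struct fake_sqe *sqe;	/* may point at caller-owned memory */
	struct fake_sqe *async_copy;	/* owned backing store, filled lazily */
	bool force_async;
};

/* Prep: allocate owned storage, but copy only if we know we will go async. */
static int prep(struct fake_req *req, const struct fake_sqe *sqe)
{
	req->async_copy = malloc(sizeof(*req->async_copy));
	if (!req->async_copy)
		return -ENOMEM;

	if (!req->force_async) {
		/* defer the memcpy until we actually need it */
		req->sqe = sqe;
		return 0;
	}
	memcpy(req->async_copy, sqe, sizeof(*sqe));
	req->sqe = req->async_copy;
	return 0;
}

/* Issue: only the -EAGAIN (retry later) path has to stash a private copy. */
static int issue(struct fake_req *req, int backend_ret)
{
	if (backend_ret == -EAGAIN) {
		if (req->sqe != req->async_copy) {
			memcpy(req->async_copy, req->sqe, sizeof(*req->async_copy));
			req->sqe = req->async_copy;
		}
		return -EAGAIN;
	}
	return backend_ret;	/* e.g. -EIOCBQUEUED: the copy is never made */
}

int main(void)
{
	struct fake_sqe sqe = { .data = { 0x42 } };
	struct fake_req req = { .force_async = false };

	if (prep(&req, &sqe))
		return 1;
	/* fast path: backend queued the command, so no memcpy ever happened */
	printf("issue -> %d\n", issue(&req, -EIOCBQUEUED));
	free(req.async_copy);
	return 0;
}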