On 9/2/22 1:32 PM, Jens Axboe wrote:
> On 9/2/22 12:46 PM, Kanchan Joshi wrote:
>> On Fri, Sep 02, 2022 at 10:32:16AM -0600, Jens Axboe wrote:
>>> On 9/2/22 10:06 AM, Jens Axboe wrote:
>>>> On 9/2/22 9:16 AM, Kanchan Joshi wrote:
>>>>> Hi,
>>>>>
>>>>> Currently uring-cmd lacks the ability to leverage the pre-registered
>>>>> buffers. This series adds the support in uring-cmd, and plumbs
>>>>> nvme passthrough to work with it.
>>>>>
>>>>> Using registered-buffers showed peak-perf hike from 1.85M to 2.17M IOPS
>>>>> in my setup.
>>>>>
>>>>> Without fixedbufs
>>>>> *****************
>>>>> # taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
>>>>> submitter=0, tid=5256, file=/dev/ng0n1, node=-1
>>>>> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
>>>>> Engine=io_uring, sq_ring=128, cq_ring=128
>>>>> IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
>>>>> IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
>>>>> IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
>>>>> ^CExiting on signal
>>>>> Maximum IOPS=1.85M
>>>>
>>>> With the poll support queued up, I ran this one as well. tldr is:
>>>>
>>>> bdev (non pt)	122M IOPS
>>>> irq driven	51-52M IOPS
>>>> polled		71M IOPS
>>>> polled+fixed	78M IOPS
>>
>> Except the first one, are the remaining three entries for passthru?
>> Somehow I didn't see that big of a gap. I will try to align my setup
>> in the coming days.
>
> Right, sorry, it was badly labeled. The first one is bdev with polling,
> registered buffers, etc. The others are all passthrough mode. Polled
> goes to 74M with the caching fix, so it's about a 74M -> 82M bump using
> registered buffers with passthrough and polling.
>
>>> polled+fixed	82M
>>>
>>> I suspect the remainder is due to the lack of batching on the request
>>> freeing side, at least some of it. Haven't really looked deeper yet.
>>>
>>> One issue I saw - try to use passthrough polling without having any
>>> poll queues defined and it'll stall, just spinning on completions. You
>>> need to ensure that these are processed as well - look at how the
>>> non-passthrough io_uring poll path handles it.
>>
>> Had tested this earlier, and it used to run fine. And it does not now.
>> I see that IOs are getting completed, the irq completion is arriving in
>> nvme, and it is triggering the task-work based completion (by calling
>> io_uring_cmd_complete_in_task). But the task-work never got called, and
>> therefore no completion happened.
>>
>> io_uring_cmd_complete_in_task -> io_req_task_work_add -> __io_req_task_work_add
>>
>> It seems the task work did not get added. Something about the newly
>> added IORING_SETUP_DEFER_TASKRUN changes the scenario.
>>
>> static inline void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
>> {
>> 	struct io_uring_task *tctx = req->task->io_uring;
>> 	struct io_ring_ctx *ctx = req->ctx;
>> 	struct llist_node *node;
>>
>> 	if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
>> 		io_req_local_work_add(req);
>> 		return;
>> 	}
>> 	....
>>
>> To confirm, I commented that out in t/io_uring and it runs fine.
>> Please see if that changes anything for you? I will try to find the
>> actual fix tomorrow.
>
> Ah gotcha, yes that actually makes a lot of sense. I wonder if regular
> polling is then also broken without poll queues if
> IORING_SETUP_DEFER_TASKRUN is set. It should be, I'll check into
> io_iopoll_check().

A mix of fixes and just cleanups, here's what I got.

-- 
Jens Axboe
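
For reference, here is a minimal user-space sketch of the ring setup
combination the stall above involves; it is not taken from the series or
from t/io_uring, and setup_passthru_ring is a made-up name. NVMe uring
passthrough needs the 128-byte SQE and 32-byte CQE variants, and the hang
appears once IORING_SETUP_DEFER_TASKRUN is layered on top of
IORING_SETUP_IOPOLL while the nvme driver has no poll queues configured.

/*
 * Hypothetical illustration only: build an IOPOLL ring suitable for NVMe
 * uring passthrough, with deferred task running enabled on top.
 */
#include <liburing.h>

static int setup_passthru_ring(struct io_uring *ring, unsigned entries)
{
	struct io_uring_params p = { };

	p.flags = IORING_SETUP_IOPOLL | IORING_SETUP_SQE128 |
		  IORING_SETUP_CQE32 | IORING_SETUP_SINGLE_ISSUER |
		  IORING_SETUP_DEFER_TASKRUN;

	/* returns 0 on success, -errno on failure */
	return io_uring_queue_init_params(entries, ring, &p);
}

IORING_SETUP_SINGLE_ISSUER is included because the kernel requires it when
IORING_SETUP_DEFER_TASKRUN is set.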
From 50155186644a352b290b72c61e738f62640d566a Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@xxxxxxxxx>
Date: Fri, 2 Sep 2022 15:16:29 -0600
Subject: [PATCH 1/3] io_uring: cleanly separate request types for iopoll

After the addition of iopoll support for passthrough, there's a bit of
a mixup here. Clean it up and get rid of the casting for the
passthrough command type.

Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
---
 io_uring/rw.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 9698a789b3d5..3f03b6d2a5a3 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -994,7 +994,7 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 
 	wq_list_for_each(pos, start, &ctx->iopoll_list) {
 		struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list);
-		struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+		struct file *file = req->file;
 		int ret;
 
 		/*
@@ -1006,12 +1006,15 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 			break;
 
 		if (req->opcode == IORING_OP_URING_CMD) {
-			struct io_uring_cmd *ioucmd = (struct io_uring_cmd *)rw;
+			struct io_uring_cmd *ioucmd;
 
-			ret = req->file->f_op->uring_cmd_iopoll(ioucmd);
-		} else
-			ret = rw->kiocb.ki_filp->f_op->iopoll(&rw->kiocb, &iob,
-								poll_flags);
+			ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+			ret = file->f_op->uring_cmd_iopoll(ioucmd, poll_flags);
+		} else {
+			struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+
+			ret = file->f_op->iopoll(&rw->kiocb, &iob, poll_flags);
+		}
 		if (unlikely(ret < 0))
 			return ret;
 		else if (ret)
-- 
2.35.1
From 0bc78c843b8636dcdfe45dd07328ca826fa67f9b Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@xxxxxxxxx>
Date: Fri, 2 Sep 2022 15:17:30 -0600
Subject: [PATCH 2/3] nvme: use separate end IO handler for IOPOLL

Don't need to rely on the cookie or request type, set the right handler
based on how we're handling the IO.

Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
---
 drivers/nvme/host/ioctl.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 7756b439a688..f34abe95821e 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -385,25 +385,36 @@ static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd)
 	io_uring_cmd_done(ioucmd, status, result);
 }
 
-static void nvme_uring_cmd_end_io(struct request *req, blk_status_t err)
+static void nvme_uring_iopoll_cmd_end_io(struct request *req, blk_status_t err)
 {
 	struct io_uring_cmd *ioucmd = req->end_io_data;
 	struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
 	/* extract bio before reusing the same field for request */
 	struct bio *bio = pdu->bio;
-	void *cookie = READ_ONCE(ioucmd->cookie);
 
 	pdu->req = req;
 	req->bio = bio;
 
 	/*
 	 * For iopoll, complete it directly.
-	 * Otherwise, move the completion to task work.
 	 */
-	if (cookie != NULL && blk_rq_is_poll(req))
-		nvme_uring_task_cb(ioucmd);
-	else
-		io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb);
+	nvme_uring_task_cb(ioucmd);
+}
+
+static void nvme_uring_cmd_end_io(struct request *req, blk_status_t err)
+{
+	struct io_uring_cmd *ioucmd = req->end_io_data;
+	struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+	/* extract bio before reusing the same field for request */
+	struct bio *bio = pdu->bio;
+
+	pdu->req = req;
+	req->bio = bio;
+
+	/*
+	 * Move the completion to task work.
+	 */
+	io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb);
 }
 
 static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
@@ -464,7 +475,10 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 			blk_flags);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
-	req->end_io = nvme_uring_cmd_end_io;
+	if (issue_flags & IO_URING_F_IOPOLL)
+		req->end_io = nvme_uring_iopoll_cmd_end_io;
+	else
+		req->end_io = nvme_uring_cmd_end_io;
 	req->end_io_data = ioucmd;
 
 	if (issue_flags & IO_URING_F_IOPOLL && rq_flags & REQ_POLLED) {
-- 
2.35.1
From 9daa39b146f3a8f412196df5eb9f9686f308e5cc Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@xxxxxxxxx>
Date: Fri, 2 Sep 2022 15:18:05 -0600
Subject: [PATCH 3/3] fs: add batch and poll flags to the uring_cmd_iopoll() handler

We need the poll_flags to know how to poll for the IO, and we should
have the batch structure in preparation for supporting batched
completions with iopoll.

Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
---
 drivers/nvme/host/ioctl.c | 12 ++++++++----
 drivers/nvme/host/nvme.h  |  6 ++++--
 include/linux/fs.h        |  3 ++-
 io_uring/rw.c             |  3 ++-
 4 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index f34abe95821e..7a0b12ef49ae 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -637,7 +637,9 @@ int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
 	return nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
 }
 
-int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd)
+int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd,
+				 struct io_comp_batch *iob,
+				 unsigned int poll_flags)
 {
 	struct bio *bio;
 	int ret = 0;
@@ -650,7 +652,7 @@ int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd)
 			struct nvme_ns, cdev);
 	q = ns->queue;
 	if (test_bit(QUEUE_FLAG_POLL, &q->queue_flags) && bio && bio->bi_bdev)
-		ret = bio_poll(bio, NULL, 0);
+		ret = bio_poll(bio, iob, poll_flags);
 	rcu_read_unlock();
 	return ret;
 }
@@ -736,7 +738,9 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
 	return ret;
 }
 
-int nvme_ns_head_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd)
+int nvme_ns_head_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd,
+				      struct io_comp_batch *iob,
+				      unsigned int poll_flags)
 {
 	struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
 	struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
@@ -752,7 +756,7 @@ int nvme_ns_head_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd)
 		q = ns->queue;
 		if (test_bit(QUEUE_FLAG_POLL, &q->queue_flags) && bio && bio->bi_bdev)
-			ret = bio_poll(bio, NULL, 0);
+			ret = bio_poll(bio, iob, poll_flags);
 		rcu_read_unlock();
 	}
 	srcu_read_unlock(&head->srcu, srcu_idx);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index fdcbc93dea21..216acbe953b3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -821,8 +821,10 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
 		unsigned long arg);
 long nvme_dev_ioctl(struct file *file, unsigned int cmd,
 		unsigned long arg);
-int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd);
-int nvme_ns_head_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd);
+int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd,
+		struct io_comp_batch *iob, unsigned int poll_flags);
+int nvme_ns_head_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd,
+		struct io_comp_batch *iob, unsigned int poll_flags);
 int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd,
 		unsigned int issue_flags);
 int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d6badd19784f..01681d061a6a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2132,7 +2132,8 @@ struct file_operations {
 				   loff_t len, unsigned int remap_flags);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
 	int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
-	int (*uring_cmd_iopoll)(struct io_uring_cmd *ioucmd);
+	int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
+				unsigned int poll_flags);
 } __randomize_layout;
 
 struct inode_operations {
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 3f03b6d2a5a3..4a061326c664 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1009,7 +1009,8 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 			struct io_uring_cmd *ioucmd;
 
 			ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
-			ret = file->f_op->uring_cmd_iopoll(ioucmd, poll_flags);
+			ret = file->f_op->uring_cmd_iopoll(ioucmd, &iob,
+							   poll_flags);
 		} else {
 			struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 
-- 
2.35.1
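
To show the shape of the new hook from a driver's point of view, here is a
hedged sketch with made-up foo_* names; it is not part of the patches
above. It mirrors the nvme pattern: the submission path stashes the
request's bio in ioucmd->cookie, and the iopoll hook passes the batch and
poll flags straight through to bio_poll().

/*
 * Hypothetical driver-side sketch of wiring up the three-argument
 * ->uring_cmd_iopoll() alongside ->uring_cmd(). The bio is assumed to
 * have been stored in ioucmd->cookie at submission time.
 */
static int foo_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd,
				    struct io_comp_batch *iob,
				    unsigned int poll_flags)
{
	struct bio *bio = READ_ONCE(ioucmd->cookie);

	if (bio && bio->bi_bdev)
		return bio_poll(bio, iob, poll_flags);
	return 0;
}

static const struct file_operations foo_chr_fops = {
	.owner			= THIS_MODULE,
	.uring_cmd		= foo_chr_uring_cmd,		/* hypothetical */
	.uring_cmd_iopoll	= foo_chr_uring_cmd_iopoll,
};

The real nvme hooks above additionally check QUEUE_FLAG_POLL under RCU
before polling; the sketch leaves that out for brevity.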