On Tue, Oct 25, 2022 at 04:43:56PM +0800, Ziyang Zhang wrote:
> On 2022/10/25 15:46, Ziyang Zhang wrote:
> > On 2022/10/25 15:19, Ming Lei wrote:
> >> On Tue, Oct 25, 2022 at 11:15:57AM +0800, Ziyang Zhang wrote:
> >>> On 2022/10/24 21:20, Ming Lei wrote:
> >>>> Hello Ziyang,
> >>>>
> >>>> On Mon, Oct 24, 2022 at 05:48:51PM +0800, Ziyang Zhang wrote:
> >>>>> On 2022/10/23 17:38, Ming Lei wrote:
> >>>>>> task_work_add() is used for waking the ubq daemon task with one batch
> >>>>>> of io requests/commands queued. However, task_work_add() isn't
> >>>>>> exported for module code, and it is still debatable whether the
> >>>>>> symbol should be exported.
> >>>>>>
> >>>>>> Fortunately we still have io_uring_cmd_complete_in_task(), which just
> >>>>>> can't handle batched wakeup for us.
> >>>>>>
> >>>>>> Add one llist into ublk_queue and call io_uring_cmd_complete_in_task()
> >>>>>> via the current command for running them via task work.
> >>>>>>
> >>>>>> This cleans up the current code a lot, and meanwhile allows us to
> >>>>>> wake up the ubq daemon task after queueing batched requests/io
> >>>>>> commands.
> >>>>>>
> >>>>>
> >>>>>
> >>>>> Hi, Ming
> >>>>>
> >>>>> This patch works, and I have run some tests to compare the current
> >>>>> version (ucmd) with your patch (ucmd-batch).
> >>>>>
> >>>>> iodepth=128 numjobs=1 direct=1 bs=4k
> >>>>>
> >>>>> --------------------------------------------
> >>>>> ublk loop target, the backend is a file.
> >>>>> IOPS(k)
> >>>>>
> >>>>> type        ucmd    ucmd-batch
> >>>>> seq-read    54.7    54.2
> >>>>> rand-read   52.8    52.0
> >>>>>
> >>>>> --------------------------------------------
> >>>>> ublk null target
> >>>>> IOPS(k)
> >>>>>
> >>>>> type        ucmd    ucmd-batch
> >>>>> seq-read    257     257
> >>>>> rand-read   252     253
> >>>>>
> >>>>>
> >>>>> I find that io_req_task_work_add() puts the task_work node into a
> >>>>> llist first, then it may call task_work_add() to run batched
> >>>>> task_works. So do we really need such a llist in ublk_drv? I think
> >>>>> io_uring has already considered the task_work batch optimization.
> >>>>>
> >>>>> BTW, task_work_add() in ublk_drv achieves higher IOPS (about 5-10%
> >>>>> on my machine) than io_uring_cmd_complete_in_task() in ublk_drv.
> >>>>
> >>>> Yeah, that matches my observation, and the motivation of this patch is
> >>>> to get the same performance as task_work_add() while building ublk_drv
> >>>> as a module. One win of task_work_add() is that we get exact batching
> >>>> info and meanwhile only send TWA_SIGNAL_NO_IPI for the whole batch;
> >>>> that is basically what the patch is doing, but it needs help from the
> >>>> following ublksrv patch:
> >>>>
> >>>> https://github.com/ming1/ubdsrv/commit/dce6d1d222023c1641292713b311ced01e6dc548
> >>>>
> >>>> which sets IORING_SETUP_COOP_TASKRUN for ublksrv's uring, so
> >>>> io_uring_cmd_complete_in_task() will notify via TWA_SIGNAL_NO_IPI, and
> >>>> a 5+% IOPS boost is observed on loop/001 with the image on SSD in my
> >>>> test VM.
> >>>
> >>> Hi Ming,
> >>>
> >>> I have added this ublksrv patch and run the above test again.
> >>> I have also run the ublksrv test loop/001. Please check them.
> >>>
> >>> Intel(R) Xeon(R) Platinum 8369B CPU @ 2.70GHz, 16 cores
> >>> 64GB MEM, CentOS 8, kernel 6.0+
> >>>
> >>> --------
> >>> fio test
> >>>
> >>> iodepth=128 numjobs=1 direct=1 bs=4k
> >>>
> >>> ucmd: without your kernel patch. Run io_uring_cmd_complete_in_task()
> >>> for each blk-mq rq.
> >>>
> >>> ucmd-batch: with your kernel patch. Run io_uring_cmd_complete_in_task()
> >>> for the last blk-mq rq.
> >>>
> >>> --------------------------------------------
> >>> ublk loop target, the backend is a file.
> >>>
> >>> IOPS(k)
> >>>
> >>> type        ucmd    ucmd-batch
> >>> seq-read    54.1    53.7
> >>> rand-read   52.0    52.0
> >>>
> >>> --------------------------------------------
> >>> ublk null target
> >>> IOPS(k)
> >>>
> >>> type        ucmd    ucmd-batch
> >>> seq-read    272     265
> >>> rand-read   262     260
> >>>
> >>> ------------
> >>> ublksrv test
> >>>
> >>> -------------
> >>> ucmd
> >>>
> >>> running loop/001
> >>>         fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_BZ85U), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> >>>         randwrite: jobs 1, iops 66737
> >>>         randread: jobs 1, iops 64935
> >>>         randrw: jobs 1, iops read 32694 write 32710
> >>>         rw(512k): jobs 1, iops read 772 write 819
> >>>
> >>> -------------
> >>> ucmd-batch
> >>>
> >>> running loop/001
> >>>         fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_F56a3), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> >>>         randwrite: jobs 1, iops 66720
> >>>         randread: jobs 1, iops 65015
> >>>         randrw: jobs 1, iops read 32743 write 32759
> >>>         rw(512k): jobs 1, iops read 771 write 817
> >>>
> >>>
> >>> It seems that manually putting rqs into a llist and calling
> >>> io_uring_cmd_complete_in_task() while handling the last rq does
> >>> not improve IOPS much.
> >>>
> >>> io_req_task_work_add() puts the task_work node into an internal llist
> >>> first, then it may call task_work_add() to run batched task_works.
> >>> IMO, io_uring has already done such batch optimization, and ublk_drv
> >>> does not need to add such a llist.
> >>
> >> The difference is just how batching is handled; it looks like blk-mq's
> >> batch info doesn't matter any more. In my test, the perf improvement
> >> looks mainly made by enabling IORING_SETUP_COOP_TASKRUN in ublksrv.
> >
> > I guess only IORING_SETUP_COOP_TASKRUN helps improve IOPS. The llist in
> > ublk_drv does not improve IOPS.
> >
> >>
> >> Can you check if enabling IORING_SETUP_COOP_TASKRUN alone can reach the
> >> same perf as task_work_add() (ublk_drv built-in) when building ublk_drv
> >> as a module?
> >>
> >
> > OK.
> >
>
> Intel(R) Xeon(R) Platinum 8369B CPU @ 2.70GHz, 16 cores
> 64GB MEM, CentOS 8, kernel 6.0+
> with IORING_SETUP_COOP_TASKRUN, without this kernel patch
>
> ucmd: io_uring_cmd_complete_in_task(), ublk_drv is a module
> tw: task_work_add(), ublk_drv is built-in
>
>
> --------
> fio test
>
> iodepth=128 numjobs=1 direct=1 bs=4k
>
> --------------------------------------------
> ublk loop target, the backend is a file.
>
> IOPS(k)
>
> type        ucmd    tw
> seq-read    54.1    53.8
> rand-read   52.0    52.0
>
> --------------------------------------------
> ublk null target
> IOPS(k)
>
> type        ucmd    tw
> seq-read    272     286
> rand-read   262     278
>
>
> ------------
> ublksrv test
>
> -------------
> ucmd
>
> running loop/001
>         fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_BZ85U), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
>         randwrite: jobs 1, iops 66737
>         randread: jobs 1, iops 64935
>         randrw: jobs 1, iops read 32694 write 32710
>         rw(512k): jobs 1, iops read 772 write 819
>
> running null/001
>         fio (ublk/null(), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
>         randwrite: jobs 1, iops 715863
>         randread: jobs 1, iops 758449
>         randrw: jobs 1, iops read 357407 write 357183
>         rw(512k): jobs 1, iops read 5895 write 5875
>
> -------------
> tw
>
> running loop/001
>         fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_pvLTL), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
>         randwrite: jobs 1, iops 66856
>         randread: jobs 1, iops 65015
>         randrw: jobs 1, iops read 32751 write 32767
>         rw(512k): jobs 1, iops read 776 write 823
>
> running null/001
>         fio (ublk/null(), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
>         randwrite: jobs 1, iops 739450
>         randread: jobs 1, iops 787500
>         randrw: jobs 1, iops read 372956 write 372831
>         rw(512k): jobs 1, iops read 5798 write 5777
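The batching both sides describe above is the classic lock-free
multi-producer, single-consumer llist pattern: producers llist_add()
nodes from any context, and one consumer detaches the whole batch with
llist_del_all(). A minimal kernel-style sketch of just that pattern;
the demo_* names and handle_one() are invented for illustration and are
not the actual io_uring or ublk_drv code:

#include <linux/llist.h>

struct demo_item {
	struct llist_node node;
};

static LLIST_HEAD(demo_list);

static void handle_one(struct demo_item *item)
{
	/* process one queued item */
}

/* Producer side: lock-free, callable from any context.  Returns true
 * if the list was empty, i.e. one wakeup covers the whole batch. */
static bool demo_queue(struct demo_item *item)
{
	return llist_add(&item->node, &demo_list);
}

/* Consumer side: a single task detaches the entire batch at once.
 * Note that llist_del_all() hands the nodes back in LIFO order. */
static void demo_flush(void)
{
	struct llist_node *batch = llist_del_all(&demo_list);
	struct demo_item *item;

	llist_for_each_entry(item, batch, node)
		handle_one(item);
}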
Looks like the gap isn't big between ucmd and tw when running null/001,
where the fio io process should saturate the CPU.

Probably we should avoid touching 'cmd'/'pdu'/'io' in ublk_queue_rq(),
since that data should be cache-cold at that time.
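The delta patch below therefore parks the llist node in blk-mq's
per-request data area, which blk-mq allocates directly behind each
struct request once tag_set.cmd_size is set, so ublk_queue_rq() only
touches memory that is already warm with the request itself. A rough
sketch of just that mechanism; the demo_* names are invented, while
blk_mq_rq_to_pdu()/blk_mq_rq_from_pdu() are the stock blk-mq helpers:

#include <linux/blk-mq.h>
#include <linux/llist.h>

/* Allocated by blk-mq right after each struct request once
 * tag_set.cmd_size = sizeof(struct demo_rq_data) is set. */
struct demo_rq_data {
	struct llist_node node;
};

/* ->queue_rq() side: request -> per-request data is plain pointer
 * arithmetic, no lookup into other (possibly cold) structures. */
static void demo_add_rq(struct request *rq, struct llist_head *list)
{
	struct demo_rq_data *data = blk_mq_rq_to_pdu(rq);

	llist_add(&data->node, list);
}

/* Consumer side: map a llist node back to its request. */
static struct request *demo_node_to_rq(struct llist_node *n)
{
	struct demo_rq_data *data = llist_entry(n, struct demo_rq_data, node);

	return blk_mq_rq_from_pdu(data);
}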
Can you apply the following delta patch against the current patch
("ublk_drv: don't call task_work_add for queueing io commands") and
compare with task_work_add()?

From ecbbf6d10dbc63e279ce0b55c45da6721947f18d Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@xxxxxxxxxx>
Date: Tue, 25 Oct 2022 11:01:25 -0400
Subject: [PATCH] ublk: follow up change

Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
---
 drivers/block/ublk_drv.c | 36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 7963fba66dd1..18db337094c1 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -56,9 +56,12 @@
 /* All UBLK_PARAM_TYPE_* should be included here */
 #define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD)
 
+struct ublk_rq_data {
+	struct llist_node node;
+};
+
 struct ublk_uring_cmd_pdu {
 	struct request *req;
-	struct llist_node node;
 };
 
 /*
@@ -693,7 +696,8 @@ static inline void __ublk_rq_task_work(struct request *req)
 	 *
 	 * (2) current->flags & PF_EXITING.
 	 */
-	if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) {
+	if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING
+				|| (io->flags & UBLK_IO_FLAG_ABORTED))) {
 		__ublk_abort_rq(ubq, req);
 		return;
 	}
@@ -757,11 +761,12 @@ static void ublk_rqs_task_work_cb(struct io_uring_cmd *cmd)
 	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
 	struct ublk_queue *ubq = pdu->req->mq_hctx->driver_data;
 	struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds);
+	struct ublk_rq_data *data;
 
 	__ublk_rq_task_work(pdu->req);
 
-	llist_for_each_entry(pdu, io_cmds, node)
-		__ublk_rq_task_work(pdu->req);
+	llist_for_each_entry(data, io_cmds, node)
+		__ublk_rq_task_work(blk_mq_rq_from_pdu(data));
 }
 
 static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
@@ -769,9 +774,6 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
 {
 	struct ublk_queue *ubq = hctx->driver_data;
 	struct request *rq = bd->rq;
-	struct ublk_io *io = &ubq->ios[rq->tag];
-	struct io_uring_cmd *cmd = io->cmd;
-	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
 	blk_status_t res;
 
 	/* fill iod to slot in io cmd buffer */
@@ -805,14 +807,11 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
 	 * Since releasing/allocating a tag implies smp_mb(), finding UBLK_IO_FLAG_ABORTED
 	 * guarantees that here is a re-issued request aborted previously.
 	 */
-	if (unlikely(ubq_daemon_is_dying(ubq) ||
-				(io->flags & UBLK_IO_FLAG_ABORTED))) {
+	if (unlikely(ubq_daemon_is_dying(ubq))) {
 		__ublk_abort_rq(ubq, rq);
 		return BLK_STS_OK;
 	}
 
-	pdu->req = rq;
-
 	/*
 	 * Typical multiple producers and single consumer, it is just fine
 	 * to use llist_add() in producer side and llist_del_all() in
@@ -821,10 +820,18 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
 	 * The last command can't be added into list, otherwise it could
 	 * be handled twice
 	 */
-	if (bd->last)
+	if (bd->last) {
+		struct ublk_io *io = &ubq->ios[rq->tag];
+		struct io_uring_cmd *cmd = io->cmd;
+		struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
+
+		pdu->req = rq;
 		io_uring_cmd_complete_in_task(cmd, ublk_rqs_task_work_cb);
-	else
-		llist_add(&pdu->node, &ubq->io_cmds);
+	} else {
+		struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
+
+		llist_add(&data->node, &ubq->io_cmds);
+	}
 
 	return BLK_STS_OK;
 }
@@ -1426,6 +1433,7 @@ static int ublk_add_tag_set(struct ublk_device *ub)
 	ub->tag_set.queue_depth = ub->dev_info.queue_depth;
 	ub->tag_set.numa_node = NUMA_NO_NODE;
 	ub->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	ub->tag_set.cmd_size = sizeof(struct ublk_rq_data);
 	ub->tag_set.driver_data = ub;
 	return blk_mq_alloc_tag_set(&ub->tag_set);
 }
-- 
2.31.1


Thanks,
Ming
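P.S. For completeness, the ublksrv side of the above comes down to a
single ring setup flag. A minimal stand-alone liburing sketch; it
assumes liburing 2.2+ and a 5.19+ kernel, and ublksrv's real ring setup
is more involved than this:

#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_params p = { };
	int ret;

	/* With COOP_TASKRUN the kernel queues task work without sending
	 * an IPI (TWA_SIGNAL_NO_IPI); the work runs on the task's next
	 * transition into the kernel instead. */
	p.flags = IORING_SETUP_COOP_TASKRUN;

	ret = io_uring_queue_init_params(64, &ring, &p);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %s\n", strerror(-ret));
		return 1;
	}

	/* ... submit SQEs and reap CQEs as usual ... */

	io_uring_queue_exit(&ring);
	return 0;
}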