Re: [PATCH] ublk_drv: don't call task_work_add for queueing io commands

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2022/10/25 23:17, Ming Lei wrote:
> On Tue, Oct 25, 2022 at 04:43:56PM +0800, Ziyang Zhang wrote:
>> On 2022/10/25 15:46, Ziyang Zhang wrote:
>>> On 2022/10/25 15:19, Ming Lei wrote:
>>>> On Tue, Oct 25, 2022 at 11:15:57AM +0800, Ziyang Zhang wrote:
>>>>> On 2022/10/24 21:20, Ming Lei wrote:
>>>>>> Hello Ziyang,
>>>>>>
>>>>>> On Mon, Oct 24, 2022 at 05:48:51PM +0800, Ziyang Zhang wrote:
>>>>>>> On 2022/10/23 17:38, Ming Lei wrote:
>>>>>>>> task_work_add() is used for waking ubq daemon task with one batch
>>>>>>>> of io requests/commands queued. However, task_work_add() isn't
>>>>>>>> exported for module code, and it is still debatable if the symbol
>>>>>>>> should be exported.
>>>>>>>>
>>>>>>>> Fortunately we still have io_uring_cmd_complete_in_task() which just
>>>>>>>> can't handle batched wakeup for us.
>>>>>>>>
>>>>>>>> Add one one llist into ublk_queue and call io_uring_cmd_complete_in_task()
>>>>>>>> via current command for running them via task work.
>>>>>>>>
>>>>>>>> This way cleans up current code a lot, meantime allow us to wakeup
>>>>>>>> ubq daemon task after queueing batched requests/io commands.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi, Ming
>>>>>>>
>>>>>>> This patch works and I have run some tests to compare current version(ucmd)
>>>>>>> with your patch(ucmd-batch).
>>>>>>>
>>>>>>> iodepth=128 numjobs=1 direct=1 bs=4k
>>>>>>>
>>>>>>> --------------------------------------------
>>>>>>> ublk loop target, the backend is a file.
>>>>>>> IOPS(k)
>>>>>>>
>>>>>>> type		ucmd		ucmd-batch
>>>>>>> seq-read	54.7		54.2	
>>>>>>> rand-read	52.8		52.0
>>>>>>>
>>>>>>> --------------------------------------------
>>>>>>> ublk null target
>>>>>>> IOPS(k)
>>>>>>>
>>>>>>> type		ucmd		ucmd-batch
>>>>>>> seq-read	257		257
>>>>>>> rand-read	252		253
>>>>>>>
>>>>>>>
>>>>>>> I find that io_req_task_work_add() puts task_work node into a llist
>>>>>>> first, then it may call task_work_add() to run batched task_works. So do we really
>>>>>>> need such llist in ublk_drv? I think io_uring has already considered task_work batch
>>>>>>> optimization.
>>>>>>>
>>>>>>> BTW, task_work_add() in ublk_drv achieves
>>>>>>> higher IOPS(about 5-10% on my machine) than io_uring_cmd_complete_in_task()
>>>>>>> in ublk_drv.
>>>>>>
>>>>>> Yeah, that is same with my observation, and motivation of this patch is
>>>>>> to get same performance with task_work_add by building ublk_drv as
>>>>>> module. One win of task_work_add() is that we get exact batching info
>>>>>> meantime only send TWA_SIGNAL_NO_IPI for whole batch, that is basically
>>>>>> what the patch is doing, but needs help of the following ublksrv patch:
>>>>>>
>>>>>> https://github.com/ming1/ubdsrv/commit/dce6d1d222023c1641292713b311ced01e6dc548
>>>>>>
>>>>>> which sets IORING_SETUP_COOP_TASKRUN for ublksrv's uring, then
>>>>>> io_uring_cmd_complete_in_task will notify via TWA_SIGNAL_NO_IPI, and 5+%
>>>>>> IOPS boost is observed on loop/001 by putting image on SSD in my test
>>>>>> VM.
>>>>>
>>>>> Hi Ming,
>>>>>
>>>>> I have added this ublksrv patch and run the above test again.
>>>>> I have also run ublksrv test: loop/001. Please check them.
>>>>>
>>>>> Intel(R) Xeon(R) Platinum 8369B CPU @ 2.70GHz 16 cores
>>>>> 64GB MEM, CentOS 8, kernel 6.0+
>>>>>
>>>>> --------
>>>>> fio test
>>>>>
>>>>> iodepth=128 numjobs=1 direct=1 bs=4k
>>>>>
>>>>> ucmd: without your kernel patch. Run io_uring_cmd_complete_in_task()
>>>>> for each blk-mq rq.
>>>>>
>>>>> ucmd-batch: with your kernel patch. Run io_uring_cmd_complete_in_task()
>>>>> for the last blk-mq rq.
>>>>>
>>>>> --------------------------------------------
>>>>> ublk loop target, the backend is a file.
>>>>>
>>>>> IOPS(k)
>>>>>
>>>>> type		ucmd		ucmd-batch
>>>>> seq-read	54.1		53.7
>>>>> rand-read	52.0		52.0
>>>>>
>>>>> --------------------------------------------
>>>>> ublk null target
>>>>> IOPS(k)
>>>>>
>>>>> type		ucmd		ucmd-batch
>>>>> seq-read	272		265
>>>>> rand-read	262		260
>>>>>
>>>>> ------------
>>>>> ublksrv test
>>>>>
>>>>> -------------
>>>>> ucmd
>>>>>
>>>>> running loop/001
>>>>>         fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_BZ85U), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
>>>>>         randwrite: jobs 1, iops 66737
>>>>>         randread: jobs 1, iops 64935
>>>>>         randrw: jobs 1, iops read 32694 write 32710
>>>>>         rw(512k): jobs 1, iops read 772 write 819
>>>>>
>>>>> -------------
>>>>> ucmd-batch
>>>>>
>>>>> running loop/001
>>>>>         fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_F56a3), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
>>>>>         randwrite: jobs 1, iops 66720
>>>>>         randread: jobs 1, iops 65015
>>>>>         randrw: jobs 1, iops read 32743 write 32759
>>>>>         rw(512k): jobs 1, iops read 771 write 817
>>>>>
>>>>>
>>>>> It seems that manually putting rqs into llist and calling
>>>>> io_uring_cmd_complete_in_task() while handling the last rq does
>>>>> not improve IOPS much.
>>>>>
>>>>> io_req_task_work_add() puts task_work node into a internal llist
>>>>> first, then it may call task_work_add() to run batched task_works.
>>>>> IMO, io_uring has already done such batch optimization and ublk_drv
>>>>> does not need to add such llist.
>>>>
>>>> The difference is just how batching is handled, looks blk-mq's batch info
>>>> doesn't matter any more. In my test, looks the perf improvement is mainly
>>>> made by enabling IORING_SETUP_COOP_TASKRUN in ublksrv.
>>>
>>> I guess only IORING_SETUP_COOP_TASKRUN helps improve IOPS. The llist in
>>> ublk_drv does not improve IOPS.
>>>
>>>>
>>>> Can you check if enabling IORING_SETUP_COOP_TASKRUN only can reach
>>>> same perf with task_work_add()(ublk_drv is builtin) when building
>>>> ublk_drv as module?
>>>>
>>>
>>> OK.
>>>
>>
>> Intel(R) Xeon(R) Platinum 8369B CPU @ 2.70GHz 16 cores
>> 64GB MEM, CentOS 8, kernel 6.0+
>> with IORING_SETUP_COOP_TASKRUN, without this kernel patch
>>
>> ucmd: io_uring_cmd_complete_in_task(), ublk_drv is a module
>> tw: task_work_add(), ublk is built-in.
>>
>>
>> --------
>> fio test
>>
>> iodepth=128 numjobs=1 direct=1 bs=4k
>>
>> --------------------------------------------
>> ublk loop target, the backend is a file.
>>
>> IOPS(k)
>>
>> type		ucmd		tw
>> seq-read	54.1		53.8
>> rand-read	52.0		52.0
>>
>> --------------------------------------------
>> ublk null target
>> IOPS(k)
>>
>> type		ucmd		tw
>> seq-read	272		286
>> rand-read	262		278
>>
>>
>> ------------
>> ublksrv test
>>
>> -------------
>> ucmd
>>
>> running loop/001
>>         fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_BZ85U), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
>>         randwrite: jobs 1, iops 66737
>>         randread: jobs 1, iops 64935
>>         randrw: jobs 1, iops read 32694 write 32710
>>         rw(512k): jobs 1, iops read 772 write 819
>>
>> running null/001
>>         fio (ublk/null(), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
>>         randwrite: jobs 1, iops 715863
>>         randread: jobs 1, iops 758449
>>         randrw: jobs 1, iops read 357407 write 357183
>>         rw(512k): jobs 1, iops read 5895 write 5875
>>
>> -------------
>> tw
>>
>> running loop/001
>>         fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_pvLTL), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
>>         randwrite: jobs 1, iops 66856
>>         randread: jobs 1, iops 65015
>>         randrw: jobs 1, iops read 32751 write 32767
>>         rw(512k): jobs 1, iops read 776 write 823
>>
>> running null/001
>>         fio (ublk/null(), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
>>         randwrite: jobs 1, iops 739450
>>         randread: jobs 1, iops 787500
>>         randrw: jobs 1, iops read 372956 write 372831
>>         rw(512k): jobs 1, iops read 5798 write 5777
> 
> Looks the gap isn't big between ucmd and tw when running null/001, in
> which the fio io process should saturate the CPU. Probably we
> should avoid to touch 'cmd'/'pdu'/'io' in ublk_queue_rq() since these
> data should be cold at that time.
> 
> Can you apply the following delta patch against the current patch(
> "ublk_drv: don't call task_work_add for queueing io commands") and
> compare with task_work_add()?
> 
> From ecbbf6d10dbc63e279ce0b55c45da6721947f18d Mon Sep 17 00:00:00 2001
> From: Ming Lei <ming.lei@xxxxxxxxxx>
> Date: Tue, 25 Oct 2022 11:01:25 -0400
> Subject: [PATCH] ublk: follow up change
> 
> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> ---
>  drivers/block/ublk_drv.c | 36 ++++++++++++++++++++++--------------
>  1 file changed, 22 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 7963fba66dd1..18db337094c1 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -56,9 +56,12 @@
>  /* All UBLK_PARAM_TYPE_* should be included here */
>  #define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD)
>  
> +struct ublk_rq_data {
> +	struct llist_node node;
> +};
> +
>  struct ublk_uring_cmd_pdu {
>  	struct request *req;
> -	struct llist_node node;
>  };
>  
>  /*
> @@ -693,7 +696,8 @@ static inline void __ublk_rq_task_work(struct request *req)
>  	 *
>  	 * (2) current->flags & PF_EXITING.
>  	 */
> -	if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) {
> +	if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING
> +				|| (io->flags & UBLK_IO_FLAG_ABORTED))) {
>  		__ublk_abort_rq(ubq, req);
>  		return;

We cannot check io->flags & UBLK_IO_FLAG_ABORTED in io_uring task_work. 
We have to check io->flags & UBLK_IO_FLAG_ABORTED in ublk_queue_rq() because
io_uring ctx may be freed at that time and ioucmd task_work is invalid(freed).
Please See comment around it for more detail.

>  	}
> @@ -757,11 +761,12 @@ static void ublk_rqs_task_work_cb(struct io_uring_cmd *cmd)
>  	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
>  	struct ublk_queue *ubq = pdu->req->mq_hctx->driver_data;
>  	struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds);
> +	struct ublk_rq_data *data;
>  
>  	__ublk_rq_task_work(pdu->req);
>  
> -	llist_for_each_entry(pdu, io_cmds, node)
> -		__ublk_rq_task_work(pdu->req);
> +	llist_for_each_entry(data, io_cmds, node)
> +		__ublk_rq_task_work(blk_mq_rq_from_pdu(data));
>  }
>  
>  static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
> @@ -769,9 +774,6 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
>  {
>  	struct ublk_queue *ubq = hctx->driver_data;
>  	struct request *rq = bd->rq;
> -	struct ublk_io *io = &ubq->ios[rq->tag];
> -	struct io_uring_cmd *cmd = io->cmd;
> -	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
>  	blk_status_t res;
>  
>  	/* fill iod to slot in io cmd buffer */
> @@ -805,14 +807,11 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	 * Since releasing/allocating a tag implies smp_mb(), finding UBLK_IO_FLAG_ABORTED
>  	 * guarantees that here is a re-issued request aborted previously.
>  	 */
> -	if (unlikely(ubq_daemon_is_dying(ubq) ||
> -				(io->flags & UBLK_IO_FLAG_ABORTED))) {
> +	if (unlikely(ubq_daemon_is_dying(ubq))) {
>  		__ublk_abort_rq(ubq, rq);
>  		return BLK_STS_OK;
>  	}
>  
> -	pdu->req = rq;
> -
>  	/*
>  	 * Typical multiple producers and single consumer, it is just fine
>  	 * to use llist_add() in producer side and llist_del_all() in
> @@ -821,10 +820,18 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	 * The last command can't be added into list, otherwise it could
>  	 * be handled twice
>  	 */
> -	if (bd->last)
> +	if (bd->last) {
> +		struct ublk_io *io = &ubq->ios[rq->tag];
> +		struct io_uring_cmd *cmd = io->cmd;
> +		struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> +
> +		pdu->req = rq;
>  		io_uring_cmd_complete_in_task(cmd, ublk_rqs_task_work_cb);
> -	else
> -		llist_add(&pdu->node, &ubq->io_cmds);
> +	} else {
> +		struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
> +
> +		llist_add(&data->node, &ubq->io_cmds);
> +	}
>  
>  	return BLK_STS_OK;
>  }
> @@ -1426,6 +1433,7 @@ static int ublk_add_tag_set(struct ublk_device *ub)
>  	ub->tag_set.queue_depth = ub->dev_info.queue_depth;
>  	ub->tag_set.numa_node = NUMA_NO_NODE;
>  	ub->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
> +	ub->tag_set.cmd_size = sizeof(struct ublk_rq_data);
>  	ub->tag_set.driver_data = ub;
>  	return blk_mq_alloc_tag_set(&ub->tag_set);
>  }


Intel(R) Xeon(R) Platinum 8369B CPU @ 2.70GHz 16 cores
64GB MEM, CentOS 8, kernel 6.0+
with IORING_SETUP_COOP_TASKRUN, without this kernel patch

ucmd: io_uring_cmd_complete_in_task(), ublk_drv is a module

ucmd-not-touch-pdu: use llist && do not touch 'cmd'/'pdu'/'io' in ublk_queue_rq()

tw: task_work_add(), ublk is built-in.


--------
fio test

iodepth=128 numjobs=1 direct=1 bs=4k

--------------------------------------------
ublk loop target, the backend is a file.

IOPS(k)

type		ucmd		tw		ucmd-not-touch-pdu
seq-read	54.1		53.8		53.6
rand-read	52.0		52.0		52.0

--------------------------------------------
ublk null target
IOPS(k)

type		ucmd		tw		ucmd-not-touch-pdu
seq-read	272		286		275
rand-read	262		278		269


------------
ublksrv test

-------------
ucmd

running loop/001
        fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_BZ85U), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
        randwrite: jobs 1, iops 66737
        randread: jobs 1, iops 64935
        randrw: jobs 1, iops read 32694 write 32710
        rw(512k): jobs 1, iops read 772 write 819

running null/001
        fio (ublk/null(), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
        randwrite: jobs 1, iops 715863
        randread: jobs 1, iops 758449
        randrw: jobs 1, iops read 357407 write 357183
        rw(512k): jobs 1, iops read 5895 write 5875

-------------
tw

running loop/001
        fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_pvLTL), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
        randwrite: jobs 1, iops 66856
        randread: jobs 1, iops 65015
        randrw: jobs 1, iops read 32751 write 32767
        rw(512k): jobs 1, iops read 776 write 823

running null/001
        fio (ublk/null(), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
        randwrite: jobs 1, iops 739450
        randread: jobs 1, iops 787500
        randrw: jobs 1, iops read 372956 write 372831
        rw(512k): jobs 1, iops read 5798 write 5777

-------------
ucmd-not-touch-pdu

running loop/001
        fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_oH0eG), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
        randwrite: jobs 1, iops 66754
        randread: jobs 1, iops 65032
        randrw: jobs 1, iops read 32776 write 32792
        rw(512k): jobs 1, iops read 772 write 818

running null/001
        fio (ublk/null(), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
        randwrite: jobs 1, iops 725334
        randread: jobs 1, iops 741105
        randrw: jobs 1, iops read 360285 write 360047
        rw(512k): jobs 1, iops read 5770 write 5748

Not touching cmd/pdu/io in ublk_queue_rq() improves IOPS.
But it is worse than using task_work_add().

Regards,
Zhang





[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux