Re: [PATCH] blk-mq: update hctx->dispatch_busy in case of real scheduler

Ming Lei <ming.lei@xxxxxxxxxx> · Tue, 1 Jun 2021 18:01:29 +0800

On Mon, May 31, 2021 at 02:04:34PM +0200, Jan Kara wrote:
> On Mon 31-05-21 09:37:01, Ming Lei wrote:
> > On Fri, May 28, 2021 at 02:26:31PM +0200, Jan Kara wrote:
> > > On Fri 28-05-21 11:20:55, Ming Lei wrote:
> > > > Commit 6e6fcbc27e77 ("blk-mq: support batching dispatch in case of io")
> > > > starts to support io batching submission by using hctx->dispatch_busy.
> > > > 
> > > > However, blk_mq_update_dispatch_busy() isn't changed to update hctx->dispatch_busy
> > > > in that commit, so fix the issue by updating hctx->dispatch_busy in case
> > > > of real scheduler.
> > > > 
> > > > Reported-by: Jan Kara <jack@xxxxxxx>
> > > > Fixes: 6e6fcbc27e77 ("blk-mq: support batching dispatch in case of io")
> > > > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> > > > ---
> > > >  block/blk-mq.c | 3 ---
> > > >  1 file changed, 3 deletions(-)
> > > 
> > > Looks good to me. You can add:
> > > 
> > > Reviewed-by: Jan Kara <jack@xxxxxxx>
> > > 
> > > BTW: Do you plan to submit also your improvement to
> > > __blk_mq_do_dispatch_sched() to update dispatch_busy during the fetching
> > > requests from the scheduler to avoid draining all requests from the IO
> > > scheduler?
> > 
> > I understand that kind of change isn't needed. When more requests are
> > dequeued, hctx->dispatch_busy will be updated, then __blk_mq_do_dispatch_sched()
> > won't dequeue at batch any more if either .queue_rq() returns
> > STS_RESOURCE or running out of driver tag/budget.
> > 
> > Or do you still see related issues after this patch is applied?
> 
> I was suspicious that __blk_mq_do_dispatch_sched() would be still pulling
> requests too aggressively from the IO scheduler (which effectively defeats
> impact of cgroup IO weights on observed throughput). Now I did a few more
> experiments with the workload doing multiple iterations for each kernel and
> comparing ratios of achieved throughput when cgroup weights were in 2:1
> ratio.
> 
> With this patch alone, I've got no significant distinction between IO from
> two cgroups in 4 out of 5 test iterations. With your patch to update
> max_dispatch in __blk_mq_do_dispatch_sched() applied on top the results
> were not significantly different (my previous test result was likely a
> lucky chance). With my original patch to allocate driver tags early in
> __blk_mq_do_dispatch_sched() I get reliable distinction between cgroups -
> the worst ratio from all the iterations is 1.4, average ratio is ~1.75.
> This last result is btw very similar to ratios I can see when using
> virtio-scsi instead of virtio-blk for the backing storage which is kind of
> natural because virtio-scsi ends up using the dispatch-budget logic of SCSI
> subsystem. I'm not saying my patch is the right way to do things but it
> clearly shows that __blk_mq_do_dispatch_sched() is still too aggressive
> pulling requests out of the IO scheduler.

IMO getting driver tag before dequeue should be one good way for this
issue, and that is exactly what the legacy request IO code path did. Not
only address this issue, but also get better leverage between batching
dispatch and IO request merge.

But it can't work by simply adding blk_mq_get_driver_tag() in
__blk_mq_do_dispatch_sched(), and one big problem is that how to re-run
queue in case of running out of getting driver tag. That said we need to
refactor blk_mq_dispatch_rq_list(), such as moving handling of running
out of driver tag into __blk_mq_do_dispatch_sched(). I will investigate a
bit on this change.

Thanks,
Ming