Re: [PATCH 1/2] blk-mq: add requests in the tail of hctx->dispatch

Ming Lei <ming.lei@xxxxxxxxxx> · Thu, 31 Aug 2017 00:58:53 +0800

On Wed, Aug 30, 2017 at 09:51:31AM -0600, Jens Axboe wrote:
> On 08/30/2017 09:39 AM, Ming Lei wrote:
> > On Wed, Aug 30, 2017 at 09:22:42AM -0600, Jens Axboe wrote:
> >> On 08/30/2017 09:19 AM, Ming Lei wrote:
> >>> It is more reasonable to add requests to ->dispatch in way
> >>> of FIFO style, instead of LIFO style.
> >>>
> >>> Also in this way, we can allow to insert request at the front
> >>> of hw queue, which function is needed to fix one bug
> >>> in blk-mq's implementation of blk_execute_rq()
> >>>
> >>> Reported-by: Oleksandr Natalenko <oleksandr@xxxxxxxxxxxxxx>
> >>> Tested-by: Oleksandr Natalenko <oleksandr@xxxxxxxxxxxxxx>
> >>> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> >>> ---
> >>>  block/blk-mq-sched.c | 2 +-
> >>>  block/blk-mq.c       | 2 +-
> >>>  2 files changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> >>> index 4ab69435708c..8d97df40fc28 100644
> >>> --- a/block/blk-mq-sched.c
> >>> +++ b/block/blk-mq-sched.c
> >>> @@ -272,7 +272,7 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
> >>>  	 * the dispatch list.
> >>>  	 */
> >>>  	spin_lock(&hctx->lock);
> >>> -	list_add(&rq->queuelist, &hctx->dispatch);
> >>> +	list_add_tail(&rq->queuelist, &hctx->dispatch);
> >>>  	spin_unlock(&hctx->lock);
> >>>  	return true;
> >>>  }
> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>> index 4603b115e234..fed3d0c16266 100644
> >>> --- a/block/blk-mq.c
> >>> +++ b/block/blk-mq.c
> >>> @@ -1067,7 +1067,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
> >>>  		blk_mq_put_driver_tag(rq);
> >>>  
> >>>  		spin_lock(&hctx->lock);
> >>> -		list_splice_init(list, &hctx->dispatch);
> >>> +		list_splice_tail_init(list, &hctx->dispatch);
> >>>  		spin_unlock(&hctx->lock);
> >>
> >> I'm not convinced this is safe, there's actually a reason why the
> >> request is added to the front and not the back. We do have
> >> reorder_tags_to_front() as a safe guard, but I'd much rather get rid of
> > 
> > reorder_tags_to_front() is for reordering the requests in current list,
> > this patch is for splicing list into hctx->dispatch, so I can't see
> > it isn't safe, or could you explain it a bit?
> 
> If we can get the ordering right, then down the line we won't need to
> have the tags reordering at all. It's an ugly hack that I'd love to see
> go away.

If reorder_tags_to_front() isn't removed, this patch is still safe.

But blk_execute_rq_nowait() need to add one request in the front
of hw queue, that can be a contradiction compared with maintaining
a perfect order for removing reorder_tags_to_front().

So could you share your opinion on the 2nd patch for fixing
blk_execute_rq_nowait()?

> 
> >> that than make this change.
> >>
> >> What's your reasoning here? Your changelog doesn't really explain why
> > 
> > Firstly the 2nd patch need to add one rq(such as RQF_PM) to the
> > front of the hw queue, the simple way is to add it to the front
> > of hctx->dispatch. Without this change, the 2nd patch can't work
> > at all.
> > 
> > Secondly this way is still reasonable:
> > 
> > 	- one rq is added to hctx->dispatch because queue is busy
> > 	- another rq is added to hctx->dispatch too because of same reason
> >
> > so it is reasonable to to add list into hctx->dispatch in FIFO style.
> 
> Not disagreeing with the logic. But it also begs the question of why we
> don't apply the same treatment to when we splice leftovers to the
> dispatch list, currently we front splice that.
> 
> All I'm saying is that you need to tread very carefully with this, and
> throw it through some careful testing to ensure that we don't introduce
> conditions that now livelock. NVMe is the easy test case, that will

Yes, ->dispatch is far away from NVMe, but friends of SCSI-MQ.

> generally always work since we never run out of tags. The problematic
> test case is usually things like SATA with 31 tags, and especially SATA
> with flushes that don't queue. One good test case is the one where you
> end up having all tags (or almost all) consumed by flushes, and still
> ensuring that we're making forward progress.

Understood!

Even we can make the test more aggressive.

I just setup one virtio-scsi by changing both 'can_queue' and
'cmd_per_lun' as 1, that means the hw queue depth is 1, and
hw queue number is set as 1.

Then I run 'dbench -t 30 -s -F 64' in ext4 which is over
this virtio-scsi device.

The dbench(64 jobs, sync write, fsync) just works fine with
this patch applied.

-- 
Ming