Re: blk-mq: introduce BLK_STS_DEV_RESOURCE

Mike Snitzer <snitzer@xxxxxxxxxx> · Sat, 20 Jan 2018 22:52:01 -0500

On Sat, Jan 20 2018 at  7:57pm -0500,
Ming Lei <ming.lei@xxxxxxxxxx> wrote:

> On Sat, Jan 20, 2018 at 12:30:02PM -0500, Mike Snitzer wrote:
> > On Sat, Jan 20 2018 at  8:48am -0500,
> > Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> > 
> > > This status is returned from driver to block layer if device related
> > > resource is run out of, but driver can guarantee that IO dispatch is
> > > triggered in future when the resource is available.
> > > 
> > > This patch converts some drivers to use this return value. Meantime
> > > if driver returns BLK_STS_RESOURCE and S_SCHED_RESTART is marked, run
> > > queue after 10ms for avoiding IO hang.
> > > 
> > > Suggested-by: Jens Axboe <axboe@xxxxxxxxx>
> > > Cc: Mike Snitzer <snitzer@xxxxxxxxxx>
> > > Cc: Laurence Oberman <loberman@xxxxxxxxxx>
> > > Cc: Bart Van Assche <bart.vanassche@xxxxxxxxxxx>
> > > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> > > ---
> > >  block/blk-core.c             |  1 +
> > >  block/blk-mq.c               | 20 ++++++++++++++++----
> > >  drivers/block/null_blk.c     |  2 +-
> > >  drivers/block/virtio_blk.c   |  2 +-
> > >  drivers/block/xen-blkfront.c |  2 +-
> > >  drivers/scsi/scsi_lib.c      |  6 +++---
> > >  include/linux/blk_types.h    |  7 +++++++
> > >  7 files changed, 30 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index 01f271d40825..6e97e0bf8178 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -1226,7 +1226,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> > >  		}
> > >  
> > >  		ret = q->mq_ops->queue_rq(hctx, &bd);
> > > -		if (ret == BLK_STS_RESOURCE) {
> > > +		if ((ret == BLK_STS_RESOURCE) ||
> > > +				(ret == BLK_STS_DEV_RESOURCE)) {
> > >  			/*
> > >  			 * If an I/O scheduler has been configured and we got a
> > >  			 * driver tag for the next request already, free it
> > 
> > Just a nit, but this should be on one line.
> 
> It is too long, and my editor starts to highlight/complain it, :-)

Look at the lines immediately following it, your isn't longer than
them..

> > > @@ -1764,6 +1775,7 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
> > >  		*cookie = new_cookie;
> > >  		break;
> > >  	case BLK_STS_RESOURCE:
> > > +	case BLK_STS_DEV_RESOURCE:
> > >  		__blk_mq_requeue_request(rq);
> > >  		break;
> > >  	default:
> > 
> > It seems the strategy for BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE is
> > too muddled: calling __blk_mq_requeue_request() for both will cause
> > underlying blk-mq driver to retain the request, won't it?
> 
> blk_mq_request_issue_directly() is used by driver(dm-rq) on underlying
> queue, and driver need to deal with underlying queue busy, now we simply
> free the (underlying)request and feedback the busy status to blk-mq via
> dm-rq.
> 
> Except for blk_mq_request_issue_directly(), this request need to be
> requeued, and is retained by blk-mq in hctx->dispatch_list.
> 
> The difference is that if driver returns BLK_STS_DEV_RESOURCE, the queue
> will be rerun when resource is available, so don't need to run queue after
> a delay for avoiding IO hang explicitly.

Yes, I understand the intent.

> > > @@ -1826,7 +1838,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > >  	hctx_lock(hctx, &srcu_idx);
> > >  
> > >  	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
> > > -	if (ret == BLK_STS_RESOURCE)
> > > +	if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))
> > >  		blk_mq_sched_insert_request(rq, false, true, false);
> > >  	else if (ret != BLK_STS_OK)
> > >  		blk_mq_end_request(rq, ret);
> > 
> > For this normal (non dm-mpath) case the request gets re-inserted;
> > dm-mpath must avoid that.
> > 
> > But with dm-mpath, which instead uses blk_mq_request_issue_directly(),
> > we're driving IO with stacked blk-mq drivers.  If the underlying blk-mq 
> > driver (e.g. scsi-mq or nvme) is made to retain the request, using
> > __blk_mq_issue_directly()'s call to __blk_mq_requeue_request() above,
> > then dm-mpath will not have the ability to requeue the request without
> > conflicting with the underlying blk-mq driver, will it?
> 
> No, as I explained, the exception is blk_mq_request_issue_directly(),
> and now dm-rq simply frees it(and in my original version, this request
> is cached for underlying queue, and reused in next dispatch), for others,
> the request is retained in hctx->dispatch_list, and owned by blk-mq.
>
> > Or am I'm misunderstanding what __blk_mq_requeue_request() is doing?

OK I was misunderstanding __blk_mq_requeue_request().  Seems
__blk_mq_requeue_request() is effectively resetting a request for
reuse.  Not retaining the request for reissue.

> > dm_mq_queue_rq
> > -> multipath_clone_and_map
> >    -> blk_get_request (scsi_mq)
> >       -> if error, dm-mpath conditionally requeues (w/ or w/o delay)
> 
> Yes, with this patch, most of times blk-mq will run queue w/ delay
> because SCHED_RESTART is set after the 1st STS_RESOURCE from dm-rq
> .queue_rq()
> 
> >       -> if BLK_STS_OK then blk_mq_request_issue_directly() gets called
> > -> dm_dispatch_clone_request
> >    -> blk_mq_request_issue_directly
> >       -> __blk_mq_try_issue_directly
> >          -> __blk_mq_issue_directly
> >             -> q->mq_ops->queue_rq (this is the underlying scsi_mq)
> >                -> a BLK_STS_RESOURCE return here is how Bart was able to cause stalls
> 
> The stall only happens when SCHED_RESTART is set and the dm-rq queue is
> idle(no any in-flight requests), that is exactly what this patch tries to
> address as suggested by Jens.
> 
> >             -> __blk_mq_requeue_request, if BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE **1
> >    -> (return from blk_mq_request_issue_directly)
> >    -> if BLK_STS_RESOURCE, the dm-mpath request is released using blk_put_request();
> >                            and DM_MAPIO_REQUEUE is returned to dm_mq_queue_rq **2
> 
> Right.
> 
> > -> if DM_MAPIO_REQUEUE return from map_request()'s call to dm_dispatch_clone_request:
> >    BLK_STS_RESOURCE is returned from dm-mpath's dm_mq_queue_rq
> 
> Right.
> 
> > 
> > The redundant queueing (both to underlying blk-mq at **1 above, and
> > upper layer blk-mq at **2 above) is what I'm concerned about.
> > 
> > Hope this is clear.
> 
> Yeah, it is quite clear.

Well I thought there was a problem with __blk_mq_issue_directly calling
__blk_mq_requeue_request for dm-rq on scsi-mq.. guess not.

> I also have other dm-mpath specific questions:
> 
> 1) when any STS_RESOURCE is returned from underlying queue either
> because of blk_get_request() or underlying .queue_rq() for a while,
> will dm-mpath try to switch to other path?

If dm_requeue_original_request() is used, and then we reenter
multipath_clone_and_map() then a new path may be selected.  Depending on
whether there are multiple paths and which path selector is being used.

> 2) what is the basic path switch policy of dm-mpath?

It is selectable, there are 3:
round-robin, queue-length and service-time

A path will be failed in multipath_end_io() if IO returns with a
retryable error (as determined by blk_path_error()).  This ensures the
path that experienced the failure will not be selected again until
reinstated.

> 3) is it possible to move the check of 'ti->type->busy' into
> .clone_and_map_rq()? if it is possible, this way might be more
> effective to detect underlying queue busy.

Can you elaborate further why it'd make a difference?

It'd be somewhat odd to push that busy check down into
.clone_and_map_rq().

Are you looking to check the selected path's blk-mq queue before calling
blk_get_request()?  Why is that any different than blk_get_request()
returning busy?

> Actually this patch may has another issue: if the default run queue
> delay(in this patch, it is 10ms) is too short, the timer may expire
> before any in-flight underlying request completes, then we may
> dequeue too quick, and sequential IO performance can be hurt too.

That's the problem with any arbitrary timer based solution.  No value
that you pick is going to be correct for all hardware.

> But my previous patch in github doesn't have this issue.
> 
> 	https://github.com/ming1/linux/commit/dfd672c998283a110247152237a9916b8264f3ec
> 
> Jens, what do you think of this issue? Or do we need to worry about
> it?

As your follow-up reply pointed out, blk_get_request_notify() is likely
the better way forward to address this race.

Mike