On Wed, Sep 20 2017 at 2:12pm -0400,
Bart Van Assche <bart.vanassche@xxxxxxx> wrote:

> Instead of retrying request allocation after a delay if a path is
> busy, wait until the underlying path has completed a request. This
> patch avoids delaying the submission of requests to busy paths and
> hence creates more opportunities for merging sequential I/O requests.
>
> Reported-by: Ming Lei <ming.lei@xxxxxxxxxx>
> Signed-off-by: Bart Van Assche <bart.vanassche@xxxxxxx>
> Cc: Christoph Hellwig <hch@xxxxxx>
> Cc: Hannes Reinecke <hare@xxxxxxxx>
> Cc: Jens Axboe <axboe@xxxxxxxxx>
> Cc: Ming Lei <ming.lei@xxxxxxxxxx>
> ---
>  drivers/md/dm-mpath.c | 10 +++++++++-
>  drivers/md/dm-rq.c    |  3 ++-
>  2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 11f273d2f018..b60ef655fa0c 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -495,7 +495,13 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
>
>  	bdev = pgpath->path.dev->bdev;
>  	q = bdev_get_queue(bdev);
> -	clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
> +	/*
> +	 * For blk-mq this code is called from inside dm_mq_queue_rq() and
> +	 * sleeping is allowed due to BLK_MQ_F_BLOCKING. For the legacy block
> +	 * layer this code is called from workqueue context. Hence unlocking
> +	 * and reacquiring the queue lock is not necessary.
> +	 */
> +	clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_NOIO);
>  	if (IS_ERR(clone)) {
>  		/* EBUSY, ENODEV or EWOULDBLOCK: requeue */
>  		bool queue_dying = blk_queue_dying(q);

It concerns me that this would allow one or more badly behaving
underlying paths to sabotage the entire dm-multipath device: the
submitter would block waiting for a request/tag on a potentially
pathologically saturated and/or substandard path. That is of
particular concern when using the round-robin path selector.
> @@ -504,6 +510,8 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
>  		if (queue_dying) {
>  			atomic_inc(&m->pg_init_in_progress);
>  			activate_or_offline_path(pgpath);
> +		} else {
> +			WARN_ON_ONCE(true);
>  		}
>  		return DM_MAPIO_DELAY_REQUEUE;
>  	}

There is no reason for this WARN_ON_ONCE to be part of this patch;
in fact, I'm really not seeing the point of it at all.

> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index eadfcfd106ff..f25a3cdb7f84 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -789,7 +789,8 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
>  	md->tag_set->ops = &dm_mq_ops;
>  	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
>  	md->tag_set->numa_node = md->numa_node_id;
> -	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
> +	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE |
> +		BLK_MQ_F_BLOCKING;
>  	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
>  	md->tag_set->driver_data = md;
>
> --
> 2.14.1
>

This type of change needs to be tested carefully. It is very
fundamental and takes us further and further away from avoiding
deadlocks/stalls. Have you run this change through your IB SRP test
harness? I can run it through the mptest suite, but that falls well
short of real-world destructive multipath testing in the face of
heavy IO.

Mike

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel