On Wed, Sep 20 2017 at 2:12pm -0400,
Bart Van Assche <bart.vanassche@xxxxxxx> wrote:

> Instead of retrying request allocation after a delay if a path is
> busy, wait until the underlying path has completed a request. This
> patch avoids delaying the submission of requests to busy paths and
> hence creates more opportunities for merging sequential I/O requests.
>
> Reported-by: Ming Lei <ming.lei@xxxxxxxxxx>
> Signed-off-by: Bart Van Assche <bart.vanassche@xxxxxxx>
> Cc: Christoph Hellwig <hch@xxxxxx>
> Cc: Hannes Reinecke <hare@xxxxxxxx>
> Cc: Jens Axboe <axboe@xxxxxxxxx>
> Cc: Ming Lei <ming.lei@xxxxxxxxxx>
> ---
>  drivers/md/dm-mpath.c | 10 +++++++++-
>  drivers/md/dm-rq.c    |  3 ++-
>  2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 11f273d2f018..b60ef655fa0c 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -495,7 +495,13 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
>
>  	bdev = pgpath->path.dev->bdev;
>  	q = bdev_get_queue(bdev);
> -	clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
> +	/*
> +	 * For blk-mq this code is called from inside dm_mq_queue_rq() and
> +	 * sleeping is allowed due to BLK_MQ_F_BLOCKING. For the legacy block
> +	 * layer this code is called from workqueue context. Hence unlocking
> +	 * and reacquiring the queue lock is not necessary.
> +	 */
> +	clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_NOIO);
>  	if (IS_ERR(clone)) {
>  		/* EBUSY, ENODEV or EWOULDBLOCK: requeue */
>  		bool queue_dying = blk_queue_dying(q);

It concerns me that this would allow one or more badly behaving
underlying paths to sabotage the entire dm-multipath device: the
submitter would block waiting for a request/tag on a potentially
pathologically saturated and/or substandard path. That is of
particular concern when using the round-robin path selector.
> @@ -504,6 +510,8 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
>  		if (queue_dying) {
>  			atomic_inc(&m->pg_init_in_progress);
>  			activate_or_offline_path(pgpath);
> +		} else {
> +			WARN_ON_ONCE(true);
>  		}
>  		return DM_MAPIO_DELAY_REQUEUE;
>  	}

There is no reason for this WARN_ON_ONCE to be part of this patch;
in fact, I'm really not seeing the point of it at all.

> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index eadfcfd106ff..f25a3cdb7f84 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -789,7 +789,8 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
>  	md->tag_set->ops = &dm_mq_ops;
>  	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
>  	md->tag_set->numa_node = md->numa_node_id;
> -	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
> +	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE |
> +		BLK_MQ_F_BLOCKING;
>  	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
>  	md->tag_set->driver_data = md;
>
> --
> 2.14.1
>

This type of change needs to be tested carefully. It is very
fundamental and takes us further and further away from avoiding
deadlocks/stalls. Have you run this change through your IB SRP test
harness? I can run it through the mptest suite, but that falls well
short of real-world destructive multipath testing in the face of
heavy IO.

Mike

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel