Hi Hannes,

Thanks for taking a crack at the issue. My comments below.

On Tue, 4 Apr 2017, 5:07am, Hannes Reinecke wrote:

> Most legacy HBAs have a tagset per HBA, not per queue. To map
> these devices onto block-mq this patch implements a new tagset
> flag BLK_MQ_F_GLOBAL_TAGS, which will cause the tag allocator
> to use just one tagset for all hardware queues.
>
> Signed-off-by: Hannes Reinecke <hare@xxxxxxxx>
> ---
>  block/blk-mq-tag.c     | 12 ++++++++----
>  block/blk-mq.c         | 10 ++++++++--
>  include/linux/blk-mq.h |  1 +
>  3 files changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index e48bc2c..a14e76c 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -276,9 +276,11 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
>  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>  		busy_tag_iter_fn *fn, void *priv)
>  {
> -	int i;
> +	int i, lim = tagset->nr_hw_queues;
>
> -	for (i = 0; i < tagset->nr_hw_queues; i++) {
> +	if (tagset->flags & BLK_MQ_F_GLOBAL_TAGS)
> +		lim = 1;
> +	for (i = 0; i < lim; i++) {
>  		if (tagset->tags && tagset->tags[i])
>  			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
>  	}
> @@ -287,12 +289,14 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>
>  int blk_mq_reinit_tagset(struct blk_mq_tag_set *set)
>  {
> -	int i, j, ret = 0;
> +	int i, j, ret = 0, lim = set->nr_hw_queues;
>
>  	if (!set->ops->reinit_request)
>  		goto out;
>
> -	for (i = 0; i < set->nr_hw_queues; i++) {
> +	if (set->flags & BLK_MQ_F_GLOBAL_TAGS)
> +		lim = 1;
> +	for (i = 0; i < lim; i++) {
>  		struct blk_mq_tags *tags = set->tags[i];
>
>  		for (j = 0; j < tags->nr_tags; j++) {
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 159187a..db96ed0 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2061,6 +2061,10 @@ static bool __blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, int hctx_idx)
>  {
>  	int ret = 0;
>
> +	if ((set->flags & BLK_MQ_F_GLOBAL_TAGS) && hctx_idx != 0) {
> +		set->tags[hctx_idx] = set->tags[0];
> +		return true;
> +	}

So, this effectively pins all request allocations to the NUMA locality
of hctx_idx 0, correct? Is this the performance hit you were talking
about in the cover letter?

Do you have any other alternatives in mind? Dynamically
growing/shrinking the tags/request pool in each hctx, with a fixed
base as a start?

One alternative that comes to my mind is to move the divvy-up logic
into SCSI (instead of the LLD doing it), IOW:

1. Have SCSI set tag_set.queue_depth to can_queue / nr_hw_queues.
2. Have blk_mq_unique_tag() (or a new i/f) return
   "hwq * queue_depth + rq->tag" (a rough sketch follows below).

That would make the tags linear in the can_queue space, but could
result in poor use of LLD resources if a given hctx has used up all
its tags.

On a related note, would not the current use of can_queue in SCSI lead
to poor resource utilization in the MQ case? I.e., the block layer
allocates nr_hw_queues * can_queue sets of tags/requests/driver_data
etc., but SCSI limits the number of outstanding requests to can_queue.

BTW, if you would like me to try out this patch on my setup, please
let me know.
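To make alternative (2) above concrete, here is a rough, untested
sketch. blk_mq_linear_tag() is a made-up name for the new i/f, and it
assumes SCSI has already set tag_set.queue_depth to
can_queue / nr_hw_queues:

/*
 * Hypothetical variant of blk_mq_unique_tag() whose result is linear
 * in the can_queue space: hctx 0 owns tags [0, depth), hctx 1 owns
 * [depth, 2 * depth), and so on.  "hwq" is the index of the hardware
 * queue the request was allocated from.
 */
static inline u32 blk_mq_linear_tag(struct request *rq, unsigned int hwq)
{
	unsigned int depth = rq->q->tag_set->queue_depth;

	return hwq * depth + rq->tag;
}

The LLD could then index its single HBA-wide command array directly
with the returned value.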
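To put some numbers (purely illustrative) on the utilization concern:
with can_queue = 1024 and nr_hw_queues = 8, the block layer would
allocate 8 * 1024 = 8192 tag/request/driver_data sets, yet SCSI never
allows more than 1024 requests outstanding, so 7/8 of that allocation
can never be in use at any one time.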
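Also, for my own understanding, this is how I read the intended LLD
usage of the new flag (sketch only; all legacy_hba names are made up):

/* Sketch: an LLD opting in to a single HBA-wide tag space. */
static int legacy_hba_setup_tag_set(struct legacy_hba *hba)
{
	struct blk_mq_tag_set *set = &hba->tag_set;

	memset(set, 0, sizeof(*set));
	set->ops = &legacy_hba_mq_ops;
	set->nr_hw_queues = hba->nr_irq_vectors;
	set->queue_depth = hba->can_queue;	/* depth of the whole HBA */
	set->cmd_size = sizeof(struct legacy_hba_cmd);
	set->numa_node = NUMA_NO_NODE;
	set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_GLOBAL_TAGS;

	return blk_mq_alloc_tag_set(set);
}

I.e., the driver advertises its full HBA depth once and lets blk-mq
share the single tag map across all hardware queues.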
Regards,
-Arun

>  	set->tags[hctx_idx] = blk_mq_alloc_rq_map(set, hctx_idx,
>  			set->queue_depth, set->reserved_tags);
>  	if (!set->tags[hctx_idx])
> @@ -2080,8 +2084,10 @@ static void blk_mq_free_map_and_requests(struct blk_mq_tag_set *set,
>  		unsigned int hctx_idx)
>  {
>  	if (set->tags[hctx_idx]) {
> -		blk_mq_free_rqs(set, set->tags[hctx_idx], hctx_idx);
> -		blk_mq_free_rq_map(set->tags[hctx_idx]);
> +		if (!(set->flags & BLK_MQ_F_GLOBAL_TAGS) || hctx_idx == 0) {
> +			blk_mq_free_rqs(set, set->tags[hctx_idx], hctx_idx);
> +			blk_mq_free_rq_map(set->tags[hctx_idx]);
> +		}
>  		set->tags[hctx_idx] = NULL;
>  	}
>  }
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index b296a90..eee27b016 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -155,6 +155,7 @@ enum {
>  	BLK_MQ_F_DEFER_ISSUE	= 1 << 4,
>  	BLK_MQ_F_BLOCKING	= 1 << 5,
>  	BLK_MQ_F_NO_SCHED	= 1 << 6,
> +	BLK_MQ_F_GLOBAL_TAGS	= 1 << 7,
>  	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
>  	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
>