On 11/13/19 2:36 PM, John Garry wrote:
> Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
> multiple reply queues with single hostwide tags.
>
> In addition, these drivers want to use interrupt assignment in
> pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
> CPU hotplug may cause in-flight IO completion to not be serviced when an
> interrupt is shutdown.
>
> To solve that problem, Ming's patchset to drain hctx's should ensure no
> IOs are missed in-flight [1].
>
> However, to take advantage of that patchset, we need to map the HBA HW
> queues to blk mq hctx's; to do that, we need to expose the HBA HW queues.
>
> In making that transition, the per-SCSI command request tags are no
> longer unique per Scsi host - they are just unique per hctx. As such, the
> HBA LLDD would have to generate this tag internally, which has a certain
> performance overhead.
>
> However another problem is that blk mq assumes the host may accept
> (Scsi_host.can_queue * #hw queue) commands. In [2], we removed the Scsi
> host busy counter, which would stop the LLDD being sent more than
> .can_queue commands; however, we should still ensure that the block layer
> does not issue more than .can_queue commands to the Scsi host.
>
> To solve this problem, introduce a shared tags per blk_mq_tag_set, which
> may be requested when allocating the tagset.
>
> New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
> tagset.
>
> This is based on work originally from Ming Lei in [3].
>
> [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@xxxxxxxxxxxxxxxxxxxxxxx/
> [1] https://lore.kernel.org/linux-block/20191014015043.25029-1-ming.lei@xxxxxxxxxx/
> [2] https://lore.kernel.org/linux-scsi/20191025065855.6309-1-ming.lei@xxxxxxxxxx/
> [3] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@xxxxxxxxxx/
>
> Signed-off-by: John Garry <john.garry@xxxxxxxxxx>
> ---
>  block/blk-core.c       |  1 +
>  block/blk-flush.c      |  2 +
>  block/blk-mq-debugfs.c |  2 +-
>  block/blk-mq-tag.c     | 85 ++++++++++++++++++++++++++++++++++++++++++
>  block/blk-mq-tag.h     |  1 +
>  block/blk-mq.c         | 61 +++++++++++++++++++++++++-----
>  block/blk-mq.h         |  9 +++++
>  include/linux/blk-mq.h |  3 ++
>  include/linux/blkdev.h |  1 +
>  9 files changed, 155 insertions(+), 10 deletions(-)
>
[ .. ]
> @@ -396,15 +398,17 @@ static struct request *blk_mq_get_request(struct request_queue *q,
>  		blk_mq_tag_busy(data->hctx);
>  	}
>  
> -	tag = blk_mq_get_tag(data);
> -	if (tag == BLK_MQ_TAG_FAIL) {
> -		if (clear_ctx_on_error)
> -			data->ctx = NULL;
> -		blk_queue_exit(q);
> -		return NULL;
> +	if (data->hctx->shared_tags) {
> +		shared_tag = blk_mq_get_shared_tag(data);
> +		if (shared_tag == BLK_MQ_TAG_FAIL)
> +			goto err_shared_tag;
>  	}
>  
> -	rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags, alloc_time_ns);
> +	tag = blk_mq_get_tag(data);
> +	if (tag == BLK_MQ_TAG_FAIL)
> +		goto err_tag;
> +
> +	rq = blk_mq_rq_ctx_init(data, tag, shared_tag, data->cmd_flags, alloc_time_ns);
>  	if (!op_is_flush(data->cmd_flags)) {
>  		rq->elv.icq = NULL;
>  		if (e && e->type->ops.prepare_request) {

Why do you need to keep a parallel tag accounting between 'normal' and
'shared' tags here?
Isn't it sufficient to get a shared tag only, and use that in lieu of
the 'real' one?

I would love to combine both, as then we can easily do a reverse mapping
by using the 'tag' value to look up the command itself, and can possibly
do the 'scsi_cmd_priv' trick of embedding the LLDD-specific parts within
the command.
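
To illustrate the reverse mapping, here is a rough userspace sketch; it
is not kernel code and not a proposal for the actual implementation. All
demo_* names are made up for this mail. The only thing borrowed from the
kernel is the layout idea behind scsi_cmd_priv(), which returns the
memory directly behind the command. The assumption is a single pool of
can_queue commands per host, indexed by the host-wide tag:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a command; not the kernel's struct scsi_cmnd. */
struct demo_cmd {
	unsigned int tag;	/* host-wide tag, unique across all hw queues */
	unsigned int hwq;	/* hw queue the command happens to be issued on */
};

/* Hypothetical LLDD-specific per-command data. */
struct demo_lldd_priv {
	unsigned int slot;
};

/* One command pool for the whole host, sized by can_queue only. */
struct demo_host {
	unsigned int can_queue;
	size_t cmd_size;	/* command + private area, rounded up */
	unsigned char *pool;
};

/* Private area sits directly behind the command, like scsi_cmd_priv(). */
static void *demo_cmd_priv(struct demo_cmd *cmd)
{
	return cmd + 1;
}

/* Reverse mapping: the tag value itself is the index into the pool. */
static struct demo_cmd *demo_tag_to_cmd(struct demo_host *h, unsigned int tag)
{
	assert(tag < h->can_queue);
	return (struct demo_cmd *)(h->pool + (size_t)tag * h->cmd_size);
}

/* Allocate can_queue commands, each with priv_size bytes embedded. */
static int demo_host_init(struct demo_host *h, unsigned int can_queue,
			  size_t priv_size)
{
	unsigned int i;

	h->can_queue = can_queue;
	/* Round up to 16 bytes so every command in the pool stays aligned. */
	h->cmd_size = (sizeof(struct demo_cmd) + priv_size + 15) & ~(size_t)15;
	h->pool = calloc(can_queue, h->cmd_size);
	if (!h->pool)
		return -1;
	for (i = 0; i < can_queue; i++)
		demo_tag_to_cmd(h, i)->tag = i;
	return 0;
}
```

With that layout a completion path only needs the tag to find both the
command and the LLDD data, without any per-hctx lookup table.
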
With this split we'll be wasting quite some memory there, as the
possible 'tag' values are actually nr_hw_queues * shared_tags.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@xxxxxxx                               +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer