On 11/13/19 3:57 PM, John Garry wrote:
> On 13/11/2019 14:06, Hannes Reinecke wrote:
>> On 11/13/19 2:36 PM, John Garry wrote:
>>> Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
>>> multiple reply queues with single hostwide tags.
>>>
>>> In addition, these drivers want to use interrupt assignment in
>>> pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
>>> CPU hotplug may cause in-flight IO completion to not be serviced when an
>>> interrupt is shutdown.
>>>
>>> To solve that problem, Ming's patchset to drain hctx's should ensure no
>>> IOs are missed in-flight [1].
>>>
>>> However, to take advantage of that patchset, we need to map the HBA HW
>>> queues to blk mq hctx's; to do that, we need to expose the HBA HW
>>> queues.
>>>
>>> In making that transition, the per-SCSI command request tags are no
>>> longer unique per Scsi host - they are just unique per hctx. As such,
>>> the HBA LLDD would have to generate this tag internally, which has a
>>> certain performance overhead.
>>>
>>> However another problem is that blk mq assumes the host may accept
>>> (Scsi_host.can_queue * #hw queue) commands. In [2], we removed the Scsi
>>> host busy counter, which would stop the LLDD being sent more than
>>> .can_queue commands; however, we should still ensure that the block
>>> layer does not issue more than .can_queue commands to the Scsi host.
>>>
>>> To solve this problem, introduce a shared tags per blk_mq_tag_set, which
>>> may be requested when allocating the tagset.
>>>
>>> New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
>>> tagset.
>>>
>>> This is based on work originally from Ming Lei in [3].
>>>
>>> [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@xxxxxxxxxxxxxxxxxxxxxxx/
>>> [1] https://lore.kernel.org/linux-block/20191014015043.25029-1-ming.lei@xxxxxxxxxx/
>>> [2] https://lore.kernel.org/linux-scsi/20191025065855.6309-1-ming.lei@xxxxxxxxxx/
>>> [3] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@xxxxxxxxxx/
>>>
>>> Signed-off-by: John Garry <john.garry@xxxxxxxxxx>
>>> ---
>>>  block/blk-core.c       |  1 +
>>>  block/blk-flush.c      |  2 +
>>>  block/blk-mq-debugfs.c |  2 +-
>>>  block/blk-mq-tag.c     | 85 ++++++++++++++++++++++++++++++++++++++++++
>>>  block/blk-mq-tag.h     |  1 +
>>>  block/blk-mq.c         | 61 +++++++++++++++++++++++++-----
>>>  block/blk-mq.h         |  9 +++++
>>>  include/linux/blk-mq.h |  3 ++
>>>  include/linux/blkdev.h |  1 +
>>>  9 files changed, 155 insertions(+), 10 deletions(-)
>>>
>> [ .. ]
>>> @@ -396,15 +398,17 @@ static struct request *blk_mq_get_request(struct request_queue *q,
>>>  		blk_mq_tag_busy(data->hctx);
>>>  	}
>>> -	tag = blk_mq_get_tag(data);
>>> -	if (tag == BLK_MQ_TAG_FAIL) {
>>> -		if (clear_ctx_on_error)
>>> -			data->ctx = NULL;
>>> -		blk_queue_exit(q);
>>> -		return NULL;
>>> +	if (data->hctx->shared_tags) {
>>> +		shared_tag = blk_mq_get_shared_tag(data);
>>> +		if (shared_tag == BLK_MQ_TAG_FAIL)
>>> +			goto err_shared_tag;
>>>  	}
>>> -	rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags, alloc_time_ns);
>>> +	tag = blk_mq_get_tag(data);
>>> +	if (tag == BLK_MQ_TAG_FAIL)
>>> +		goto err_tag;
>>> +
>>> +	rq = blk_mq_rq_ctx_init(data, tag, shared_tag, data->cmd_flags, alloc_time_ns);
>>>  	if (!op_is_flush(data->cmd_flags)) {
>>>  		rq->elv.icq = NULL;
>>>  		if (e && e->type->ops.prepare_request) {
>
> Hi Hannes,
>
>> Why do you need to keep a parallel tag accounting between 'normal' and
>> 'shared' tags here?
>> Isn't it sufficient to get a shared tag only, and use that in lieu of the
>> 'real' one?
>
> In theory, yes. Just the 'shared' tag should be adequate.
>
> A problem I see with this approach is that we lose the identity of which
> tags are allocated for each hctx. As an example for this, consider
> blk_mq_queue_tag_busy_iter(), which iterates the bits for each hctx.
> Now, if you're just using shared tags only, that wouldn't work.
>
> Consider blk_mq_can_queue() as another example - this tells us if a hctx
> has any bits unset, while with only using shared tags it would tell if
> any bits unset over all queues, and this change in semantics could break
> things. At a glance, function __blk_mq_tag_idle() looks problematic also.
>
> And this is where it becomes messy to implement.
>
Oh, my. Indeed, that's correct.

But then we don't really care _which_ shared tag is assigned; so
wouldn't we be better off by just having an atomic counter here?
Cache locality will be blown anyway ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                   Teamlead Storage & Networking
hare@xxxxxxx                                      +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer
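
To make the atomic counter idea concrete, here is a minimal userspace
sketch in C11. It is not kernel code and not taken from the patch: the
names host_budget, host_budget_get(), host_budget_put() and
host_budget_demo() are hypothetical, and can_queue only stands in for
Scsi_host.can_queue. The assumption is that per-hctx tags stay as they
are, and the counter merely caps the host-wide number of in-flight
commands.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical host-wide budget; can_queue mirrors Scsi_host.can_queue. */
struct host_budget {
	atomic_int in_flight;	/* commands currently issued to the host */
	int can_queue;		/* upper bound across all hw queues */
};

static void host_budget_init(struct host_budget *b, int can_queue)
{
	atomic_init(&b->in_flight, 0);
	b->can_queue = can_queue;
}

/* Reserve one slot; returns false if the host is already at can_queue. */
static bool host_budget_get(struct host_budget *b)
{
	int cur = atomic_load(&b->in_flight);

	/* CAS loop: the counter never exceeds can_queue, even transiently. */
	while (cur < b->can_queue) {
		if (atomic_compare_exchange_weak(&b->in_flight, &cur, cur + 1))
			return true;
		/* A failed CAS reloaded cur; re-check the bound and retry. */
	}
	return false;
}

/* Release a slot previously reserved with host_budget_get(). */
static void host_budget_put(struct host_budget *b)
{
	int prev = atomic_fetch_sub(&b->in_flight, 1);

	assert(prev > 0);	/* catches an unbalanced put */
	(void)prev;
}

/* Two hypothetical hw queues each request 3 slots from a budget of 4. */
static int host_budget_demo(void)
{
	struct host_budget b;
	int granted = 0;

	host_budget_init(&b, 4);
	for (int hctx = 0; hctx < 2; hctx++)
		for (int i = 0; i < 3; i++)
			if (host_budget_get(&b))
				granted++;
	return granted;
}
```

In this sketch a submission path would call host_budget_get() before
taking the per-hctx tag and host_budget_put() on completion or on tag
allocation failure. The single shared counter is a contended cache
line, which is the locality cost mentioned above.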