Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset

Hannes Reinecke <hare@xxxxxxx> · Wed, 13 Nov 2019 15:06:12 +0100

On 11/13/19 2:36 PM, John Garry wrote:
> Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
> multiple reply queues with single hostwide tags.
> 
> In addition, these drivers want to use interrupt assignment in
> pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
> CPU hotplug may cause in-flight IO completion to not be serviced when an
> interrupt is shutdown.
> 
> To solve that problem, Ming's patchset to drain hctx's should ensure no
> IOs are missed in-flight [1].
> 
> However, to take advantage of that patchset, we need to map the HBA HW
> queues to blk mq hctx's; to do that, we need to expose the HBA HW queues.
> 
> In making that transition, the per-SCSI command request tags are no
> longer unique per Scsi host - they are just unique per hctx. As such, the
> HBA LLDD would have to generate this tag internally, which has a certain
> performance overhead.
> 
> However another problem is that blk mq assumes the host may accept
> (Scsi_host.can_queue * #hw queue) commands. In [2], we removed the Scsi
> host busy counter, which would stop the LLDD being sent more than
> .can_queue commands; however, we should still ensure that the block layer
> does not issue more than .can_queue commands to the Scsi host.
> 
> To solve this problem, introduce a shared tags per blk_mq_tag_set, which
> may be requested when allocating the tagset.
> 
> New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
> tagset.
> 
> This is based on work originally from Ming Lei in [3].
> 
> [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@xxxxxxxxxxxxxxxxxxxxxxx/
> [1] https://lore.kernel.org/linux-block/20191014015043.25029-1-ming.lei@xxxxxxxxxx/
> [2] https://lore.kernel.org/linux-scsi/20191025065855.6309-1-ming.lei@xxxxxxxxxx/
> [3] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@xxxxxxxxxx/
> 
> Signed-off-by: John Garry <john.garry@xxxxxxxxxx>
> ---
>  block/blk-core.c       |  1 +
>  block/blk-flush.c      |  2 +
>  block/blk-mq-debugfs.c |  2 +-
>  block/blk-mq-tag.c     | 85 ++++++++++++++++++++++++++++++++++++++++++
>  block/blk-mq-tag.h     |  1 +
>  block/blk-mq.c         | 61 +++++++++++++++++++++++++-----
>  block/blk-mq.h         |  9 +++++
>  include/linux/blk-mq.h |  3 ++
>  include/linux/blkdev.h |  1 +
>  9 files changed, 155 insertions(+), 10 deletions(-)
> 
[ .. ]
> @@ -396,15 +398,17 @@ static struct request *blk_mq_get_request(struct request_queue *q,
>  		blk_mq_tag_busy(data->hctx);
>  	}
>  
> -	tag = blk_mq_get_tag(data);
> -	if (tag == BLK_MQ_TAG_FAIL) {
> -		if (clear_ctx_on_error)
> -			data->ctx = NULL;
> -		blk_queue_exit(q);
> -		return NULL;
> +	if (data->hctx->shared_tags) {
> +		shared_tag = blk_mq_get_shared_tag(data);
> +		if (shared_tag == BLK_MQ_TAG_FAIL)
> +			goto err_shared_tag;
>  	}
>  
> -	rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags, alloc_time_ns);
> +	tag = blk_mq_get_tag(data);
> +	if (tag == BLK_MQ_TAG_FAIL)
> +		goto err_tag;
> +
> +	rq = blk_mq_rq_ctx_init(data, tag, shared_tag, data->cmd_flags, alloc_time_ns);
>  	if (!op_is_flush(data->cmd_flags)) {
>  		rq->elv.icq = NULL;
>  		if (e && e->type->ops.prepare_request) {
Why do you need to keep a parallel tag accounting between 'normal' and
'shared' tags here?
Isn't it sufficient to get a shared tag only, and use that in lieu of the
'real' one?
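
To make the question concrete, here is a minimal userspace sketch of the
single-pool idea: one host-wide set of tags that every hardware queue
allocates from, so the tag is unique per host and the number of
outstanding commands can never exceed the pool depth. This is not blk-mq
code. struct host_tags, host_tag_get(), host_tag_put(), TAG_FAIL and
MAX_TAGS are hypothetical names made up for illustration, and a linear
scan stands in for whatever allocator the block layer would really use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TAG_FAIL (-1)	/* hypothetical stand-in for BLK_MQ_TAG_FAIL */
#define MAX_TAGS 64	/* illustrative cap on the pool depth */

/* Hypothetical host-wide pool: one set of tags for all hardware queues. */
struct host_tags {
	unsigned int depth;	/* models Scsi_Host.can_queue */
	bool busy[MAX_TAGS];
};

static void host_tags_init(struct host_tags *t, unsigned int depth)
{
	unsigned int i;

	t->depth = depth > MAX_TAGS ? MAX_TAGS : depth;
	for (i = 0; i < MAX_TAGS; i++)
		t->busy[i] = false;
}

/*
 * Allocate a tag on behalf of hardware queue 'hwq'. The queue index is
 * deliberately ignored: every queue draws from the same pool, so the
 * returned tag is unique across the whole host.
 */
static int host_tag_get(struct host_tags *t, unsigned int hwq)
{
	unsigned int i;

	(void)hwq;
	for (i = 0; i < t->depth; i++) {
		if (!t->busy[i]) {
			t->busy[i] = true;
			return (int)i;
		}
	}
	return TAG_FAIL;	/* pool exhausted: host is at its limit */
}

/* Release a tag previously returned by host_tag_get(). */
static void host_tag_put(struct host_tags *t, int tag)
{
	assert(tag >= 0 && (unsigned int)tag < t->depth);
	assert(t->busy[tag]);	/* catch double release */
	t->busy[tag] = false;
}
```

With a pool of depth 2, a third allocation fails no matter which queue
asks for it, which is the .can_queue limit the patch description wants to
enforce, without a second per-hctx tag on top.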

I would love to combine both, as then we can easily do a reverse mapping
by using the 'tag' value to look up the command itself, and can possibly
do the 'scsi_cmd_priv' trick of embedding the LLDD-specific parts within
the command. With this split we'll be wasting quite some memory there,
as the possible 'tag' values are actually nr_hw_queues * shared_tags.
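
A rough sketch of the arithmetic behind that concern, under the
assumption stated above that the split scheme leaves nr_hw_queues *
shared_tags possible 'tag' values, each of which needs a preallocated
command with its LLDD-private area. The struct and function names
(struct cmd, struct cmd_priv, priv_bytes_split(), priv_bytes_unified(),
cmd_from_tag()) and the 64-byte private size are hypothetical and only
serve the illustration; they are not the kernel's data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical LLDD-private area embedded in each command. */
struct cmd_priv {
	unsigned char lldd_data[64];
};

/* Hypothetical command, standing in for a request plus its payload. */
struct cmd {
	int tag;
	struct cmd_priv priv;
};

/* Split scheme: every hw queue has its own tag space of 'depth' entries. */
static size_t priv_bytes_split(size_t nr_hw_queues, size_t depth,
			       size_t priv_size)
{
	return nr_hw_queues * depth * priv_size;
}

/* Unified scheme: one host-wide tag space of 'depth' entries. */
static size_t priv_bytes_unified(size_t depth, size_t priv_size)
{
	return depth * priv_size;
}

/*
 * Reverse mapping in the unified scheme: the tag is a direct index into
 * a single host-wide table, so tag -> command is one array lookup.
 */
static struct cmd *cmd_from_tag(struct cmd *table, unsigned int depth,
				int tag)
{
	assert(table != NULL);
	if (tag < 0 || (unsigned int)tag >= depth)
		return NULL;
	return &table[tag];
}
```

For example, with 16 hardware queues, a depth of 4096 and 64 bytes of
private data, the split layout preallocates 16 * 4096 * 64 = 4194304
bytes, against 4096 * 64 = 262144 bytes for a single shared tag space.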

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@xxxxxxx			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer