Hi Ming,
In theory, you could still generate and manage the IPTT in the LLDD by
simply ignoring rq->tag, while enabling SCSI_MQ with 16 hw queues.
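For illustration, a minimal sketch of such a driver-private IPTT pool
built on sbitmap (the iptt_pool/lldd_alloc_iptt names are hypothetical,
not taken from any real driver):

#include <linux/numa.h>
#include <linux/sbitmap.h>

static struct sbitmap_queue iptt_pool;	/* driver-private IPTT space */

static int iptt_pool_init(unsigned int depth)
{
	/* depth = number of HW IPTT slots; shift of -1 lets sbitmap
	 * pick its own word layout */
	return sbitmap_queue_init_node(&iptt_pool, depth, -1, false,
				       GFP_KERNEL, NUMA_NO_NODE);
}

static int lldd_alloc_iptt(unsigned int *cpu)
{
	/* returns a free IPTT in [0, depth), or -1 if exhausted */
	return sbitmap_queue_get(&iptt_pool, cpu);
}

static void lldd_free_iptt(unsigned int iptt, unsigned int cpu)
{
	sbitmap_queue_clear(&iptt_pool, iptt, cpu);
}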
Well, at the moment we can't expose all 16 hw queues to the upper layer
anyway, due to an ordering restriction imposed by the HW on the LLDD. We
have a plan to solve it.
Regardless, we still found performance better by using the rq tag instead
of exposing all 16 queues.
However, I'm not sure how much this way would improve performance, and it
may even degrade IO perf: if 16 hw queues are exposed to blk-mq, up to
16 * .can_queue requests may be queued to the driver (e.g. 65536 in-flight
requests for a hypothetical .can_queue of 4096), and allocation & free on
the single IPTT pool will become a bottleneck.
Right, this IPTT management doesn't scale (especially for our host with
2 sockets @ 48/64 cores each). So if the upper layer is already generating
a tag which we can use, it's good to use it.
Per my experiments on a host-wide tagset, it might be a good tradeoff to
allocate one hw queue per NUMA node to avoid remote access on the dispatch
data/request structures in this case, but your IPTT pool would still be
shared across all CPUs, so maybe you can try the smart sbitmap:
https://www.spinics.net/lists/linux-scsi/msg117920.html
I don't understand this point about allocating a hw queue per node. Surely
having and using all available queues in an intelligent way means less
queue contention, right?
Looking at this change:
@@ -5761,6 +5762,11 @@ static int hpsa_scsi_host_alloc(struct ctlr_info *h)
 static int hpsa_scsi_add_host(struct ctlr_info *h)
 {
 	int rv;
+	/* 256 tags should be high enough to saturate device */
+	int max_queues = DIV_ROUND_UP(h->scsi_host->can_queue, 256);
+
+	/* per NUMA node hw queue */
+	h->scsi_host->nr_hw_queues = min_t(int, nr_node_ids, max_queues);
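(For concreteness, with a hypothetical .can_queue of 1024 on a 2-node
machine, this computes max_queues = DIV_ROUND_UP(1024, 256) = 4 and
nr_hw_queues = min_t(int, 2, 4) = 2, i.e. roughly 512 tags per hw queue.)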
I assume h->scsi_host->nr_hw_queues was zero-initialized previously, so
we're now using > 1 queue, but why the limit?
IFF your device needs different tags for different queues, it can use
the blk_mq_unique_tag() helper to generate a unique global tag.
So this helper can't help, as fundamentally the issue is that the tag
field in struct request is unique per hardware queue but not across all
hw queues.
Indeed, blk_mq_unique_tag() does give a unique global tag, but it cannot
be used for the IPTT, since the values it produces are not dense in the
range [0, can_queue).
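For reference, a minimal sketch of what that helper encodes (this mirrors
the upstream blk_mq_unique_tag()/blk_mq_unique_tag_to_hwq()/
blk_mq_unique_tag_to_tag() helpers; only the show_unique_tag() wrapper is
invented for illustration):

#include <linux/blk-mq.h>
#include <linux/printk.h>

static void show_unique_tag(struct request *rq)
{
	u32 unique = blk_mq_unique_tag(rq);

	/* the hwq index sits in the upper 16 bits and the per-queue tag
	 * in the lower 16, so the value is not dense in [0, can_queue)
	 * and cannot directly index a flat IPTT table */
	pr_info("hwq=%u tag=%u\n",
		blk_mq_unique_tag_to_hwq(unique),
		blk_mq_unique_tag_to_tag(unique));
}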
OTOH, we could expose 16 queues to the upper layer, and drop 6/7, but we
found it performs worse.
We discussed this issue before, but haven't found a good solution yet for
exposing multiple hw queues to blk-mq.
I just think it's unfortunate that enabling blk-mq means that the LLDD
loses this unique tag across all queues in the range
[0, Scsi_host.can_queue), so much so that we found performance better by
not exposing multiple queues and continuing to use the single rq tag...
It isn't a new problem; we discussed it a lot for megaraid_sas, which has
the same situation as yours. You may find it in the linux-block list
archives.
I'll do some digging.
Kashyap Desai did lots of testing on this case.
However, we still get good performance in the case of the 'none' scheduler
with the following patches:
8824f62246be blk-mq: fail the request in case issue failure
6ce3dd6eec11 blk-mq: issue directly if hw queue isn't busy in case of 'none'
I think that these patches would have been included in our testing. I need
to check.
Please switch to the 'none' io scheduler in your test; it has been
observed that IO perf becomes good on megaraid_sas with it.
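For reference, the scheduler can be switched at runtime through sysfs
(sdX standing in for the actual device):

	echo none > /sys/block/sdX/queue/scheduler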
Hmmm... I thought that deadline was preferred.
Thanks,
Ming
Much appreciated,
John