Hi Folks,

I would like to seek your input on a few topics on SCSI / block multi-queue.

1. Tag# generation.

The context is with SCSI MQ on. My question is: what should an LLD do to get request tag values in the range 0 through can_queue - 1 across *all* of the queues?

In our QLogic 41XXX series of adapters, we have a per-session submit queue, a shared task memory (shared across all queues) and N completion queues (separate MSI-X vectors). We report N as nr_hw_queues. I would like to, if possible, use the block layer tags to index into the above shared task memory area.

From looking at the scsi/block source, it appears that when an LLD reports a value, say #C, in can_queue (via scsi_host_template), that value is used as the max depth when the corresponding block layer queues are created. So, while SCSI restricts the number of outstanding commands to the LLD to #C, the request tag generated on any one of the queues can range from 0 to #C - 1. Please correct me if I got this wrong.

If the above is true, then for an LLD to get tag numbers within its max-tasks range, it has to report max-tasks / number-of-hw-queues in can_queue, and in the I/O path use the tag and hwq# to arrive at an index (a rough sketch of this workaround is in the P.S. at the end of this mail). This, though, leads to poor use of tag resources -- a queue can reach its capacity while the LLD can still take more commands. blk_mq_unique_tag() would not work here, because it just puts the hwq# in the upper 16 bits, which need not fall in the max-tasks range.

Perhaps the current MQ model caters to a queue-pair (submit/completion) kind of hardware; nevertheless, I would like to know how other hardware variants can make use of it.

2. mq vs non-mq performance gain.

This is more like a poll, I guess. I was wondering what performance gains folks are observing with SCSI MQ on. I saw Christoph H.'s slide deck that has one slide showing a 200K IOPS gain. From my testing, though, I was not lucky enough to observe that big a change. In fact, the difference was not even noticeable(*). For example, a 512-byte random read test gave me in the vicinity of 2M IOPS in both cases. By "both cases" I mean one run with scsi_mod's use_blk_mq set to 0 and another with it set to 1 (the LLD is reloaded after switching). I only used one NUMA node for this run. The test was run on an x86_64 setup.

* See item 3 for a special handling.

3. add_random slowness.

One thing I observed with MQ on and off was the block layer tunable add_random (/sys/block/<dev>/queue/add_random), which, as I understand it, tunes the disk's entropy contribution. With non-MQ it is turned on by default, and with MQ it is turned off. This got noticed because, when I was running multi-port testing, there was a big drop in IOPS with and without MQ (~200K IOPS to 1M+ IOPS when the test ran on the same NUMA node / across NUMA nodes). Just wondering why we have it ON in one setting and OFF in the other.

Sorry for the rather long e-mail, but your comments/thoughts are much appreciated.

Regards,
-Arun
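
P.S. To make item 1 concrete, below is roughly what the workaround looks like. This is only a sketch, not code from our driver: MY_MAX_TASKS, my_task_mem and struct my_task are placeholders standing in for our shared task memory, and it assumes can_queue is reported as max-tasks / nr_hw_queues as described above.

#include <linux/blk-mq.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

#define MY_MAX_TASKS	2048			/* placeholder: size of shared task memory */

struct my_task;					/* placeholder: per-command HW context */
extern struct my_task *my_task_mem;		/* placeholder: the shared task array */

/*
 * Setup time: split the shared task memory evenly across the N
 * completion queues so that every (hwq#, tag) pair maps to a unique
 * slot.  This is the split that wastes tags when one queue is busier
 * than the others.
 */
static void my_setup_host(struct Scsi_Host *shost, unsigned int nr_cq)
{
	shost->nr_hw_queues = nr_cq;
	shost->can_queue = MY_MAX_TASKS / nr_cq;
}

/*
 * I/O path: derive the task-memory index from the hw queue number and
 * the per-queue tag.  blk_mq_unique_tag() alone does not help here,
 * since it only ORs the hwq# into the upper 16 bits.
 */
static struct my_task *my_cmd_to_task(struct Scsi_Host *shost,
				      struct scsi_cmnd *sc)
{
	u32 unique = blk_mq_unique_tag(sc->request);
	u16 hwq = blk_mq_unique_tag_to_hwq(unique);	/* upper 16 bits */
	u16 tag = blk_mq_unique_tag_to_tag(unique);	/* 0 .. can_queue - 1 */
	u32 idx = hwq * shost->can_queue + tag;		/* 0 .. MY_MAX_TASKS - 1 */

	return &my_task_mem[idx];
}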