On Mon, 3 Apr 2017, 12:29am, Hannes Reinecke wrote:
> On 04/03/2017 08:37 AM, Arun Easi wrote:
> > Hi Folks,
> >
> > I would like to seek your input on a few topics on SCSI / block
> > multi-queue.
> >
> > 1. Tag# generation.
> >
> > The context is with SCSI MQ on. My question is: what should an LLD do
> > to get request tag values in the range 0 through can_queue - 1 across
> > *all* of the queues? In our QLogic 41XXX series of adapters, we have a
> > per-session submit queue, a shared task memory (shared across all
> > queues) and N completion queues (separate MSI-X vectors). We report N
> > as nr_hw_queues. I would like to, if possible, use the block layer
> > tags to index into the above shared task memory area.
> >
> > From looking at the scsi/block source, it appears that when an LLD
> > reports a value, say #C, in can_queue (via scsi_host_template), that
> > value is used as the max depth when the corresponding block layer
> > queues are created. So, while SCSI restricts the number of commands to
> > the LLD to #C, the request tag generated on any of the queues can
> > range from 0..#C-1. Please correct me if I got this wrong.
> >
> > If the above is true, then for an LLD to get a tag# within its
> > max-tasks range, it has to report max-tasks / number-of-hw-queues in
> > can_queue, and, in the I/O path, use the tag and hwq# to arrive at an
> > index# to use. This, though, leads to poor use of tag resources -- a
> > queue can reach its capacity while the LLD could still take more.
> >
> Yep.
>
> > blk_mq_unique_tag() would not work here, because it just puts the
> > hwq# in the upper 16 bits, which need not fall in the max-tasks range.
> >
> > Perhaps the current MQ model is meant to cater to a queue-pair
> > (submit/completion) kind of hardware model; nevertheless I would like
> > to know how other hardware variants can make use of it.
> >
> He. Welcome to the club.
>
> Shared tag sets continue to dog block-mq on 'legacy' (ie non-NVMe)
> HBAs. ATM the only 'real' solution to this problem is indeed to have a
> static split of the entire tag space by the number of hardware queues,
> with the mentioned tag-starvation problem.
>
> If we were to continue with the tag to hardware ID mapping, we would
> need to implement a dynamic tag space mapping onto hardware queues.
> My idea would be to not map the entire tag space, but rather the
> individual bit words, onto the hardware queues. Then we could make the
> mapping dynamic, where the individual words are mapped onto the queues
> only as needed.
> However, the _one_ big problem we're facing here is timeouts.
> With the 1:1 mapping between tags and hardware IDs we can only re-use
> the tag once the timeout is _definitely_ resolved. But this means the
> command stays active, and we cannot call blk_mq_complete() until the
> timeout itself has been resolved.
> With FC this essentially means until the corresponding XIDs are safe to
> re-use, ie after all ABRT/RRQ etc. processing has been completed.
> Hence we totally lose the ability to return the command itself with
> -ETIMEDOUT and continue with I/O processing even though the original
> XID is still being held by the firmware.
>
> In light of this I wonder if it wouldn't be better to completely
> decouple block-layer tags and hardware IDs, and have an efficient
> algorithm mapping the block-layer tags onto hardware IDs.
> That should avoid the arbitrary tag starvation problem, and would allow
> us to handle timeouts efficiently.
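For reference, with the static split the per-command lookup on our side
would end up looking roughly like the snippet below. This is only a sketch
to make the scheme concrete: ql_cmd_to_task_index() is a made-up name, and
it simply assumes can_queue was reported as max-tasks / nr_hw_queues so
that each hwq owns a contiguous slice of the shared task memory.

#include <linux/blk-mq.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

/*
 * Sketch only (made-up helper name): fold (hwq, tag) into an index into
 * the shared task memory, assuming can_queue = max_tasks / nr_hw_queues
 * so that every hardware queue owns a disjoint slice of task slots.
 */
static inline u32 ql_cmd_to_task_index(struct scsi_cmnd *cmd)
{
	u32 unique = blk_mq_unique_tag(cmd->request);
	u16 hwq = blk_mq_unique_tag_to_hwq(unique);	/* 0 .. nr_hw_queues - 1 */
	u16 tag = blk_mq_unique_tag_to_tag(unique);	/* 0 .. can_queue - 1 */

	/* Each hwq owns a contiguous block of can_queue task slots. */
	return (u32)hwq * cmd->device->host->can_queue + tag;
}

That works, but it is exactly the static carve-up with the tag starvation
you mention: one busy queue can exhaust its slice while the firmware still
has plenty of free task slots.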
> Of course, we don't _have_ such an efficient algorithm; maybe it's time
> to have a generic one within the kernel, as quite a few drivers would
> _love_ to just use a generic implementation here.
> (qla2xxx, lpfc, fcoe, mpt3sas etc. all suffer from the same problem)
>
> > 2. mq vs non-mq performance gain.
> >
> > This is more like a poll, I guess. I was wondering what performance
> > gains folks are observing with SCSI MQ on. I saw Christoph H.'s slide
> > deck that has one slide showing a 200K IOPS gain.
> >
> > From my testing, though, I was not lucky enough to observe that big of
> > a change. In fact, the difference was not even noticeable(*). For
> > example, a 512-byte random read test gave me in the vicinity of 2M
> > IOPS in both cases. By both cases, I mean one run with scsi_mod's
> > use_blk_mq set to 0 and another with it set to 1 (the LLD is reloaded
> > after switching). I only used one NUMA node for this run. The test was
> > run on an x86_64 setup.
> >
> You _really_ should have listened to my talk at VAULT.

Would you have a slide deck / minutes that could be shared?

> For 'legacy' HBAs there indeed is not much of a performance gain to be
> had; the max gain is indeed for heavy parallel I/O.

I have multiple devices (I-T nexuses) in my setup, so there definitely is
parallel I/O.

> And there even is a scheduler issue when running with a single
> submission thread; there I've measured a performance _drop_ of up to
> 50%. Which, as Jens claims, really looks like a block-layer issue
> rather than a generic problem.
>
> > * See item 3 for special handling.
> >
> > 3. add_random slowness.
> >
> > One thing I observed with MQ on and off was the block layer tunable
> > add_random, which, as I understand it, tunes the disk's entropy
> > contribution. With non-MQ it is turned on, and with MQ it is turned
> > off by default.
> >
> > This got noticed because, when I was running multi-port testing, there
> > was a big drop in IOPS between the runs with and without MQ (~200K
> > IOPS to 1M+ IOPS, when the test ran on the same NUMA node / across
> > NUMA nodes).
> >
> > Just wondering why we have it ON in one setting and OFF in the other.
> >
> > Sorry for the rather long e-mail, but your comments/thoughts are much
> > appreciated.
> >
> You definitely want to use the automatic IRQ-affinity patches from
> Christoph; that proved to be a major gain in high-performance setups
> (eg when running off an all-flash array).

That change is not yet present in the driver. In the meantime, I was using
irqbalance (oneshot) / a custom script to try out various MSI-X vector to
CPU mappings.

Regards,
-Arun

>
> Overall, I'm very much interested in these topics; let's continue with
> the discussion to figure out what the best approach here might be.
>
> Cheers,
>
> Hannes
>
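One more thought on the decoupling idea, to make sure I am reading you
right: is the below roughly the shape of it? This is a toy sketch with
made-up names (struct ql_host, ql_alloc_hw_id, ql_free_hw_id,
QL_MAX_HW_IDS), and a plain bitmap under a lock is obviously not the
efficient algorithm you are after -- it is only meant to illustrate
allocating the hardware ID separately from the block-layer tag.

#include <linux/bitops.h>
#include <linux/spinlock.h>
#include <linux/types.h>

#define QL_MAX_HW_IDS	2048	/* firmware task/XID space; illustrative value */

struct ql_host {
	spinlock_t hw_id_lock;
	DECLARE_BITMAP(hw_id_map, QL_MAX_HW_IDS);
};

/* Allocate a hardware ID at queuecommand time, independent of rq->tag. */
static int ql_alloc_hw_id(struct ql_host *h)
{
	unsigned long flags;
	int id;

	spin_lock_irqsave(&h->hw_id_lock, flags);
	id = find_first_zero_bit(h->hw_id_map, QL_MAX_HW_IDS);
	if (id < QL_MAX_HW_IDS)
		__set_bit(id, h->hw_id_map);
	else
		id = -1;	/* firmware ID space exhausted; caller requeues */
	spin_unlock_irqrestore(&h->hw_id_lock, flags);

	return id;
}

/*
 * Freed only once the firmware has truly given up the ID (e.g. after
 * ABTS/RRQ processing completes), which is what would let the block tag
 * be completed with -ETIMEDOUT earlier, as you describe.
 */
static void ql_free_hw_id(struct ql_host *h, int id)
{
	unsigned long flags;

	spin_lock_irqsave(&h->hw_id_lock, flags);
	__clear_bit(id, h->hw_id_map);
	spin_unlock_irqrestore(&h->hw_id_lock, flags);
}

If that is the right shape, then a generic, efficient allocator for this in
the kernel would indeed be useful to all the drivers you list.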