Re: [PATCH V3 15/25] smartpqi: fix host qdepth limit

John Garry <john.garry@xxxxxxxxxx> · Tue, 19 Jan 2021 10:33:36 +0000

Am 10.12.20 um 21:35 schrieb Don Brace:
From: Mahesh Rajashekhara <mahesh.rajashekhara@xxxxxxxxxxxxx>

* Correct scsi-mid-layer sending more requests than
    exposed host Q depth causing firmware ASSERT issue.
    * Add host Qdepth counter.

This supposedly fixes the regression between Linux 5.4 and 5.9, which
we reported in [1].

      kernel: smartpqi 0000:89:00.0: controller is offline: status code
0x6100c
      kernel: smartpqi 0000:89:00.0: controller offline

Thank you for looking into this issue and fixing it. We are going to
test this.

For easily finding these things in the git history or the WWW, it
would be great if these log messages could be included (in the
future).
DON> Thanks for your suggestion. Well add them in the next time.

Also, that means, that the regression is still present in Linux 5.10,
released yesterday, and this commit does not apply to these versions.

DON> They have started 5.10-RC7 now. So possibly 5.11 or 5.12
depending when all of the patches are applied. The patch in question
is among 28 other patches.

Mahesh, do you have any idea, what commit caused the regression and
why the issue started to show up?
DON> The smartpqi driver sets two scsi_host_template member fields:
.can_queue and .nr_hw_queues. But we have not yet converted to
host_tagset. So the queue_depth becomes nr_hw_queues * can_queue,
which is more than the hw can support. That can be verified by looking
at scsi_host.h.
         /*
          * In scsi-mq mode, the number of hardware queues supported by
the LLD.
          *
          * Note: it is assumed that each hardware queue has a queue
depth of
          * can_queue. In other words, the total queue depth per host
          * is nr_hw_queues * can_queue. However, for when host_tagset
is set,
          * the total queue depth is can_queue.
          */

So, until we make this change, the queue_depth change prevents the
above issue from happening.

can_queue and nr_hw_queues have been set like this as long as the driver existed. Why did Paul observe a regression with 5.9?

And why can't you simply set can_queue to (ctrl_info->scsi_ml_can_queue / nr_hw_queues)?

Don: I did this in an internal patch, but this patch seemed to work the best for our driver. HBA performance remained steady when running benchmarks.

I guess that this is a fallout from commit 6eb045e092ef ("scsi:
 core: avoid host-wide host_busy counter for scsi_mq"). But that commit 
is correct.

If .can_queue is set to (ctrl_info->scsi_ml_can_queue / nr_hw_queues), 
then blk-mq can send each hw queue only (ctrl_info->scsi_ml_can_queue / 
nr_hw_queues) commands, while it should be possible to send 
ctrl_info->scsi_ml_can_queue commands.

I think that this can alternatively be solved by setting .host_tagset flag.

Thanks,
John