RE: [PATCH] aacraid: reply queue mapping to CPUs based of IRQ affinity

<Sagar.Biradar@xxxxxxxxxxxxx> · Fri, 14 Apr 2023 20:07:20 +0000

-----Original Message-----
From: John Garry <john.g.garry@xxxxxxxxxx> 
Sent: Wednesday, April 12, 2023 2:38 AM
To: Sagar Biradar - C34249 <Sagar.Biradar@xxxxxxxxxxxxx>; Don Brace - C33706 <Don.Brace@xxxxxxxxxxxxx>; Gilbert Wu - C33504 <Gilbert.Wu@xxxxxxxxxxxxx>; linux-scsi@xxxxxxxxxxxxxxx; martin.petersen@xxxxxxxxxx; jejb@xxxxxxxxxxxxx; brking@xxxxxxxxxxxxxxxxxx; stable@xxxxxxxxxxxxxxx; Tom White - C33503 <Tom.White@xxxxxxxxxxxxx>
Subject: Re: [PATCH] aacraid: reply queue mapping to CPUs based of IRQ affinity

EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe

On 10/04/2023 22:17, Sagar.Biradar@xxxxxxxxxxxxx wrote:
> On 28/03/2023 22:41, Sagar Biradar wrote:
>> Fix the IO hang that arises because of MSIx vector not having a 
>> mapped online CPU upon receiving completion.
> What about if the CPU targeted goes offline while the IO is in-flight?
>
>> This patch sets up a reply queue mapping to CPUs based on the IRQ 
>> affinity retrieved using pci_irq_get_affinity() API.
>>
> blk-mq already does what you want here, including handling for the case I mention above. It maintains a CPU -> HW queue mapping, and using a reply map in the LLD is the old way of doing this.
>
> Could you instead follow the example in commit 664f0dce2058 ("scsi:
> mpt3sas: Add support for shared host tagset for CPU hotplug"), and expose the HW queues to the upper layer? You can alternatively check the example of any SCSI driver which sets shost->host_tagset for this.
>
> Thanks,
> John
> [Sagar Biradar]
>
> ***What about if the CPU targeted goes offline while the IO is in-flight?
> We ran multiple random cases with the IO's running in parallel and disabling load-bearing CPU's. We saw that the load was transferred to the other online CPUs successfully every time.
> The same was tested at vendor and their customer site - they did not see any issues too.

You need to ensure that all CPUs associated with the HW queue are offline and stay offline until any IO may timeout, which would be 30 seconds according to SCSI sd default timeout. I am not sure if you were doing that exactly. 

[Sagar Biradar] 
Responding again inline to the original thread to avoid confusion . . . 

We disabled 14 out of 16 CPUs and each time we saw the interrupts migrated to the other CPUs.
The CPUs remained offline for varying times, each of which were more than 30 seconds.
We monitored proper behavior of the threads running on CPUs and observed them migrating to other CPUs as they were disabled.
We, along with the vendor/customer, did not observe any command timeouts in these experiments. 
In case any commands time out - the driver will resort to the error handling mechanism.

Also, there is this patch which addresses the concerns John Garry raised.
https://lore.kernel.org/lkml/20220929033428.25948-1-mj0123.lee@xxxxxxxxxxx/T/

This patch explains how the coordination happens when a CPU goes offline.
IPI can be read Inter-Processor Interrupt.
The request shall be completed from the CPU where it is running when the original CPU goes offline.

>
>
> ***blk-mq already does what you want here, including handling for the case I mention above. It maintains a CPU -> HW queue mapping, and using a reply map in the LLD is the old way of doing this.
> We also tried implementing the blk-mq mechanism in the driver and we saw command timeouts.
> The firmware has limitation of fixed number of queues per vector and the blk-mq changes would saturate that limit.
> That answers the possible command timeout.
>
> Also this is EOL product and there will be no firmware code changes. Given this, we have decided to stick to the reply_map mechanism.
> (https://urldefense.com/v3/__https://storage.microsemi.com/en-us/suppo
> rt/series8/index.php__;!!ACWV5N9M2RV99hQ!PLrbfoEBvEGxw2CvahCL0AP5c4f5c
> Q8gT0ahXVgB0mSbyqxWJ8pdtYY0JwRL8xZ59k0NHJhXCBbMtVWlq5pYMeOEHmw7ww$  )
>
> Thank you for your review comments and we hope you will reconsider the original patch.

I've been checking the driver a bit more and this drivers uses some "reserved" commands, right? That would be internal commands which the driver sends to the adapter which does not have a scsi_cmnd associated.
If so, it gets a bit more tricky to use blk-mq support for HW queues, as we need to manually find a HW queue for those "reserved commands", like
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/scsi/hisi_sas/hisi_sas_main.c?h=v6.3-rc6#n532

Anyway, it's not up to me ... 

[Sagar Biradar] 
Yes, we have reserved commands, that originate from within the driver.
We rely on the reply_map mechanism (from the original patch) to get interrupt vector for the reserved commands too.

Thanks,
John