> On 23 Feb 2021, at 08:57, John Garry <john.garry@xxxxxxxxxx> wrote:
>
> On 22/02/2021 14:23, Roger Willcocks wrote:
>> FYI we have exactly this issue on a machine here running CentOS 8.3 (kernel 4.18.0-240.1.1), so presumably this happens in RHEL 8 too.
>> The controller is an MSCC / Adaptec 3154-8i16e driving 60 x 12TB HGST drives configured as five twelve-drive RAID-6 arrays, software-striped using md and formatted with XFS.
>> Test software writes to the array using multiple threads in parallel.
>> The smartpqi driver would report the controller offline within ten minutes or so, with status code 0x6100c.
>> Changed the driver to set 'nr_hw_queues = 1' and then tested by filling the array with random files (which took a couple of days). That completed fine, so it looks like that one-line change fixes it.
>
> That just makes the driver single-queue.

All I can say is that it fixes the problem. Write performance is two or three percent faster than CentOS 6.5 on the same hardware.

> As such, since the driver uses blk_mq_unique_tag_to_hwq(), only hw queue #0 will ever be used in the driver.
>
> And then, since the driver still spreads MSI-X interrupt vectors over all CPUs [from pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)], if the CPUs associated with hw queue #0 are offlined (probably just cpu0), there are no CPUs available to service queue #0's interrupt. That's what I think would happen, from a quick glance at the code.

Surely that would be an issue even if it used multiple queues (one of which would be queue #0)?

>> Would, of course, be helpful if this was back-ported.
>> —
>> Roger