Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

Matthew Ruffell <matthew.ruffell@xxxxxxxxxxxxx> · Tue, 12 Oct 2021 17:58:07 +1300

Hi Alex,

On Wed, Oct 6, 2021 at 12:13 PM Alex Williamson
<alex.williamson@xxxxxxxxxx> wrote:
> With both of these together, I'm so far able to prevent an interrupt
> storm for these cards.  I'd say the patch below is still extremely
> experimental, and I'm not sure how to get around the really hacky bit,
> but it would be interesting to see if it resolves the original issue.
> I've not yet tested this on a variety of devices, so YMMV.  Thanks,

Thank you very much for your analysis and for the experimental patch. We have
excellent news to report.

I sent Nathan a test kernel built on 5.14.0, and he has been running the
reproducer for a few days now.

Nathan writes:

> I've been testing heavily with the reproducer for a few days using all 8 GPUs
> and with the MSI fix for the audio devices in the guest disabled, i.e. a pretty
> much worst case scenario. As a control with kernel 5.14 (unpatched), the system
> locked up in 2, 2, 6, 1, and 4 VM reset iterations, all in less than 10 minutes
> each time. With the patched kernel I'm currently at 1226 iterations running for
> 2 days 10 hours with no failures. This is excellent. FYI, I have disabled the
> dyndbg setting.

The system is stable, and your patch sounds very promising.

Nathan does have a small side effect to report:

> The only thing close to an issue that I have is that I still get frequent
> "irq 112: nobody cared" and "Disabling IRQ #112" errors. They just no longer
> lock up the system. If I watch the reproducer time between VM resets, I've
> noticed that it takes longer for the VM to start up after one of these
> "nobody cared" errors, and thus it takes longer until I can reset the VM again.
> I believe slow guest behavior in this disabled IRQ scenario is expected though?

Full dmesg:
https://paste.ubuntu.com/p/hz8WdPZmNZ/

I had a look at all the lspci output Nathan has provided me in the past, but IRQ
112 isn't listed. I will ask Nathan for a fresh lspci so we can see what device
it is. The interesting thing is that we still hit __report_bad_irq() for IRQ 112
even after it has already been disabled, typically after 1000+ seconds have gone
by.
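
As a complement to lspci, the host's /proc/interrupts may also show who owns
IRQ 112, since the trailing columns of each row usually carry the interrupt
controller and the registered handler name. Below is a rough sketch of pulling
that out. The helper name describe_irq() is made up for illustration, and it
assumes the usual layout of a CPU header row followed by one row per IRQ:

```python
def describe_irq(interrupts_text, irq):
    """Return the trailing description columns for `irq`, or None if absent.

    Assumes the usual /proc/interrupts layout: a header row naming the CPUs,
    then one row per IRQ with a per-CPU count followed by free-form columns
    (controller, trigger type, handler name).
    """
    lines = interrupts_text.splitlines()
    if not lines:
        return None
    # The header row has one column per online CPU (CPU0, CPU1, ...).
    ncpus = len(lines[0].split())
    for line in lines[1:]:
        # Split on the first colon only; handler names may contain colons
        # themselves (for example a PCI address).
        label, _, rest = line.partition(":")
        if label.strip() != str(irq):
            continue
        fields = rest.split()
        # Skip the per-CPU counters and keep whatever describes the IRQ.
        return " ".join(fields[ncpus:])
    return None
```

On the host this would be called as
describe_irq(open("/proc/interrupts").read(), 112). If the row is missing, the
function returns None, in which case the fresh lspci is still the way to go.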

We think your patch fixes the interrupt storm issues. We are happy to continue
testing for as much as you need, and we are happy to test any followup patch
revisions.

Is there anything you can do to feel more comfortable about the
PCI_DEV_FLAGS_MSI_INTX_DISABLE_BUG dev flag hack? While it works, I can see why
you might not want to land it in mainline.

Thanks,
Matthew