Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

Alex Williamson <alex.williamson@xxxxxxxxxx> · Tue, 12 Oct 2021 14:05:16 -0600

On Tue, 12 Oct 2021 17:58:07 +1300
Matthew Ruffell <matthew.ruffell@xxxxxxxxxxxxx> wrote:

> Hi Alex,
> 
> On Wed, Oct 6, 2021 at 12:13 PM Alex Williamson
> <alex.williamson@xxxxxxxxxx> wrote:
> > With both of these together, I'm so far able to prevent an interrupt
> > storm for these cards.  I'd say the patch below is still extremely
> > experimental, and I'm not sure how to get around the really hacky bit,
> > but it would be interesting to see if it resolves the original issue.
> > I've not yet tested this on a variety of devices, so YMMV.  Thanks,  
> 
> Thank you very much for your analysis and for the experimental patch, and we
> have excellent news to report.
> 
> I sent Nathan a test kernel built on 5.14.0, and he has been running the
> reproducer for a few days now.
> 
> Nathan writes:
> 
> > I've been testing heavily with the reproducer for a few days using all 8 GPUs
> > and with the MSI fix for the audio devices in the guest disabled, i.e. a pretty
> > much worst case scenario. As a control with kernel 5.14 (unpatched), the system
> > locked up in 2,2,6,1, and 4 VM reset iterations, all in less than 10 minutes
> > each time. With the patched kernel I'm currently at 1226 iterations running for
> > 2 days 10 hours with no failures. This is excellent. FYI, I have disabled the
> > dyndbg setting.  
> 
> The system is stable, and your patch sounds very promising.

Great, I also ran a VM reboot loop for several days with all 6 GPUs
assigned, no interrupt issues.

> Nathan does have a small side effect to report:
> 
> > The only thing close to an issue that I have is that I still get frequent
> > "irq 112: nobody cared" and "Disabling IRQ #112" errors. They just no longer
> > lock up the system. If I watch the reproducer time between VM resets, I've
> > noticed that it takes longer for the VM to startup after one of these
> > "nobody cared" errors, and thus it takes longer until I can reset the VM again.
> > I believe slow guest behavior in this disabled IRQ scenario is expected though?  
> 
> Full dmesg:
> https://paste.ubuntu.com/p/hz8WdPZmNZ/
> 
> I had a look at all the lspci Nathan has provided me in the past, but 112 isn't
> listed. I will ask Nathan for a fresh lspci so we can see what device it is.
> The interesting thing is that we still hit __report_bad_irq() for 112 when we
> have previously disabled it, typically after 1000+ seconds have gone by.

The device might need to be operating in INTx mode, or at least to have
been at some point, to get the register filled.  It's essentially just
a scratch register on the card that gets filled when the interrupt is
configured.

Each time we register a new handler for the irq the masking due to
spurious interrupt will be removed, but if it's actually causing the VM
boot to take longer that suggests to me that the guest driver is
stalled, perhaps because it's expecting an interrupt that's now masked
in the host.  This could also be caused by a device that gets
incorrectly probed for PCI-2.3 compliant interrupt masking.  For
probing we can really only test that we have the ability to set the
DisINTx bit; we can only hope that the hardware folks also properly
implemented the INTx status bit to indicate the device is signaling
INTx.  We should really figure out which device this is so that we can
focus on whether it's another shared interrupt issue or something
specific to the device.
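
To make the two bits concrete, here is a rough userspace sketch (not
kernel code).  The bit values match the PCI spec and
include/uapi/linux/pci_regs.h; the helper names are made up for
illustration.  It shows why the probe is weaker than we'd like: toggling
DisINTx and reading it back only proves the Command bit is writable, it
says nothing about whether the Status bit is wired up correctly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions per the PCI spec (same values as in pci_regs.h). */
#define PCI_COMMAND_INTX_DISABLE 0x0400 /* Command register: DisINTx */
#define PCI_STATUS_INTERRUPT     0x0008 /* Status register: INTx status */

/* Hypothetical helper: does the device claim to be asserting INTx? */
static inline bool intx_pending(uint16_t status)
{
	return (status & PCI_STATUS_INTERRUPT) != 0;
}

/* Hypothetical helper: is INTx currently masked at the device? */
static inline bool intx_masked(uint16_t command)
{
	return (command & PCI_COMMAND_INTX_DISABLE) != 0;
}

/*
 * Hypothetical probe check: given the Command value before toggling
 * DisINTx and the value read back afterwards, report whether the bit
 * actually changed.  This is all a probe can verify; it cannot tell
 * whether the device implements the INTx status bit properly.
 */
static inline bool disintx_writable(uint16_t before, uint16_t readback)
{
	return ((before ^ readback) & PCI_COMMAND_INTX_DISABLE) != 0;
}
```

A device that passes disintx_writable() but never sets the status bit
would look maskable to us while intx_pending() always returns false,
which is the failure mode described above.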

I'm also confused why this doesn't trigger the same panic/kexec as we
were seeing with the other interrupt lines.  Are there some downstream
patches or configs missing here that would promote these to more fatal
errors?

> We think your patch fixes the interrupt storm issues. We are happy to continue
> testing for as much as you need, and we are happy to test any followup patch
> revisions.
> 
> Is there anything you can do to feel more comfortable about the
> PCI_DEV_FLAGS_MSI_INTX_DISABLE_BUG dev flag hack? While it works, I can see why
> you might not want to land it in mainline.

Yeah, it's a huge hack.  I wonder if we could look at the interrupt
status and make clearing DisINTx conditional on the lack of a pending
interrupt.  It seems somewhat reasonable not to clear the bit masking
the interrupt if we know it's pending and know there's no handler for
it.  I'll try to check if that's possible.  Thanks,

Alex
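
P.S. A minimal sketch of the conditional-clear idea, purely to
illustrate the decision and not a proposed patch.  The function name and
the have_handler parameter are hypothetical; the bit values are the
standard ones from pci_regs.h.  It assumes the INTx status bit can be
trusted, which is exactly what would need checking on real hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x0400 /* Command register: DisINTx */
#define PCI_STATUS_INTERRUPT     0x0008 /* Status register: INTx status */

/*
 * Hypothetical policy: compute the Command register value to write when
 * something asks to unmask INTx.  If the device reports a pending
 * interrupt and nobody is registered to handle it, leave the register
 * untouched so DisINTx stays as it was; otherwise clear DisINTx.
 */
static inline uint16_t command_after_intx_enable(uint16_t command,
						 uint16_t status,
						 bool have_handler)
{
	bool pending = (status & PCI_STATUS_INTERRUPT) != 0;

	if (pending && !have_handler)
		return command; /* don't unmask into an unhandled interrupt */

	return command & (uint16_t)~PCI_COMMAND_INTX_DISABLE;
}
```

With command = 0x0407 and the status bit set, the value comes back
unchanged when there is no handler, and as 0x0007 once a handler exists
or the interrupt is no longer pending.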