On Wed, 10 Jan 2018 17:40:13 +0100
Binarus <lists@xxxxxxxxxx> wrote:

> Alex, thank you! I think I have solved the performance problem and have
> made some interesting observations.
> 
> On 09.01.2018 23:41, Alex Williamson wrote:
> >> - Could you please briefly explain what exactly it wants to tell me
> >> when it says that it disables INT xx, and notably if this is a bad
> >> thing I should take care of?
> > 
> > The "Disabling IRQ XX, nobody cared" message means that the specified
> > IRQ asserted many times without any of the interrupt handlers claiming
> > that it was their device asserting it.  It then masks the interrupt at
> > the APIC.  With device assignment this can mean that the mechanism we
> > use to mask the device doesn't work for that device.  There's a
> > vfio-pci module option you can use to have vfio-pci mask the interrupt
> > at the APIC rather than the device, nointxmask=1.  The trouble with
> > this option is that it can only be used with exclusive interrupts, so
> > if any other devices share the interrupt, starting the VM will fail.
> > As a test, you can unbind conflicting devices from their drivers
> > (assuming non-critical devices).
> 
> This statement has put me on the right track:
> 
> First, I rebooted the machine without vfio_pci and looked into
> /proc/interrupts. The SATA controller in question was bound to INT 37
> and was the *only* device using that INT.
> 
> I then rebooted with vfio_pci active and tried to start the VM, passing
> through the SATA controller to it. As described in my previous messages,
> the console showed an error message saying that it disabled INT 16 (!)
> when starting the VM.
> 
> I looked into /proc/interrupts again and noticed that INT 16 was bound
> to one of the USB ports, and that this was the only device using INT 16.
> 
> Then I added nointxmask=1 to vfio_pci's options, ran depmod, updated
> the initramfs, and kept this setting for all further experiments.
> 
> After having rebooted, I removed all "x-no-" options (the ones we talked
> about recently) from the device definitions of the VM. Then I unbound
> the USB port in question (i.e. the one which used INT 16) from its
> driver. Although lspci was still claiming that this USB port was using
> INT 16, /proc/interrupts showed that INT 16 was not bound to a driver
> any more.
> 
> Then I started the VM. The console did not show any messages any more,
> the VM booted without any issue, *and SATA speed was back to normal
> again* (100 MB/s with nointxmask=1 and that USB port unbound versus 2
> MB/s without nointxmask and without unbinding that USB port).
> 
> I have lost one USB port, but finally have the full SATA hardware in the
> VM. I can very well live with the lost USB port because there are plenty
> of them, and it was USB 1.1 anyway. I will stick with this configuration
> for the time being.
> 
> *And here is the interesting (from my naive point of view) part which
> might explain what happened:*
> 
> /proc/interrupts (with the VM running!) shows that *vfio-intx is using
> INT 16* now. KVM / QEMU obviously had the idea to assign INT 16 to the
> vfio device *although* INT 16 was already bound to a USB port which was
> active in the host, and *although* the device which is passed through
> would be at INT 37 if vfio_pci were not active.
> 
> Therefore, the console was showing the error message regarding INT 16;
> obviously, the kernel / KVM / QEMU could not handle the interrupt
> sharing between the host USB port and the vfio_pci device which KVM /
> QEMU had made necessary.
> 
> By the way, this is the only vfio_pci device on this machine.
> 
> Should we consider this behavior a bug? Why does a vfio_pci device get
> bound to an interrupt which is bound to another hardware device on the
> host? Do we have any chance to influence that (modinfo vfio_pci does not
> show any parameter related to interrupt numbers)?

Sharing legacy interrupts on PCI is normal, and vfio_pci really has no
control over which interrupt is assigned; it's largely dependent on the
hardware routing.  Sometimes this routing can be controlled in the BIOS
or via ACPI, but it's effectively static from a driver perspective.
When comparing native vs. vfio interrupts, make sure the device is in
the same interrupt mode; for instance, does /proc/interrupts show
PCI-MSI or IO-APIC for the interrupt line in question?  In legacy
interrupt mode, we're using the IO-APIC.

If using nointxmask=1 resolves the issue, then the device does not
fully support DisINTx, a feature introduced in PCI 2.3 and rather vital
to device assignment.  This feature added both an interrupt status bit
and an interrupt disable control bit.  With these, we can generically
determine whether the device is asserting the interrupt line and mask
the interrupt on the host while it's handled by the guest.  This allows
assigned devices to use shared interrupts without device-specific
drivers.  Unfortunately, our only mechanism to probe whether a device
supports DisINTx is to test whether the control bit is writable, which
leaves us with devices like this one, where that bit may be writable
but either the INTx status bit never gets set, causing us to always
think another device is asserting the interrupt, or the DisINTx bit
doesn't actually do anything, allowing the device to continue to assert
the interrupt while we think the device is masked.  Either case can
lead to the interrupt being disabled due to too many spurious,
unhandled interrupts.

Disabling this support means that the assigned device needs an
exclusive interrupt and we use the IO-APIC to mask the interrupt.  Thus
we know that any interrupt is necessarily asserted by our device, and
we can mask it without depending on the device.

As with DMA aliasing, the kernel has quirks to deal with this;
nointxmask is just a parameter for testing whether this is the cause.
You can test adding another quirk to the kernel with this patch:

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 10684b17d0bd..602ba0bd5291 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3262,6 +3262,9 @@ static void quirk_broken_intx_masking(struct pci_dev *dev)
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x37d2,
 			quirk_broken_intx_masking);
 
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MARVELL_EXT, 0x9128,
+			quirk_broken_intx_masking);
+
 static u16 mellanox_broken_intx_devs[] = {
 	PCI_DEVICE_ID_MELLANOX_HERMON_SDR,
 	PCI_DEVICE_ID_MELLANOX_HERMON_DDR,

You should be able to remove the nointxmask=1 option with this, but
you'll still need to make sure the device isn't sharing an interrupt,
as you did previously.  Thanks,

Alex
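
For readers following the DisINTx discussion above, here is a minimal
sketch of the kind of writability probe Alex describes. It is not the
actual kernel or vfio-pci implementation, and the function name
intx_disable_bit_writable is made up for illustration; it only shows
that generic code can do no more than toggle the INTx disable bit in
the PCI command register and check whether the new value sticks.

/*
 * Illustrative sketch only -- not the kernel's actual code.  The only
 * generic probe for DisINTx support is to check whether the INTx
 * disable bit in the PCI command register is writable; this cannot
 * tell whether the bit really gates the interrupt, or whether the
 * INTx status bit ever reflects reality.
 */
#include <linux/pci.h>

static bool intx_disable_bit_writable(struct pci_dev *pdev)
{
	u16 orig, toggled, readback;
	bool writable;

	pci_read_config_word(pdev, PCI_COMMAND, &orig);

	/* Flip the INTx disable bit and write it back. */
	toggled = orig ^ PCI_COMMAND_INTX_DISABLE;
	pci_write_config_word(pdev, PCI_COMMAND, toggled);

	/* If the flipped value is what we read back, the bit is writable. */
	pci_read_config_word(pdev, PCI_COMMAND, &readback);
	writable = (readback & PCI_COMMAND_INTX_DISABLE) ==
		   (toggled & PCI_COMMAND_INTX_DISABLE);

	/* Restore the original command register value. */
	pci_write_config_word(pdev, PCI_COMMAND, orig);

	return writable;
}

A device like the SATA controller discussed here can pass such a check
even though masking does not actually work, which is exactly why an
explicit quirk (or nointxmask=1) is needed.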