Re: Unable to pass SATA controller to VM with intel_iommu=igfx_off

Alex Williamson <alex.williamson@xxxxxxxxxx> · Tue, 9 Jan 2018 15:41:29 -0700

On Tue, 9 Jan 2018 22:36:01 +0100
Binarus <lists@xxxxxxxxxx> wrote:

> To answer my own message:
> 
> On 09.01.2018 18:58, Binarus wrote:
> 
> > The Seabios boot screen hangs for about a minute or so. Then the OS
> > (W2K8 R2 server 64 bit) hangs forever at the first screen which shows
> > the progress bar. By booting into safe mode, I have found out that this
> > happens when it tries to load the classpnp.sys driver.
> > 
> > In some cases, when starting the VM, there was a message on the console
> > saying it was disabling IRQ 16.
> > 
> > This is the point where I am lost (again).  
> 
> It seems I have got it to work. I have added the option
> "x-no-kvm-intx=on" to the device definition. My command line is now:
> 
> /usr/bin/qemu-system-x86_64
>  -machine q35,accel=kvm
>  -cpu host
>  -smp cores=2,threads=2,sockets=1
>  -rtc base=localtime,clock=host,driftfix=none
>  -drive file=/vm-image/dax.img,format=raw,if=virtio,cache=writeback,index=0
>  -device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=2,chassis=1,id=root.1
>  -device vfio-pci,host=02:00.0,bus=root.1,addr=00.0,x-no-kvm-intx=on
>  -boot c
>  -pidfile /root/qemu-kvm/qemu-dax.pid
>  -m 12288
>  -k de
>  -daemonize
>  -usb -usbdevice "tablet"
>  -name dax
>  -device virtio-net-pci,vlan=0,mac=02:01:01:01:02:01
>  -net tap,vlan=0,name=dax,ifname=dax0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
>  -vnc :2
> 
> This command line makes the Seabios hang for between 30 and 60 seconds
> (it seems the time it takes is not always the same) during the boot
> process, but then boots up the W2K8 R2 server without any issue. Within
> the VM, I have installed the Marvell Windows drivers for the
> controller's chipset. Great!
> 
> And as desired, I can now cleanly "eject" the disks connected to that
> controller without leaving the VM, i.e. without visiting the host's console.
> 
> Remaining questions:
> 
> - What could make the Seabios hang for such a long time upon every boot?

Perhaps some sort of problem with the device ROM.  Assuming you're not
booting the VM from the assigned device, you can add rombar=0 to the
qemu vfio-pci device options to disable the ROM.  I suppose it's
possible that SeaBIOS might know how to talk to the device regardless
of the ROM, so no guarantees that will resolve it.  Setting a bootindex
both on the vfio-pci device and the actual boot device could help.  I
think the '-boot c' option is deprecated; explicitly specifying an
emulated controller would be better.  virt-install or virt-manager
would do this for you.  Also, using q35 vs 440fx for the VM machine
type makes no difference; q35 is, if anything, more troublesome imo.
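
As a rough sketch of the two suggestions above (not a tested
configuration): the host address, root port and image path come from
the quoted command line, while the drive id "disk0", the choice of
virtio-blk-pci as the explicit boot device and the bootindex values are
assumptions made purely for illustration.

```shell
# Base vfio-pci device string, copied from the quoted command line.
vfio_dev='vfio-pci,host=02:00.0,bus=root.1,addr=00.0,x-no-kvm-intx=on'

# Variant 1: do not expose the device's option ROM to the guest.
vfio_norom="$vfio_dev,rombar=0"

# Variant 2: keep the ROM, but give the real boot disk priority.
# "disk0" and the index values 1 and 2 are illustrative assumptions.
boot_disk='-drive file=/vm-image/dax.img,format=raw,if=none,id=disk0,cache=writeback'
boot_disk="$boot_disk -device virtio-blk-pci,drive=disk0,bootindex=1"
vfio_lowprio="$vfio_dev,bootindex=2"

# Print the arguments so they can be pasted into the QEMU command line.
printf '%s\n' "-device $vfio_norom"
printf '%s\n' "$boot_disk -device $vfio_lowprio"
```

Either variant would take the place of the existing vfio-pci -device
argument; the second would also replace the if=virtio -drive and the
'-boot c' option.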

> - Could you please briefly explain what the option "x-no-kvm-intx=on"
> does and why I need it in this case?

INTx is the legacy PCI interrupt (i.e. INTA, INTB, INTC, INTD).  This is
a level-triggered interrupt, so it continues to assert until the
device is serviced.  It must therefore be masked on the host while it
is handled by the guest.  There are two paths we can use for injecting
this interrupt into the VM and unmasking it on the host once the VM
samples the interrupt.  When KVM is used for acceleration, these happen
via direct connection between the vfio-pci and kvm modules using
eventfds and irqfds.  The x-no-kvm-intx option disables that path,
instead bouncing out to QEMU to do the same.

TBH, I have no idea why this would make it work.  The QEMU path is
slower than the KVM path, but they should be functionally identical.

> - Could you please briefly explain what exactly it wants to tell me when
> it says that it disables IRQ xx, and notably whether this is a bad thing I
> should take care of?

The "Disabling IRQ XX, nobody cared" message means that the specified
IRQ asserted many times without any of the interrupt handlers claiming
that it was their device asserting it.  The kernel then masks the
interrupt at the APIC.  With device assignment this can mean that the
mechanism we use to mask the device doesn't work for that device.
There's a vfio-pci module option you can use to have vfio-pci mask the
interrupt at the APIC rather than at the device, nointxmask=1.  The
trouble with this option is that it can only be used with exclusive
interrupts, so if any other devices share the interrupt, starting the
VM will fail.  As a test, you can unbind conflicting devices from their
drivers (assuming non-critical devices).

The troublesome point here is that regardless of x-no-kvm-intx, the
kernel uses the same masking technique for the device, so it's unclear
why one works and the other does not.
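
To see whether the interrupt is shared before trying nointxmask=1,
something along these lines might help.  It is only a sketch: the
helper name irq_line, the modprobe.d file name and the PCI address
0000:00:1a.0 are made up for illustration and need to be adapted to
the actual host.

```shell
# Print the /proc/interrupts line for one IRQ so that every handler
# sharing it is visible.
# $1 = IRQ number, $2 = file to read (defaults to /proc/interrupts).
irq_line() {
    grep -E "^ *$1:" "${2:-/proc/interrupts}"
}

# The steps below need root and are left as comments on purpose.
#
# Persist the module option (the file name is an assumption):
#   echo 'options vfio-pci nointxmask=1' > /etc/modprobe.d/vfio-nointxmask.conf
#
# Temporarily unbind a non-critical device that shares the IRQ
# (0000:00:1a.0 is a hypothetical address):
#   echo 0000:00:1a.0 > /sys/bus/pci/devices/0000:00:1a.0/driver/unbind
```

For example, 'irq_line 16' prints the line for IRQ 16; if more than
one handler is listed at the end of that line, the interrupt is shared.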

> - What about the "x-no-kvm-msi" and "x-no-kvm-msix" options? Would it be
> better to use them as well? I couldn't find any sound information about
> what exactly they do (Note: Initially, I had all three of those
> "x-no..." options active, which made the VM boot the first time, and
> later out of curiosity found out that "x-no-kvm-intx" is the essential
> one. Without this one, the VM won't boot; the other two don't seem to
> change anything in my case).

Similar to the INTx version, they route the interrupts out through QEMU
rather than inject them through a side channel with KVM.  They're just
slower.  Generally these options are only used for debugging, as they
make the interrupts visible to QEMU; functionality is not normally
affected.

What interrupt mode does the device operate in once the VM is running?
You can run 'lspci -vs <device address>' on the host and see something
like:

	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
	Capabilities: [70] MSI-X: Enable+ Count=10 Masked-

In this case the Enable+ shows the device is using MSI-X rather than
MSI, which shows Enable-.  The device might not support both (or
either).  If neither is Enable+, legacy interrupts are probably being
used.

Often legacy interrupts are only used at boot and then the device
switches to MSI/X.  If that's the case for this device, x-no-kvm-intx
doesn't really hurt you at runtime.
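
If it helps, the check above could be scripted roughly as follows.
This is a sketch; the function name irq_mode is made up, and it only
looks at the Enable flags in the lspci capability lines, falling back
to INTx when neither is set.

```shell
# Classify the active interrupt mode from `lspci -v` output on stdin.
# Reports "INTx" when neither MSI nor MSI-X shows Enable+.
irq_mode() {
    awk '
        /MSI-X:.*Enable[+]/ { msix = 1 }
        /MSI:.*Enable[+]/   { msi = 1 }
        END {
            if (msix)     print "MSI-X"
            else if (msi) print "MSI"
            else          print "INTx"
        }
    '
}
```

On the host, with the VM running, that would be used as
'lspci -vs 02:00.0 | irq_mode' (as root, since lspci hides the
capability list from unprivileged users).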

> - Could we expect your patch to go upstream (perhaps after the
> above issues / questions have been investigated)? I will try to convince
> the Debian people to include the patch in 4.9; if they refuse, I will
> have to compile a new kernel each time they release one, which has
> happened quite often for some time now (probably security fixes) ...

I would not recommend trying to convince Debian to take a non-upstream
patch.  The process is that I need to do more research to figure out
why this device isn't already quirked.  I'm sure others have complained,
but did similar patches make things worse for them, or did they simply
disappear?  Can you confirm whether the device behaves properly for
host use with the patch?  Issues with assigning the device could be
considered secondary if the host behavior is obviously improved.
Alternatively, the 9230, or various others in that section of the
quirk code, are already quirked, so you can decide if picking a
different $30 card is a better option for you ;) Thanks,

Alex