NoSnoop is a PCIe attribute that allows devices to issue transactions that bypass cache. If devices are allowed to do this then the hypervisor must emulate instructions like WBINVD or else drivers in the guest have no way to synchronize RAM for the device. Instead of forcing WBINVD when a device is assigned, we can instead prevent the device from using NoSnoop, making all transactions coherent. The danger here is whether we can expect devices to fully support this bit, but this is an improvement over neglecting the problem. This fixes the Code 43 error seen on Windows guests with Intel host systems when trying to assign Nvidia VGA devices. Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx> --- VGA folks - Use a v3.12-rc1 or newer kernel and the 6 qemu patches from my pull request yesterday and you should see a big improvement with this patch. GTX660 is now working well on my Intel box with Win7 guest and Nvidia drivers. RFC - There are a number of solutions to this coherency problem. Legacy KVM device assignment solves it by using the IOMMU's ability to strip NoSnoop attributes from transactions. When this is unavailable, WBINVD emulation is enabled in KVM. AMD-Vi is always capable of doing this, VT-d is a mix. The flaw in its design is that it remaps the guest non-atomically when the IOMMU domain capabilities change. If we use the IOMMU to strip NoSnoop, we cannot guarantee that the feature will always be available and may change for a domain as devices are added or removed. Therefore KVM needs to assume non- coherent or have the ability to switch when VFIO devices are attached. This is where I was headed with the KVM-VFIO pseudo device. However, if we disable NoSnoop at the device, as done here, KVM can always operate assuming cache coherent I/O. Per the spec, this should work. I'm not sure how this will play out across real hardware, but it avoids a lot of messiness with page table flags when IOMMU domains change properties and requires no new interfaces for VFIO or KVM. Note that clearing and preventing writes to the NoSnoop enable bit is not a VFIO problem. VFIO is a userspace driver interface and other drivers may have legitimate use of NoSnoop. This is a hypervisor problem because the hypervisor would like to optimize away WBINVD emulation. QEMU is therefore the proper place to clear the bit and prevent futher changes. hw/misc/vfio.c | 35 +++++++++++++++++++++++++++++++---- 1 file changed, 31 insertions(+), 4 deletions(-) diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c index a2d5283..a2e5742 100644 --- a/hw/misc/vfio.c +++ b/hw/misc/vfio.c @@ -163,6 +163,7 @@ typedef struct VFIODevice { VFIOINTx intx; unsigned int config_size; uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */ + uint8_t *read_only_config_bits; /* Bits not guest modifiable to VFIO */ off_t config_offset; /* Offset of config space region within device fd */ unsigned int rom_size; off_t rom_offset; /* Offset of ROM region within device fd */ @@ -1968,13 +1969,20 @@ static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr, uint32_t val, int len) { VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev); - uint32_t val_le = cpu_to_le32(val); + uint32_t ro_bits, ro_val = 0, val_le; DPRINTF("%s(%04x:%02x:%02x.%x, @0x%x, 0x%x, len=0x%x)\n", __func__, vdev->host.domain, vdev->host.bus, vdev->host.slot, vdev->host.function, addr, val, len); - /* Write everything to VFIO, let it filter out what we can't write */ + memcpy(&ro_bits, vdev->read_only_config_bits + addr, len); + /* + * Read-only bits should be cleared in pdev->wmask and set to the + * value we want for hardware in pdev->config. + */ + ro_val = pci_default_read_config(pdev, addr, len); + val_le = (cpu_to_le32(ro_val) & ro_bits) | (cpu_to_le32(val) & ~ro_bits); + if (pwrite(vdev->fd, &val_le, len, vdev->config_offset + addr) != len) { error_report("%s(%04x:%02x:%02x.%x, 0x%x, 0x%x, 0x%x) failed: %m", __func__, vdev->host.domain, vdev->host.bus, @@ -2554,10 +2562,10 @@ static void vfio_add_emulated_long(VFIODevice *vdev, int pos, static int vfio_setup_pcie_cap(VFIODevice *vdev, int pos, uint8_t size) { - uint16_t flags; + uint16_t flags, devctl; uint8_t type; - flags = pci_get_word(vdev->pdev.config + pos + PCI_CAP_FLAGS); + flags = pci_get_word(vdev->pdev.config + pos + PCI_EXP_FLAGS); type = (flags & PCI_EXP_FLAGS_TYPE) >> 4; if (type != PCI_EXP_TYPE_ENDPOINT && @@ -2569,6 +2577,22 @@ static int vfio_setup_pcie_cap(VFIODevice *vdev, int pos, uint8_t size) return -EINVAL; } + /* + * As a VM we need to worry about how the hypervisor handles coherency. + * In the typical, non-assigned device case, everything is coherent and + * there's no need to emulate coherency instructions. PCIe however + * allows devices to do non-coherent transactions with the NoSnoop + * attribute. Disable the device from using these transactions and + * expose the bit as read-only to the guest. + */ + devctl = vfio_pci_read_config(&vdev->pdev, pos + PCI_EXP_DEVCTL, 2); + devctl &= ~PCI_EXP_DEVCTL_NOSNOOP_EN; + vfio_pci_write_config(&vdev->pdev, pos + PCI_EXP_DEVCTL, devctl, 2); + vfio_add_emulated_word(vdev, pos + PCI_EXP_DEVCTL, 0, + PCI_EXP_DEVCTL_NOSNOOP_EN); + vfio_set_word_bits(vdev->read_only_config_bits + pos + PCI_EXP_DEVCTL, + PCI_EXP_DEVCTL_NOSNOOP_EN, PCI_EXP_DEVCTL_NOSNOOP_EN); + if (!pci_bus_is_express(vdev->pdev.bus)) { /* * Use express capability as-is on PCI bus. It doesn't make much @@ -3559,6 +3583,7 @@ static int vfio_initfn(PCIDevice *pdev) /* vfio emulates a lot for us, but some bits need extra love */ vdev->emulated_config_bits = g_malloc0(vdev->config_size); + vdev->read_only_config_bits = g_malloc0(vdev->config_size); /* QEMU can choose to expose the ROM or not */ memset(vdev->emulated_config_bits + PCI_ROM_ADDRESS, 0xff, 4); @@ -3621,6 +3646,7 @@ out_teardown: vfio_unmap_bars(vdev); out_put: g_free(vdev->emulated_config_bits); + g_free(vdev->read_only_config_bits); vfio_put_device(vdev); vfio_put_group(group); return ret; @@ -3640,6 +3666,7 @@ static void vfio_exitfn(PCIDevice *pdev) vfio_teardown_msi(vdev); vfio_unmap_bars(vdev); g_free(vdev->emulated_config_bits); + g_free(vdev->read_only_config_bits); g_free(vdev->rom); vfio_put_device(vdev); vfio_put_group(group); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html