On Fri, 2016-01-08 at 12:22 +0530, Shyam wrote:
> Hi Alex,
>
> It will be hard to reproduce this on Fedora/RHEL. We have Ubuntu based
> server/VM & I can shift to any kernel/qemu/vfio versions that you
> recommend.
>
> Both our Host & Guest run Ubuntu Trusty (Ubuntu 14.04.3 LTS) with
> Linux kernel version 3.18.19 (from
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.18.19-vivid/).
>
> The QEMU version on the host is
> QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.21),
> Copyright (c) 2003-2008 Fabrice Bellard
>
> We are using 8 x Intel RMS3CC080 SSDs for this test. We expose these
> SSDs to the VM (through iSER) & then set up dm-stripe over them within
> the VM. We create two dm-linear devices out of this at 100GB size &
> expose them through SCST to an external server. The external server
> connects to these devices over iSER & has multipath with 4 paths
> (policy: queue-length:0) per device. From the external server we run
> fio with 4 threads, each with 64 outstanding IOs of 100% 4K random
> reads.
>
> This is the performance difference we see
>
> with PCI-assign to the VM
> randrw 100:0 64iodepth 4thr 4kb - R: 550,224K wait_us:2,245 cpu
> tot:85.57 usr:3.96 sys:31.55 iow:50.06
>
> i.e. we get 137-140K IOPS or 550MB/s
>
> with VFIO to the VM
> randrw 100:0 64iodepth 4thr 4kb - R: 309,432K wait_us:3,964 cpu
> tot:78.58 usr:2.28 sys:18.00 iow:58.30
>
> i.e. we get 77-80K IOPS or 310MB/s
>
> The only change between the two runs is that the VM is spawned with
> VFIO instead of pci-assign. There is no other difference in software
> versions or any settings.
>
> $ grep VFIO /boot/config-`uname -r`
> CONFIG_VFIO_IOMMU_TYPE1=m
> CONFIG_VFIO=m
> CONFIG_VFIO_PCI=m
> CONFIG_VFIO_PCI_VGA=y
> CONFIG_KVM_VFIO=y
>
> I uploaded the QEMU command line & lspci outputs at
> https://www.dropbox.com/s/imbqn0274i6hhnz/vfio_issue.tgz?dl=0
>
> Please let me know if you have any issues downloading it.
>
> Please let us know if you see any KVM acceleration disabled & suggest
> next steps to debug with VFIO tracing. Thanks for your help!

Thanks for the logs; everything appears to be set up correctly.

One suspicion I have is the difference between pci-assign and vfio-pci
in the way the MSI-X Pending Bit Array (PBA) is handled. Legacy KVM
device assignment handles MSI-X itself and ignores the PBA. On this
hardware the MSI-X vector table and PBA are nicely aligned on separate
4k pages, which means that pci-assign gives the VM direct access to
everything on the PBA page. vfio-pci, on the other hand, registers
MSI-X with QEMU, which does handle the PBA. The vast majority of
drivers never use the PBA, and the PCI spec includes an implementation
note suggesting that hardware vendors add additional alignment to
prevent MSI-X structures from overlapping with other registers. My
hypothesis is that this device perhaps does not abide by that
recommendation and may be accessing the PBA page regularly, causing a
vfio-pci assigned device to trap out to QEMU more often than a legacy
assigned device.

If I could ask you to build and run a new QEMU, I think we can easily
test this hypothesis by making vfio-pci behave more like pci-assign.
The following patch is based on QEMU 2.5 and simply skips the step of
placing the PBA memory region overlapping the device, allowing direct
access in this case. The patch is easily adaptable to older versions
of QEMU, but if we need to do any further tracing, it's probably best
to do so on 2.5 anyway.
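Incidentally, if you want to double-check the table/PBA layout on your
device yourself, below is a rough standalone sketch (not part of QEMU or
the patch; the sysfs path is only an example, it needs root to read past
the first 64 bytes of config space, and it assumes a little-endian host).
It walks the capability list in the device's sysfs config file, finds the
MSI-X capability, and reports whether the vector table and PBA land on
the same 4k page. lspci -vvv reports the same Table/PBA offsets, so this
is only a convenience.

/*
 * Rough sketch, not part of QEMU: report where a device's MSI-X vector
 * table and PBA live, and whether they share a 4k page, e.g.
 *   ./msix-layout /sys/bus/pci/devices/0000:83:00.0/config
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define PCI_CAP_ID_MSIX 0x11

static uint8_t cfg[4096];

static uint32_t rd32(unsigned off)
{
    uint32_t v;
    memcpy(&v, cfg + off, sizeof(v));
    return v;                          /* config space is little-endian */
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <sysfs config file>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }
    size_t len = fread(cfg, 1, sizeof(cfg), f);
    fclose(f);
    if (len < 64) {
        fprintf(stderr, "short config space\n");
        return 1;
    }

    /* walk the capability list, starting from the pointer at 0x34 */
    for (uint8_t pos = cfg[0x34]; pos; pos = cfg[pos + 1]) {
        if (cfg[pos] != PCI_CAP_ID_MSIX) {
            continue;
        }
        uint16_t ctrl  = cfg[pos + 2] | (cfg[pos + 3] << 8);
        uint32_t table = rd32(pos + 4);   /* Table Offset/BIR */
        uint32_t pba   = rd32(pos + 8);   /* PBA Offset/BIR   */

        printf("MSI-X: %u entries\n", (ctrl & 0x7ff) + 1);
        printf("  table: BAR %u offset 0x%x\n", table & 7, table & ~7u);
        printf("  PBA:   BAR %u offset 0x%x\n", pba & 7, pba & ~7u);
        printf("  table and PBA %s the same 4k page\n",
               (table & 7) == (pba & 7) &&
               ((table & ~7u) >> 12) == ((pba & ~7u) >> 12) ?
               "share" : "do NOT share");
        return 0;
    }
    fprintf(stderr, "no MSI-X capability found\n");
    return 1;
}

On this device the table and PBA are on separate pages, so the
interesting question is what else the driver touches on the PBA page,
which config space alone can't tell you; that's what the patch below is
meant to flush out.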
This is only a proof of concept; if it proves to be the culprit, we'll
need to think about how to handle it more cleanly. Here's the patch:

diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 64c93d8..a5ad18c 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -291,7 +291,7 @@ int msix_init(struct PCIDevice *dev, unsigned short nentries,
     memory_region_add_subregion(table_bar, table_offset, &dev->msix_table_mmio);
     memory_region_init_io(&dev->msix_pba_mmio, OBJECT(dev), &msix_pba_mmio_ops, dev,
                           "msix-pba", pba_size);
-    memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio);
+    /* memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio); */
 
     return 0;
 }
@@ -369,7 +369,7 @@ void msix_uninit(PCIDevice *dev, MemoryRegion *table_bar, MemoryRegion *pba_bar)
     dev->msix_cap = 0;
     msix_free_irq_entries(dev);
     dev->msix_entries_nr = 0;
-    memory_region_del_subregion(pba_bar, &dev->msix_pba_mmio);
+    /* memory_region_del_subregion(pba_bar, &dev->msix_pba_mmio); */
     g_free(dev->msix_pba);
     dev->msix_pba = NULL;
     memory_region_del_subregion(table_bar, &dev->msix_table_mmio);

Thanks,
Alex