Hi Alex,

You are spot on! Applying your patch on top of QEMU 2.5.50 (the latest
from github master) fully resolves the performance issue; we are back
to pci-assign performance numbers. Great!

Could you please see how to formalize this patch cleanly? I will be
happy to test any further patches for you.

Thanks a lot for your help!

--Shyam

On Sat, Jan 9, 2016 at 12:22 AM, Alex Williamson
<alex.williamson@xxxxxxxxxx> wrote:
> On Fri, 2016-01-08 at 12:22 +0530, Shyam wrote:
>> Hi Alex,
>>
>> It will be hard to reproduce this on Fedora/RHEL. We have an
>> Ubuntu-based server/VM & I can shift to any kernel/qemu/vfio
>> versions that you recommend.
>>
>> Both our host & guest run Ubuntu Trusty (Ubuntu 14.04.3 LTS) with
>> Linux kernel version 3.18.19 (from
>> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.18.19-vivid/).
>>
>> The QEMU version on the host is
>> QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.21),
>> Copyright (c) 2003-2008 Fabrice Bellard
>>
>> We are using 8 x Intel RMS3CC080 SSDs for this test. We expose these
>> SSDs to the VM (through iSER) & then set up dm-stripe over them
>> within the VM. Out of this we create two 100GB dm-linear devices &
>> expose them through SCST to an external server. The external server
>> connects to these devices over iSER & has 4 multipath paths per
>> device (policy: queue-length:0). From the external server we run fio
>> with 4 threads, each with 64 outstanding IOs of 100% 4K random
>> reads.
>>
>> This is the performance difference we see:
>>
>> with PCI-assign to the VM
>> randrw 100:0 64iodepth 4thr 4kb - R: 550,224K wait_us:2,245 cpu
>> tot:85.57 usr:3.96 sys:31.55 iow:50.06
>>
>> i.e. we get 137-140K IOPS or 550MB/s
>>
>> with VFIO to the VM
>> randrw 100:0 64iodepth 4thr 4kb - R: 309,432K wait_us:3,964 cpu
>> tot:78.58 usr:2.28 sys:18.00 iow:58.30
>>
>> i.e. we get 77-80K IOPS or 310MB/s
>>
>> The only change between the two runs is that the VM is spawned with
>> VFIO instead of pci-assign. There is no other difference in software
>> versions or any settings.
>>
>> $ grep VFIO /boot/config-`uname -r`
>> CONFIG_VFIO_IOMMU_TYPE1=m
>> CONFIG_VFIO=m
>> CONFIG_VFIO_PCI=m
>> CONFIG_VFIO_PCI_VGA=y
>> CONFIG_KVM_VFIO=y
>>
>> I uploaded the QEMU command line & lspci outputs at
>> https://www.dropbox.com/s/imbqn0274i6hhnz/vfio_issue.tgz?dl=0
>>
>> Please let me know if you have any issues downloading it.
>>
>> Please let us know if you see any sign that KVM acceleration is
>> disabled & suggest next steps to debug with VFIO tracing. Thanks
>> for your help!
>
> Thanks for the logs, everything appears to be set up correctly. One
> suspicion I have is the difference between pci-assign and vfio-pci in
> the way the MSI-X Pending Bits Array (PBA) is handled. Legacy KVM
> device assignment handles MSI-X itself and ignores the PBA. On this
> hardware the MSI-X vector table and PBA are nicely aligned on
> separate 4k pages, which means that pci-assign will give the VM
> direct access to everything on the PBA page. On the other hand,
> vfio-pci registers MSI-X with QEMU, which does handle the PBA. The
> vast majority of drivers never use the PBA, and the PCI spec includes
> an implementation note suggesting that hardware vendors include
> additional alignment to prevent MSI-X structures from overlapping
> with other registers. My hypothesis is that this device perhaps does
> not abide by that recommendation and may be regularly accessing the
> PBA page, thus causing a vfio-pci assigned device to trap through to
> QEMU more regularly than a legacy assigned device.
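
[For reference, the table/PBA layout described above can be checked on
any device by reading its MSI-X capability with lspci. A minimal
sketch; the BDF and offsets here are placeholders, not values taken
from this system:

  $ sudo lspci -s 0000:41:00.0 -vvv | grep -A2 MSI-X
          Capabilities: [70] MSI-X: Enable+ Count=32 Masked-
                  Vector table: BAR=0 offset=00002000
                  PBA: BAR=0 offset=00003000

If the vector table and PBA offsets land on different 4k pages
(offset & ~0xfff differs), nothing else shares the PBA page, and
giving the guest direct access to it, as the patch below allows,
should be safe.]
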
>
> If I could ask you to build and run a new QEMU, I think we can easily
> test this hypothesis by making vfio-pci behave more like pci-assign.
> The following patch is based on QEMU 2.5 and simply skips the step of
> placing the PBA memory region overlapping the device, allowing direct
> access in this case. The patch is easily adaptable to older versions
> of QEMU, but if we need to do any further tracing, it's probably best
> to do so on 2.5 anyway. This is only a proof of concept; if it proves
> to be the culprit, we'll need to think about how to handle it more
> cleanly. Here's the patch:
>
> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
> index 64c93d8..a5ad18c 100644
> --- a/hw/pci/msix.c
> +++ b/hw/pci/msix.c
> @@ -291,7 +291,7 @@ int msix_init(struct PCIDevice *dev, unsigned short nentries,
>      memory_region_add_subregion(table_bar, table_offset, &dev->msix_table_mmio);
>      memory_region_init_io(&dev->msix_pba_mmio, OBJECT(dev), &msix_pba_mmio_ops, dev,
>                            "msix-pba", pba_size);
> -    memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio);
> +    /* memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio); */
>
>      return 0;
>  }
> @@ -369,7 +369,7 @@ void msix_uninit(PCIDevice *dev, MemoryRegion *table_bar, MemoryRegion *pba_bar)
>      dev->msix_cap = 0;
>      msix_free_irq_entries(dev);
>      dev->msix_entries_nr = 0;
> -    memory_region_del_subregion(pba_bar, &dev->msix_pba_mmio);
> +    /* memory_region_del_subregion(pba_bar, &dev->msix_pba_mmio); */
>      g_free(dev->msix_pba);
>      dev->msix_pba = NULL;
>      memory_region_del_subregion(table_bar, &dev->msix_table_mmio);
>
> Thanks,
> Alex
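
[For anyone wanting to repeat the experiment, the rough sequence is:
apply the patch to a QEMU 2.5 tree, build it, start the guest with the
freshly built binary, and re-run the same workload. A minimal sketch;
the patch filename, paths, and fio parameters are assumptions
reconstructed from the thread, not the exact commands used here:

  $ cd qemu                                  # a checkout of QEMU 2.5 / master
  $ patch -p1 < ~/msix-skip-pba.patch        # the proof-of-concept patch above
  $ ./configure --target-list=x86_64-softmmu
  $ make -j$(nproc)
  # then launch the guest with ./x86_64-softmmu/qemu-system-x86_64

On the external initiator, something like the following fio job
approximates the 4-thread, 64-deep, 100% 4K random-read workload
described above (the device path is a placeholder for one of the
multipath devices):

  $ fio --name=randread --filename=/dev/dm-0 --ioengine=libaio --direct=1 \
        --rw=randread --bs=4k --iodepth=64 --numjobs=4 \
        --time_based --runtime=60 --group_reporting]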