Re: KVM pci-assign - iommu width is not sufficient for mapped address

Shyam <shyam.kaushik@xxxxxxxxx> · Mon, 11 Jan 2016 15:41:35 +0530

Hi Alex,

You are spot on!

Applying your patch on QEMU 2.5.50 (latest from github master) solves
the performance issue fully. We are able to get back to pci-assign
performance numbers. Great!

Can you please see how to formalize this patch cleanly? I will be
happy to test additional patches for you. Thanks a lot for your help!

--Shyam

On Sat, Jan 9, 2016 at 12:22 AM, Alex Williamson
<alex.williamson@xxxxxxxxxx> wrote:
> On Fri, 2016-01-08 at 12:22 +0530, Shyam wrote:
>> Hi Alex,
>>
>> It will be hard to reproduce this on Fedora/RHEL. We have an
>> Ubuntu-based server/VM & I can shift to any kernel/qemu/vfio versions
>> that you recommend.
>>
>> Both our Host & Guest run Ubuntu Trusty (Ubuntu 14.04.3 LTS) with
>> Linux Kernel version 3.18.19 (from
>> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.18.19-vivid/).
>>
>> QEMU version on the host is
>> QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.21),
>> Copyright (c) 2003-2008 Fabrice Bellard
>>
>> We are using 8 x Intel RMS3CC080 SSDs for this test. We expose these
>> SSDs to the VM (through iSER) & then set up dm-stripe over them within
>> the VM. We create two dm-linear devices of 100GB each out of this &
>> expose them through SCST to an external server. The external server
>> connects to these devices over iSER & has multipath with 4 paths
>> (policy: queue-length:0) per device. From the external server we run
>> fio with 4 threads, each with 64 outstanding IOs of 100% 4K
>> random reads.
>>
>> This is the performance difference we see
>>
>> with PCI-assign to the VM
>> randrw 100:0 64iodepth 4thr 4kb - R: 550,224K wait_us:2,245 cpu
>> tot:85.57 usr:3.96 sys:31.55 iow:50.06
>>
>> i.e. we get 137-140K IOPS or 550MB/s
>>
>> with VFIO to the VM
>> randrw 100:0 64iodepth 4thr 4kb - R: 309,432K wait_us:3,964 cpu
>> tot:78.58 usr:2.28 sys:18.00 iow:58.30
>>
>> i.e. we get 77-80K IOPS or 310MB/s
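
[The IOPS figures above follow directly from the fio bandwidth numbers.
A minimal sketch of that arithmetic, assuming the "R:" values are fio
read bandwidth in KB/s (consistent with the 550MB/s and 310MB/s quoted)
and a fixed 4K block size. The function names are hypothetical helpers,
not part of fio or QEMU.]

```c
#include <assert.h>
#include <stdint.h>

/* IOPS implied by a throughput in KB/s at a fixed block size in KB. */
static uint64_t iops_from_kbps(uint64_t kb_per_s, uint64_t block_kb)
{
    assert(block_kb > 0);
    return kb_per_s / block_kb;
}

/* Relative throughput drop, in percent, going from `before` to `after`. */
static double drop_percent(uint64_t before, uint64_t after)
{
    assert(before > 0);
    return 100.0 * ((double)before - (double)after) / (double)before;
}
```

[With the numbers from the two runs, iops_from_kbps(550224, 4) gives
137556 and iops_from_kbps(309432, 4) gives 77358, and
drop_percent(550224, 309432) comes out at roughly 44%.]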
>>
>> The only change between the two runs is to have a VM that is spawned
>> with VFIO instead of pci-assign. There is no other difference in
>> software versions or any settings.
>>
>> $ grep VFIO /boot/config-`uname -r`
>> CONFIG_VFIO_IOMMU_TYPE1=m
>> CONFIG_VFIO=m
>> CONFIG_VFIO_PCI=m
>> CONFIG_VFIO_PCI_VGA=y
>> CONFIG_KVM_VFIO=y
>>
>> I uploaded QEMU command-line & lspci outputs at
>> https://www.dropbox.com/s/imbqn0274i6hhnz/vfio_issue.tgz?dl=0
>>
>> Please let me know if you have any issues downloading it.
>>
>> Please let us know if you see that any KVM acceleration is disabled,
>> and suggest next steps to debug with VFIO tracing. Thanks for your
>> help!
>
> Thanks for the logs, everything appears to be set up correctly.  One
> suspicion I have is the difference between pci-assign and vfio-pci in
> the way the MSI-X Pending Bits Array (PBA) is handled.  Legacy KVM
> device assignment handles MSI-X itself and ignores the PBA.  On this
> hardware the MSI-X vector table and PBA are nicely aligned on separate
> 4k pages, which means that pci-assign will give the VM direct access to
> everything on the PBA page.  On the other hand, vfio-pci registers
> MSI-X with QEMU, which does handle the PBA.  The vast majority of drivers
> never use the PBA and the PCI spec includes an implementation note
> suggesting that hardware vendors include additional alignment to
> prevent MSI-X structures from overlapping with other registers.  My
> hypothesis is that this device perhaps does not abide by that
> recommendation and may be regularly accessing the PBA page, thus
> causing a vfio-pci assigned device to trap through to QEMU more
> regularly than a legacy assigned device.
>
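
[To make the layout question above concrete, here is a minimal sketch of
how the MSI-X capability registers locate the vector table and PBA, and
how much of the PBA's page is left over for other device registers. The
field layout (Message Control bits 10:0 = table size minus one; bits 2:0
of the Table/PBA Offset registers = BIR; 16 bytes per table entry; one
pending bit per vector, rounded up to 64 bits) is from the PCI spec. The
helper names, the 4 KiB page size, and the premise that an emulated
region makes accesses to the rest of its page trap as well are
assumptions for illustration; this is not QEMU code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_4K 4096u

/* One MSI-X structure (vector table or PBA) located inside a BAR. */
struct msix_region {
    unsigned bir;     /* BAR Indicator Register: which BAR holds it */
    uint32_t offset;  /* byte offset within that BAR (8-byte aligned) */
    uint32_t size;    /* size in bytes */
};

/* Message Control bits 10:0 encode the number of vectors minus one. */
static unsigned msix_nr_vectors(uint16_t msg_ctrl)
{
    return (msg_ctrl & 0x07ffu) + 1u;
}

/* 16 bytes per vector table entry. */
static uint32_t msix_table_bytes(unsigned n) { return n * 16u; }

/* One pending bit per vector, rounded up to whole 64-bit words. */
static uint32_t msix_pba_bytes(unsigned n) { return ((n + 63u) / 64u) * 8u; }

/* Decode a Table Offset/BIR or PBA Offset/BIR register value. */
static struct msix_region msix_decode(uint32_t reg, uint32_t size)
{
    struct msix_region r = { reg & 0x7u, reg & ~0x7u, size };
    return r;
}

/* True if both structures live in the same BAR and touch a common page. */
static bool msix_share_page(struct msix_region a, struct msix_region b)
{
    assert(a.size > 0 && b.size > 0);
    if (a.bir != b.bir)
        return false;
    uint32_t a_first = a.offset / PAGE_SIZE_4K;
    uint32_t a_last = (a.offset + a.size - 1u) / PAGE_SIZE_4K;
    uint32_t b_first = b.offset / PAGE_SIZE_4K;
    uint32_t b_last = (b.offset + b.size - 1u) / PAGE_SIZE_4K;
    return a_first <= b_last && b_first <= a_last;
}

/* Bytes in the page(s) spanned by `r` that are not part of `r` itself. */
static uint32_t msix_other_bytes_in_pages(struct msix_region r)
{
    assert(r.size > 0);
    uint32_t first = r.offset / PAGE_SIZE_4K;
    uint32_t last = (r.offset + r.size - 1u) / PAGE_SIZE_4K;
    return (last - first + 1u) * PAGE_SIZE_4K - r.size;
}
```

[For a hypothetical device with 16 vectors, the table at BAR0+0x2000 and
the PBA at BAR0+0x3000, the PBA is only 8 bytes, so
msix_other_bytes_in_pages() reports 4088 bytes of that page that could
hold unrelated registers. Under the stated assumption, a driver touching
any of them would take the slow path with an emulated PBA.]
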
> If I could ask you to build and run a new QEMU, I think we can easily
> test this hypothesis by making vfio-pci behave more like pci-assign.
> The following patch is based on QEMU 2.5 and simply skips the step of
> placing the PBA memory region overlapping the device, allowing direct
> access in this case.  The patch is easily adaptable to older versions
> of QEMU, but if we need to do any further tracing, it's probably best
> to do so on 2.5 anyway.  This is only a proof of concept; if it proves
> to be the culprit we'll need to think about how to handle it more
> cleanly.  Here's the patch:
>
> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
> index 64c93d8..a5ad18c 100644
> --- a/hw/pci/msix.c
> +++ b/hw/pci/msix.c
> @@ -291,7 +291,7 @@ int msix_init(struct PCIDevice *dev, unsigned short nentries,
>      memory_region_add_subregion(table_bar, table_offset, &dev->msix_table_mmio);
>      memory_region_init_io(&dev->msix_pba_mmio, OBJECT(dev), &msix_pba_mmio_ops, dev,
>                            "msix-pba", pba_size);
> -    memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio);
> +    /* memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio); */
>
>      return 0;
>  }
> @@ -369,7 +369,7 @@ void msix_uninit(PCIDevice *dev, MemoryRegion *table_bar, MemoryRegion *pba_bar)
>      dev->msix_cap = 0;
>      msix_free_irq_entries(dev);
>      dev->msix_entries_nr = 0;
> -    memory_region_del_subregion(pba_bar, &dev->msix_pba_mmio);
> +    /* memory_region_del_subregion(pba_bar, &dev->msix_pba_mmio); */
>      g_free(dev->msix_pba);
>      dev->msix_pba = NULL;
>      memory_region_del_subregion(table_bar, &dev->msix_table_mmio);
>
> Thanks,
> Alex