Re: [PATCH v2 kvmtool 07/10] vfio-pci: add MSI-X support

Punit Agrawal <punit.agrawal@xxxxxxx> · Tue, 01 Aug 2017 17:04:14 +0100

Punit Agrawal <punit.agrawal@xxxxxxx> writes:

> Hi Jean-Philippe,
>
> I ran into an issue while testing these series on a Seattle based
> system. More below...
>
> Jean-Philippe Brucker <jean-philippe.brucker@xxxxxxx> writes:
>
>> Add virtual MSI-X tables for PCI devices, and create IRQFD routes to let
>> the kernel inject MSIs from a physical device directly into the guest.
>>
>> It would be tempting to create the MSI routes at init time before starting
>> vCPUs, when we can afford to exit gracefully. But some of it must be
>> initialized when the guest requests it.
>>
>> * On the KVM side, MSIs must be enabled after devices allocate their IRQ
>>   lines and irqchips are operational, which can happen until late_init.
>>
>> * On the VFIO side, hardware state of devices may be updated when setting
>>   up MSIs. For example, when passing a virtio-pci-legacy device to the
>>   guest:
>>
>>   (1) The device-specific configuration layout (in BAR0) depends on
>>       whether MSIs are enabled or not in the device. If they are enabled,
>>       the device-specific configuration starts at offset 24, otherwise it
>>       starts at offset 20.
>>   (2) Linux guest assumes that MSIs are initially disabled (doesn't
>>       actually check the capability). So it reads the device config at
>>       offset 20.
>>   (3) Had we enabled MSIs early, host would have enabled the MSI-X
>>       capability and device would return the config at offset 24.
>>   (4) The guest would read junk and explode.
>>
>> Therefore we have to create MSI-X routes when the guest requests MSIs, and
>> enable/disable them in VFIO when the guest pokes the MSI-X capability. We
>> have to follow both physical and virtual state of the capability, which
>> makes the state machine a bit complex, but I think it works.
>>
>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@xxxxxxx>
>> ---
>>  include/kvm/vfio.h |  54 +++++
>>  vfio/pci.c         | 619 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 664 insertions(+), 9 deletions(-)
>>
>
> [...]
>
>> +static ssize_t vfio_pci_cap_size(struct pci_cap_hdr *cap_hdr)
>> +{
>> +	switch (cap_hdr->type) {
>> +	case PCI_CAP_ID_MSIX:
>> +		return PCI_CAP_MSIX_SIZEOF;
>> +	default:
>> +		pr_err("unknown PCI capability 0x%x", cap_hdr->type);
>> +		return 0;
>> +	}
>> +}
>> +
>> +/*
>> + * Copy capability from physical header into virtual header, and add it to the
>> + * virtual capability list.
>> + *
>> + * @fd_offset: offset of pci header into vfio device fd
>> + * @pos: offset of capability from start of header
>> + */
>> +static int vfio_pci_add_cap(struct vfio_device *vdev, struct pci_cap_hdr *cap_hdr,
>> +			    off_t fd_offset, off_t pos)
>> +{
>> +	int i;
>> +	ssize_t size = vfio_pci_cap_size(cap_hdr);
>> +	struct pci_device_header *hdr = &vdev->pci.hdr;
>> +	struct pci_cap_hdr *out = (void *)hdr + pos;
>> +
>> +	if (pread(vdev->fd, out, size, fd_offset + pos) != size)
>> +		return -errno;
>> +
>> +	out->next = 0;
>> +
>> +	if (!hdr->capabilities) {
>> +		hdr->capabilities = pos;
>> +		hdr->status |= PCI_STATUS_CAP_LIST;
>> +	} else {
>> +		/* Add cap at end of list */
>> +		struct pci_cap_hdr *last;
>> +
>> +		pci_for_each_cap(i, last, hdr)
>> +			;
>> +		last->next = pos;
>
> Setting the next capability is somehow overwriting the vendor id in the
> virtual header. This leads to the device not being probed by the
> appropriate driver.
>
> I did a quick hack to prevent the vendor_id override and was able to
> proceed with probing the device. But there are still some issues with
> setting up the capabilities.
>
> Can you take a look to see what's going wrong?

I think I've figured out what's going wrong. pci_for_each_cap is defined
as - 

#define pci_for_each_cap(pos, cap, hdr)                         \
        for ((pos) = (hdr)->capabilities & ~3;                  \
             (cap) = (void *)(hdr) + (pos), (pos) != 0;         \
             (pos) = ((struct pci_cap_hdr *)(cap))->next & ~3)

Here cap is being assigned before the pos != 0 check. So in the last
iteration through the loop, cap will point to the start of the PCI
header - which is then used to set "last->next = pos". Below change got
me the right vendor id and the driver was able to probe the device. But
I don't seem to be getting any interrupts. I'll have a look to see if I
can figure out what's going wrong.

diff --git a/vfio/pci.c b/vfio/pci.c
index 9e45d30..ad03325 100644
--- a/vfio/pci.c
+++ b/vfio/pci.c
@@ -489,7 +489,6 @@ static ssize_t vfio_pci_cap_size(struct pci_cap_hdr *cap_hdr)
 static int vfio_pci_add_cap(struct vfio_device *vdev, struct pci_cap_hdr *cap_hdr,
 			    off_t fd_offset, off_t pos)
 {
-	int i;
 	ssize_t size = vfio_pci_cap_size(cap_hdr);
 	struct pci_device_header *hdr = &vdev->pci.hdr;
 	struct pci_cap_hdr *out = (void *)hdr + pos;
@@ -505,9 +504,12 @@ static int vfio_pci_add_cap(struct vfio_device *vdev, struct pci_cap_hdr *cap_hd
 	} else {
 		/* Add cap at end of list */
 		struct pci_cap_hdr *last;
+		int i = hdr->capabilities & ~3;
 
-		pci_for_each_cap(i, last, hdr)
-			;
+		do {
+			last = (void *)hdr + i;
+			i = last->next;
+		} while (i);
 		last->next = pos;
 	}

>
> Thanks,
> Punit
>
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>  static int vfio_pci_parse_caps(struct vfio_device *vdev)
>>  {
>> +	u8 pos;
>> +	int ret;
>> +	struct pci_cap_hdr cap;
>> +	ssize_t sz = sizeof(cap);
>> +	struct vfio_region_info *info;
>>  	struct vfio_pci_device *pdev = &vdev->pci;
>>  
>>  	if (!(pdev->hdr.status & PCI_STATUS_CAP_LIST))
>>  		return 0;
>>  
>> +	pos = pdev->hdr.capabilities & ~3;
>> +	info = &vdev->regions[VFIO_PCI_CONFIG_REGION_INDEX].info;
>> +
>>  	pdev->hdr.status &= ~PCI_STATUS_CAP_LIST;
>>  	pdev->hdr.capabilities = 0;
>>  
>> -	/* TODO: install virtual capabilities */
>> +	for (; pos; pos = cap.next) {
>> +		if (pos >= PCI_DEV_CFG_SIZE) {
>> +			dev_warn(vdev, "ignoring cap outside of config space");
>> +			return -EINVAL;
>> +		}
>> +
>> +		if (pread(vdev->fd, &cap, sz, info->offset + pos) != sz) {
>> +			dev_warn(vdev, "failed to read from capabilities pointer (0x%x)",
>> +				 pos);
>> +			return -EINVAL;
>> +		}
>> +
>> +		switch (cap.type) {
>> +		case PCI_CAP_ID_MSIX:
>> +			ret = vfio_pci_add_cap(vdev, &cap, info->offset, pos);
>> +			if (ret) {
>> +				dev_warn(vdev, "failed to read capability structure %x",
>> +					 cap.type);
>> +				return ret;
>> +			}
>> +
>> +			pdev->msi.pos = pos;
>> +			pdev->irq_type = VFIO_PCI_IRQ_MSIX;
>> +			break;
>> +
>> +			/* Any other capability is hidden */
>> +		}
>> +	}
>>  
>>  	return 0;
>>  }
>
> [...]