Punit Agrawal <punit.agrawal@xxxxxxx> writes: > Hi Jean-Philippe, > > I ran into an issue while testing these series on a Seattle based > system. More below... > > Jean-Philippe Brucker <jean-philippe.brucker@xxxxxxx> writes: > >> Add virtual MSI-X tables for PCI devices, and create IRQFD routes to let >> the kernel inject MSIs from a physical device directly into the guest. >> >> It would be tempting to create the MSI routes at init time before starting >> vCPUs, when we can afford to exit gracefully. But some of it must be >> initialized when the guest requests it. >> >> * On the KVM side, MSIs must be enabled after devices allocate their IRQ >> lines and irqchips are operational, which can happen until late_init. >> >> * On the VFIO side, hardware state of devices may be updated when setting >> up MSIs. For example, when passing a virtio-pci-legacy device to the >> guest: >> >> (1) The device-specific configuration layout (in BAR0) depends on >> whether MSIs are enabled or not in the device. If they are enabled, >> the device-specific configuration starts at offset 24, otherwise it >> starts at offset 20. >> (2) Linux guest assumes that MSIs are initially disabled (doesn't >> actually check the capability). So it reads the device config at >> offset 20. >> (3) Had we enabled MSIs early, host would have enabled the MSI-X >> capability and device would return the config at offset 24. >> (4) The guest would read junk and explode. >> >> Therefore we have to create MSI-X routes when the guest requests MSIs, and >> enable/disable them in VFIO when the guest pokes the MSI-X capability. We >> have to follow both physical and virtual state of the capability, which >> makes the state machine a bit complex, but I think it works. >> >> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@xxxxxxx> >> --- >> include/kvm/vfio.h | 54 +++++ >> vfio/pci.c | 619 ++++++++++++++++++++++++++++++++++++++++++++++++++++- >> 2 files changed, 664 insertions(+), 9 deletions(-) >> > > [...] > >> +static ssize_t vfio_pci_cap_size(struct pci_cap_hdr *cap_hdr) >> +{ >> + switch (cap_hdr->type) { >> + case PCI_CAP_ID_MSIX: >> + return PCI_CAP_MSIX_SIZEOF; >> + default: >> + pr_err("unknown PCI capability 0x%x", cap_hdr->type); >> + return 0; >> + } >> +} >> + >> +/* >> + * Copy capability from physical header into virtual header, and add it to the >> + * virtual capability list. >> + * >> + * @fd_offset: offset of pci header into vfio device fd >> + * @pos: offset of capability from start of header >> + */ >> +static int vfio_pci_add_cap(struct vfio_device *vdev, struct pci_cap_hdr *cap_hdr, >> + off_t fd_offset, off_t pos) >> +{ >> + int i; >> + ssize_t size = vfio_pci_cap_size(cap_hdr); >> + struct pci_device_header *hdr = &vdev->pci.hdr; >> + struct pci_cap_hdr *out = (void *)hdr + pos; >> + >> + if (pread(vdev->fd, out, size, fd_offset + pos) != size) >> + return -errno; >> + >> + out->next = 0; >> + >> + if (!hdr->capabilities) { >> + hdr->capabilities = pos; >> + hdr->status |= PCI_STATUS_CAP_LIST; >> + } else { >> + /* Add cap at end of list */ >> + struct pci_cap_hdr *last; >> + >> + pci_for_each_cap(i, last, hdr) >> + ; >> + last->next = pos; > > Setting the next capability is somehow overwriting the vendor id in the > virtual header. This leads to the device not being probed by the > appropriate driver. > > I did a quick hack to prevent the vendor_id override and was able to > proceed with probing the device. But there are still some issues with > setting up the capabilities. > > Can you take a look to see what's going wrong? I think I've figured out what's going wrong. pci_for_each_cap is defined as - #define pci_for_each_cap(pos, cap, hdr) \ for ((pos) = (hdr)->capabilities & ~3; \ (cap) = (void *)(hdr) + (pos), (pos) != 0; \ (pos) = ((struct pci_cap_hdr *)(cap))->next & ~3) Here cap is being assigned before the pos != 0 check. So in the last iteration through the loop, cap will point to the start of the PCI header - which is then used to set "last->next = pos". Below change got me the right vendor id and the driver was able to probe the device. But I don't seem to be getting any interrupts. I'll have a look to see if I can figure out what's going wrong. diff --git a/vfio/pci.c b/vfio/pci.c index 9e45d30..ad03325 100644 --- a/vfio/pci.c +++ b/vfio/pci.c @@ -489,7 +489,6 @@ static ssize_t vfio_pci_cap_size(struct pci_cap_hdr *cap_hdr) static int vfio_pci_add_cap(struct vfio_device *vdev, struct pci_cap_hdr *cap_hdr, off_t fd_offset, off_t pos) { - int i; ssize_t size = vfio_pci_cap_size(cap_hdr); struct pci_device_header *hdr = &vdev->pci.hdr; struct pci_cap_hdr *out = (void *)hdr + pos; @@ -505,9 +504,12 @@ static int vfio_pci_add_cap(struct vfio_device *vdev, struct pci_cap_hdr *cap_hd } else { /* Add cap at end of list */ struct pci_cap_hdr *last; + int i = hdr->capabilities & ~3; - pci_for_each_cap(i, last, hdr) - ; + do { + last = (void *)hdr + i; + i = last->next; + } while (i); last->next = pos; } > > Thanks, > Punit > >> + } >> + >> + return 0; >> +} >> + >> static int vfio_pci_parse_caps(struct vfio_device *vdev) >> { >> + u8 pos; >> + int ret; >> + struct pci_cap_hdr cap; >> + ssize_t sz = sizeof(cap); >> + struct vfio_region_info *info; >> struct vfio_pci_device *pdev = &vdev->pci; >> >> if (!(pdev->hdr.status & PCI_STATUS_CAP_LIST)) >> return 0; >> >> + pos = pdev->hdr.capabilities & ~3; >> + info = &vdev->regions[VFIO_PCI_CONFIG_REGION_INDEX].info; >> + >> pdev->hdr.status &= ~PCI_STATUS_CAP_LIST; >> pdev->hdr.capabilities = 0; >> >> - /* TODO: install virtual capabilities */ >> + for (; pos; pos = cap.next) { >> + if (pos >= PCI_DEV_CFG_SIZE) { >> + dev_warn(vdev, "ignoring cap outside of config space"); >> + return -EINVAL; >> + } >> + >> + if (pread(vdev->fd, &cap, sz, info->offset + pos) != sz) { >> + dev_warn(vdev, "failed to read from capabilities pointer (0x%x)", >> + pos); >> + return -EINVAL; >> + } >> + >> + switch (cap.type) { >> + case PCI_CAP_ID_MSIX: >> + ret = vfio_pci_add_cap(vdev, &cap, info->offset, pos); >> + if (ret) { >> + dev_warn(vdev, "failed to read capability structure %x", >> + cap.type); >> + return ret; >> + } >> + >> + pdev->msi.pos = pos; >> + pdev->irq_type = VFIO_PCI_IRQ_MSIX; >> + break; >> + >> + /* Any other capability is hidden */ >> + } >> + } >> >> return 0; >> } > > [...]