Hi Jean-Philippe,

A few comments below...

Jean-Philippe Brucker <jean-philippe.brucker@xxxxxxx> writes:

> Add virtual MSI-X tables for PCI devices, and create IRQFD routes to let
> the kernel inject MSIs from a physical device directly into the guest.
>
> It would be tempting to create the MSI routes at init time before starting
> vCPUs, when we can afford to exit gracefully. But some of it must be
> initialized when the guest requests it.
>
> * On the KVM side, MSIs must be enabled after devices allocate their IRQ
>   lines and irqchips are operational, which can happen until late_init.
>
> * On the VFIO side, hardware state of devices may be updated when setting
>   up MSIs. For example, when passing a virtio-pci-legacy device to the
>   guest:
>
>   (1) The device-specific configuration layout (in BAR0) depends on
>       whether MSIs are enabled or not in the device. If they are enabled,
>       the device-specific configuration starts at offset 24, otherwise it
>       starts at offset 20.
>   (2) Linux guest assumes that MSIs are initially disabled (doesn't
>       actually check the capability). So it reads the device config at
>       offset 20.
>   (3) Had we enabled MSIs early, host would have enabled the MSI-X
>       capability and device would return the config at offset 24.
>   (4) The guest would read junk and explode.
>
> Therefore we have to create MSI-X routes when the guest requests MSIs, and
> enable/disable them in VFIO when the guest pokes the MSI-X capability. We
> have to follow both physical and virtual state of the capability, which
> makes the state machine a bit complex, but I think it works.
>
> An important missing feature is the absence of pending MSI handling. When
> a vector or the function is masked, we should rewire the IRQFD to a
> special thread that keeps note of pending interrupts (or just poll the
> IRQFD before recreating the route?). And when the vector is unmasked, one
> MSI should be injected if it was pending. At the moment no MSI is
> injected, we simply disconnect the IRQFD and all messages are lost.
>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@xxxxxxx>
>
> ---
> v3->v4: fix capabilities
> ---
>  include/kvm/vfio.h |  52 +++++
>  vfio/pci.c         | 642 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 682 insertions(+), 12 deletions(-)
>

[...]

> diff --git a/vfio/pci.c b/vfio/pci.c
> index adf6f03a9f0a..ca8ed53265a1 100644
> --- a/vfio/pci.c
> +++ b/vfio/pci.c

[...]

> +static int vfio_pci_disable_msis(struct kvm *kvm, struct vfio_device *vdev)
> +{
> +	int ret;
> +	struct vfio_pci_device *pdev = &vdev->pci;
> +	struct vfio_pci_msi_common *msis = &pdev->msix;
> +	struct vfio_irq_set irq_set = {
> +		.argsz	= sizeof(irq_set),
> +		.flags	= VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
> +		.index	= msis->info.index,
> +		.start	= 0,
> +		.count	= 0,
> +	};
> +
> +	if (!msi_is_enabled(msis->phys_state) ||
> +	    msi_is_enabled(msis->virt_state))

Is the check for virt_state really needed? It's not clear why it
matters - please add a comment to save some future head scratching.

> +		return 0;
> +
> +	ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> +	if (ret < 0) {
> +		perror("VFIO_DEVICE_SET_IRQS(NONE)");
> +		return ret;
> +	}
> +
> +	msi_set_enabled(msis->phys_state, false);
> +	msi_set_empty(msis->phys_state, true);
> +
> +	return 0;
> +}
> +

[...]
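
On the pending MSI handling mentioned in the commit message - for the
"poll the IRQFD before recreating the route" variant, would something
along these lines work? Completely untested sketch; I'm assuming each
MSI entry keeps its trigger eventfd around as entry->eventfd, which may
not match your structures:

	#include <poll.h>

	/*
	 * Check whether the device has triggered this vector since the
	 * route was last drained. An eventfd becomes readable when its
	 * counter is non-zero, so a zero-timeout poll() tells us whether
	 * an MSI is pending without consuming it.
	 */
	static bool vfio_pci_msi_pending(struct vfio_pci_msi_entry *entry)
	{
		struct pollfd pfd = {
			.fd	= entry->eventfd,
			.events	= POLLIN,
		};

		return poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN);
	}

The disable/mask path could then record the result and inject one MSI
when the vector is unmasked, instead of dropping it.
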
> +static void vfio_pci_msix_table_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
> +				       u32 len, u8 is_write, void *ptr)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	struct vfio_pci_msi_entry *entry;
> +	struct vfio_pci_device *pdev = ptr;
> +	struct vfio_device *vdev = container_of(pdev, struct vfio_device, pci);
> +
> +	u64 offset = addr - pdev->msix_table.guest_phys_addr;
> +
> +	size_t vector = offset / PCI_MSIX_ENTRY_SIZE;
> +	/* PCI spec says that software must use aligned 4 or 8 bytes accesses */
> +	off_t field = offset % PCI_MSIX_ENTRY_SIZE;

Should the 4/8 byte access assumption be checked here? Or is it
something that's enforced before we get to this point? (There's a rough
sketch of the kind of check I mean at the end of this mail.)

> +	entry = &pdev->msix.entries[vector];
> +
> +	mutex_lock(&pdev->msix.mutex);
> +
> +	if (!is_write) {
> +		memcpy(data, (void *)&entry->config + field, len);
> +		goto out_unlock;
> +	}
> +
> +	memcpy((void *)&entry->config + field, data, len);
> +
> +	if (field != PCI_MSIX_ENTRY_VECTOR_CTRL)
> +		goto out_unlock;

Could an 8-byte access modify the vector control field?

> +
> +	msi_set_masked(entry->virt_state, entry->config.ctrl &
> +		       PCI_MSIX_ENTRY_CTRL_MASKBIT);
> +
> +	if (vfio_pci_update_msi_entry(kvm, vdev, entry) < 0)
> +		/* Not much we can do here. */
> +		dev_err(vdev, "failed to configure MSIX vector %zu", vector);
> +
> +	/* Update the physical capability if necessary */
> +	if (vfio_pci_enable_msis(kvm, vdev))
> +		dev_err(vdev, "cannot enable MSIX");
> +
> +out_unlock:
> +	mutex_unlock(&pdev->msix.mutex);
> +}
> +

[...]

>
> +static int vfio_pci_create_msix_table(struct kvm *kvm,
> +				      struct vfio_pci_device *pdev)
> +{
> +	int ret;
> +	size_t i;
> +	size_t nr_entries;
> +	size_t table_size;
> +	struct vfio_pci_msi_entry *entries;
> +	struct vfio_pci_msix_pba *pba = &pdev->msix_pba;
> +	struct vfio_pci_msix_table *table = &pdev->msix_table;
> +	struct msix_cap *msix = PCI_CAP(&pdev->hdr, pdev->msix.pos);
> +
> +	table->bar = msix->table_offset & PCI_MSIX_TABLE_BIR;
> +	pba->bar = msix->pba_offset & PCI_MSIX_TABLE_BIR;
> +
> +	/*
> +	 * KVM needs memory regions to be multiple of and aligned on PAGE_SIZE.
> +	 */
> +	nr_entries = (msix->ctrl & PCI_MSIX_FLAGS_QSIZE) + 1;
> +	table_size = ALIGN(nr_entries * PCI_MSIX_ENTRY_SIZE, PAGE_SIZE);
> +
> +	entries = calloc(nr_entries, sizeof(struct vfio_pci_msi_entry));
> +	if (!entries)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < nr_entries; i++)
> +		entries[i].config.ctrl = PCI_MSIX_ENTRY_CTRL_MASKBIT;
> +
> +	/*
> +	 * To ease MSI-X cap configuration in case they share the same BAR,
> +	 * collapse table and pending array. According to PCI, address spaces
> +	 * must be power of two. Since nr_entries is a power of two, and PBA
> +	 * size is less than table_size, reserve 2*table_size.
> +	 */

As I understand it, nr_entries satisfies 0x0 < nr_entries <= 0x800 but
does not have to be a power of two. Also, it's not clear why "address
spaces must be a power of two" is relevant here. What am I missing?

Thanks,
Punit

> +	table->guest_phys_addr = pci_get_io_space_block(2 * table_size);
> +	if (!table->guest_phys_addr) {
> +		pr_err("cannot allocate IO space");
> +		ret = -ENOMEM;
> +		goto out_free;
> +	}
> +	pba->guest_phys_addr = table->guest_phys_addr + table_size;
> +
> +	ret = kvm__register_mmio(kvm, table->guest_phys_addr, table_size, false,
> +				 vfio_pci_msix_table_access, pdev);
> +	if (ret < 0)
> +		goto out_free;
> +
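
Following up on the sizing question above: if nr_entries doesn't have
to be a power of two, I would have naively expected something like the
below to be enough (rough, untested sketch reusing the names from this
patch, and assuming a DIV_ROUND_UP helper is available; the PBA is one
bit per vector, read as QWORDs):

	pba_size = DIV_ROUND_UP(nr_entries, 64) * 8;
	mmio_size = ALIGN(table_size + pba_size, PAGE_SIZE);

	table->guest_phys_addr = pci_get_io_space_block(mmio_size);
	...
	pba->guest_phys_addr = table->guest_phys_addr + table_size;

But maybe the 2*table_size reservation really is needed for the BAR
layout, in which case a comment explaining that would help.
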
> +	/*
> +	 * We could map the physical PBA directly into the guest, but it's
> +	 * likely smaller than a page, and we can only hand full pages to the
> +	 * guest. Even though the PCI spec disallows sharing a page used for
> +	 * MSI-X with any other resource, it allows to share the same page
> +	 * between MSI-X table and PBA. For the sake of isolation, create a
> +	 * virtual PBA.
> +	 */
> +	pba->size = nr_entries / 8;
> +
> +	ret = kvm__register_mmio(kvm, pba->guest_phys_addr, pba->size, false,
> +				 vfio_pci_msix_pba_access, pdev);
> +	if (ret < 0)
> +		goto out_free;
> +
> +	pdev->msix.entries = entries;
> +	pdev->msix.nr_entries = nr_entries;
> +	table->size = table_size;
> +
> +	return 0;
> +
> +out_free:
> +	free(entries);
> +
> +	return ret;
> +}

[...]
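
One more note on the access size question from earlier: this is roughly
the check I had in mind at the top of vfio_pci_msix_table_access()
(untested, and the error handling is just a placeholder):

	/* PCI requires aligned 4- or 8-byte accesses to the MSI-X table */
	if ((len != 4 && len != 8) || (offset & (len - 1))) {
		dev_err(vdev, "invalid MSI-X table access: %u bytes at 0x%llx",
			len, (unsigned long long)addr);
		return;
	}

With 8-byte accesses allowed, the "field != PCI_MSIX_ENTRY_VECTOR_CTRL"
test would presumably also need to become something like
"field + len <= PCI_MSIX_ENTRY_VECTOR_CTRL", so that a qword write
covering both the message data and vector control words still updates
the mask state.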