Re: [PATCH v4 kvmtool 08/12] vfio-pci: add MSI-X support

Punit Agrawal <punit.agrawal@xxxxxxx> · Fri, 09 Mar 2018 16:53:41 +0000

Hi Jean-Philippe,

A few comments below...

Jean-Philippe Brucker <jean-philippe.brucker@xxxxxxx> writes:

> Add virtual MSI-X tables for PCI devices, and create IRQFD routes to let
> the kernel inject MSIs from a physical device directly into the guest.
>
> It would be tempting to create the MSI routes at init time before starting
> vCPUs, when we can afford to exit gracefully. But some of it must be
> initialized when the guest requests it.
>
> * On the KVM side, MSIs must be enabled after devices allocate their IRQ
>   lines and irqchips are operational, which can happen until late_init.
>
> * On the VFIO side, hardware state of devices may be updated when setting
>   up MSIs. For example, when passing a virtio-pci-legacy device to the
>   guest:
>
>   (1) The device-specific configuration layout (in BAR0) depends on
>       whether MSIs are enabled or not in the device. If they are enabled,
>       the device-specific configuration starts at offset 24, otherwise it
>       starts at offset 20.
>   (2) Linux guest assumes that MSIs are initially disabled (doesn't
>       actually check the capability). So it reads the device config at
>       offset 20.
>   (3) Had we enabled MSIs early, host would have enabled the MSI-X
>       capability and device would return the config at offset 24.
>   (4) The guest would read junk and explode.
>
> Therefore we have to create MSI-X routes when the guest requests MSIs, and
> enable/disable them in VFIO when the guest pokes the MSI-X capability. We
> have to follow both physical and virtual state of the capability, which
> makes the state machine a bit complex, but I think it works.
>
> An important missing feature is the absence of pending MSI handling. When
> a vector or the function is masked, we should rewire the IRQFD to a
> special thread that keeps note of pending interrupts (or just poll the
> IRQFD before recreating the route?). And when the vector is unmasked, one
> MSI should be injected if it was pending. At the moment no MSI is
> injected, we simply disconnect the IRQFD and all messages are lost.
>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@xxxxxxx>
>
> ---
> v3->v4: fix capabilities
> ---
>  include/kvm/vfio.h |  52 +++++
>  vfio/pci.c         | 642 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 682 insertions(+), 12 deletions(-)
>

[...]

> diff --git a/vfio/pci.c b/vfio/pci.c
> index adf6f03a9f0a..ca8ed53265a1 100644
> --- a/vfio/pci.c
> +++ b/vfio/pci.c

[...]

> +static int vfio_pci_disable_msis(struct kvm *kvm, struct vfio_device *vdev)
> +{
> +	int ret;
> +	struct vfio_pci_device *pdev = &vdev->pci;
> +	struct vfio_pci_msi_common *msis = &pdev->msix;
> +	struct vfio_irq_set irq_set = {
> +		.argsz	= sizeof(irq_set),
> +		.flags 	= VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
> +		.index 	= msis->info.index,
> +		.start 	= 0,
> +		.count	= 0,
> +	};
> +
> +	if (!msi_is_enabled(msis->phys_state) ||
> +	    msi_is_enabled(msis->virt_state))

Is the check for virt_state really needed? It's not clear why it matters
- please add a comment to save some future head scratching.

> +		return 0;
> +
> +	ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> +	if (ret < 0) {
> +		perror("VFIO_DEVICE_SET_IRQS(NONE)");
> +		return ret;
> +	}
> +
> +	msi_set_enabled(msis->phys_state, false);
> +	msi_set_empty(msis->phys_state, true);
> +
> +	return 0;
> +}
> +

[...]

> +static void vfio_pci_msix_table_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
> +				       u32 len, u8 is_write, void *ptr)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	struct vfio_pci_msi_entry *entry;
> +	struct vfio_pci_device *pdev = ptr;
> +	struct vfio_device *vdev = container_of(pdev, struct vfio_device, pci);
> +
> +	u64 offset = addr - pdev->msix_table.guest_phys_addr;
> +
> +	size_t vector = offset / PCI_MSIX_ENTRY_SIZE;
> +	/* PCI spec says that software must use aligned 4 or 8 bytes accesses */
> +	off_t field = offset % PCI_MSIX_ENTRY_SIZE;

Does the 4/8 byte access assumption be checked? Or is this something
enforced before we get here.

> +	entry = &pdev->msix.entries[vector];
> +
> +	mutex_lock(&pdev->msix.mutex);
> +
> +	if (!is_write) {
> +		memcpy(data, (void *)&entry->config + field, len);
> +		goto out_unlock;
> +	}
> +
> +	memcpy((void *)&entry->config + field, data, len);
> +
> +	if (field != PCI_MSIX_ENTRY_VECTOR_CTRL)
> +		goto out_unlock;

Could an 8 byte access modify the vector control field?

> +
> +	msi_set_masked(entry->virt_state, entry->config.ctrl &
> +		       PCI_MSIX_ENTRY_CTRL_MASKBIT);
> +
> +	if (vfio_pci_update_msi_entry(kvm, vdev, entry) < 0)
> +		/* Not much we can do here. */
> +		dev_err(vdev, "failed to configure MSIX vector %zu", vector);
> +
> +	/* Update the physical capability if necessary */
> +	if (vfio_pci_enable_msis(kvm, vdev))
> +		dev_err(vdev, "cannot enable MSIX");
> +
> +out_unlock:
> +	mutex_unlock(&pdev->msix.mutex);
> +}
> +

[...]

>  
> +static int vfio_pci_create_msix_table(struct kvm *kvm,
> +				      struct vfio_pci_device *pdev)
> +{
> +	int ret;
> +	size_t i;
> +	size_t nr_entries;
> +	size_t table_size;
> +	struct vfio_pci_msi_entry *entries;
> +	struct vfio_pci_msix_pba *pba = &pdev->msix_pba;
> +	struct vfio_pci_msix_table *table = &pdev->msix_table;
> +	struct msix_cap *msix = PCI_CAP(&pdev->hdr, pdev->msix.pos);
> +
> +	table->bar = msix->table_offset & PCI_MSIX_TABLE_BIR;
> +	pba->bar = msix->pba_offset & PCI_MSIX_TABLE_BIR;
> +
> +	/*
> +	 * KVM needs memory regions to be multiple of and aligned on PAGE_SIZE.
> +	 */
> +	nr_entries = (msix->ctrl & PCI_MSIX_FLAGS_QSIZE) + 1;
> +	table_size = ALIGN(nr_entries * PCI_MSIX_ENTRY_SIZE, PAGE_SIZE);
> +
> +	entries = calloc(nr_entries, sizeof(struct vfio_pci_msi_entry));
> +	if (!entries)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < nr_entries; i++)
> +		entries[i].config.ctrl = PCI_MSIX_ENTRY_CTRL_MASKBIT;
> +
> +	/*
> +	 * To ease MSI-X cap configuration in case they share the same BAR,
> +	 * collapse table and pending array. According to PCI, address spaces
> +	 * must be power of two. Since nr_entries is a power of two, and PBA
> +	 * size is less than table_size, reserve 2*table_size.
> +	 */

As I understand it, nr_entries is between 0x0 < nr_entries <= 0x800 but
does not have to be a power of two. Also, it's not clear why "address
spaces must be a power of two" is relevant. What am I missing?

Thanks,
Punit

> +	table->guest_phys_addr = pci_get_io_space_block(2 * table_size);
> +	if (!table->guest_phys_addr) {
> +		pr_err("cannot allocate IO space");
> +		ret = -ENOMEM;
> +		goto out_free;
> +	}
> +	pba->guest_phys_addr = table->guest_phys_addr + table_size;
> +
> +	ret = kvm__register_mmio(kvm, table->guest_phys_addr, table_size, false,
> +				 vfio_pci_msix_table_access, pdev);
> +	if (ret < 0)
> +		goto out_free;
> +
> +	/*
> +	 * We could map the physical PBA directly into the guest, but it's
> +	 * likely smaller than a page, and we can only hand full pages to the
> +	 * guest. Even though the PCI spec disallows sharing a page used for
> +	 * MSI-X with any other resource, it allows to share the same page
> +	 * between MSI-X table and PBA. For the sake of isolation, create a
> +	 * virtual PBA.
> +	 */
> +	pba->size = nr_entries / 8;
> +
> +	ret = kvm__register_mmio(kvm, pba->guest_phys_addr, pba->size, false,
> +				 vfio_pci_msix_pba_access, pdev);
> +	if (ret < 0)
> +		goto out_free;
> +
> +	pdev->msix.entries = entries;
> +	pdev->msix.nr_entries = nr_entries;
> +	table->size = table_size;
> +
> +	return 0;
> +
> +out_free:
> +	free(entries);
> +
> +	return ret;
> +}

[...]