Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Fri, 5 Nov 2010 10:43:33 +0200

On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote:
> > > +};
> > > +
> > > +This ioctl would enable in-kernel MSI-X emulation, which would handle
> > > MSI-X +mask bit in the kernel.
> > 
> > What happens on repeated calls when it's already enabled?
> > How does one disable after it's enabled?
> 
> Suppose this should only be called once. But again would update the MMIO base.

So maybe add this all to documentation.

> It 
> would be disabled along with device deassignment.

So what are you going to do when guest enables and later disables MSIX?
Disable assignment completely?

> We enable it explicitly because 
> not all devices have MSI-X capability.

> > 
> > > +
> > 
> > This is very specialized for assigned devices.  I would like an
> > interface not tied to assigned devices explicitly
> > (even if the implementation at the moment only works
> > for assigned devices, I can patch that in later). E.g. vhost-net would
> > benefit from such an extension too.  Maybe tie this to a special
> > GSI and this GSI can be an assigned device or not?
> 
> You're talking about KVM_ASSIGN_GET_MSIX_ENTRY?

Yes but also, and more importantly, about KVM_ASSIGN_REG_MSIX_MMIO
and kvm_assigned_msix_mmio really.

> We can't tie it to GSI, because we 
> should also cover entries without GSI. The entry number should be fine for all 
> kinds of devices using MSI-X.

I don't completely understand: entries without GSI
never need to inject an interrupt. Why do we care
about them? Let's pass such accesses to userspace.

In any case, I think MSIX GSIs are inexpensive (we never search them
all), so we could create GSIs for all entries if we wanted to.

> I am not sure about what PV one would looks like.
> I use kvm_assigned_msix_entry 
> just because it can be easily reused.

Not only PV, emulated devices too. OK let's try to figure out:

What happens with userspace is, we inject interrupts either through
ioctls or eventfds. Avoiding an exit to userspace on MSIX
access is beneficial in exactly the same way.

We could have an ioctl to pass in the table addresses, and mapping
table to relevant GSIs. For example, add kvm_msix_mmio with table
start/end, also specify which GSIs are covered. Maybe ask that
userspace allocate all GSIs for a table consequitively?
Or add kvm_irq_routing_msix which is same as msi but
also has the mmio addresses? Bonus points for avoiding the need
to scan all table on each write. For example, don't scan at all,
instead just set a bit in KVM, and set irq entry on the next
interrupt only: we have KVM_MAX_MSIX_PER_DEV as 256 so maybe a small 32 byte bitmap
would be enough for this, avoding a linear scan.

When we pass in kvm_msix_mmio, kvm would start registering masked state.

> > 
> > > +4.48 KVM_ASSIGN_GET_MSIX_ENTRY
> > > +
> > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > +Architectures: x86
> > > +Type: vm ioctl
> > > +Parameters: struct kvm_assigned_msix_entry (in and out)
> > > +Returns: 0 on success, !0 on error
> > > +
> > > +struct kvm_assigned_msix_entry {
> > > +	/* Assigned device's ID */
> > > +	__u32 assigned_dev_id;
> > > +	/* Ignored */
> > > +	__u32 gsi;
> > > +	/* The index of entry in the MSI-X table */
> > > +	__u16 entry;
> > > +	/* Querying flags and returning status */
> > > +	__u16 flags;
> > > +	/* Must be 0 */
> > > +	__u16 padding[2];
> > > +};
> > > +
> > > +This ioctl would allow userspace to get the status of one specific MSI-X
> > > +entry. Currently we support mask bit status querying.
> > > +
> > > 
> > >  5. The kvm_run structure
> > >  
> > >  Application code obtains a pointer to the kvm_run structure by
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index f3f86b2..8fd5121 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > 
> > >  	case KVM_CAP_DEBUGREGS:
> > >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > 
> > >  	case KVM_CAP_XSAVE:
> > > +	case KVM_CAP_DEVICE_MSIX_MASK:
> > >  		r = 1;
> > >  		break;
> > >  	
> > >  	case KVM_CAP_COALESCED_MMIO:
> > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > index 919ae53..eafafb1 100644
> > > --- a/include/linux/kvm.h
> > > +++ b/include/linux/kvm.h
> > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > > 
> > >  #endif
> > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > 
> > > +#ifdef __KVM_HAVE_MSIX
> > > +#define KVM_CAP_DEVICE_MSIX_MASK 59
> > > +#endif
> > 
> > We seem to have these HAVE macros all over.
> > Avi, what's the idea? Let's drop them?
> 
> Um? This kind of marcos are used here to avoid vendor specific string in the header 
> file. 

Am I the only one that dislikes the ifdefs we have all over this code then?
Rest of linux tries to keep ifdefs local by stubbing functionality out.

> > > +	BUG_ON(adev->msix_mmio_base == 0);
> > 
> > Below I see
> > 	adev->msix_mmio_base = msix_mmio->base_addr;
> > which comes from userspace. BUG_ON is an unfriendly way to
> > tell user about a bug in qemu.
> >
> > Anyway, I don't think we should special-case 0 gpa.
> > It's up to the user where to base the table.
> > Use a separate flag to signal that table was initialized.
> 
> Base_addr can't be 0 in any case. Only the unimplemented BAR are hardwired to 
> zero.

No, the PCI spec does not say so. Check it out: the unimplemented BAR
is hardwired to zero but implemented can be set to 0 as well.

> Also if we get here with base_addr = 0, it is surely a bug.
> > > +
> > > +	return -EINVAL;
> > > +}
> > > +
> > > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int
> > > len, +			  void *val)
> > > +{
> > > +	struct kvm_assigned_dev_kernel *adev =
> > > +			container_of(this, struct kvm_assigned_dev_kernel,
> > > +				     msix_mmio_dev);
> > > +	int idx, r = 0;
> > > +	u32 entry[4];
> > > +	struct kvm_kernel_irq_routing_entry e;
> > > +
> > > +	mutex_lock(&adev->kvm->lock);
> > > +	if (!msix_mmio_in_range(adev, addr, len)) {
> > > +		r = -EOPNOTSUPP;
> > > +		goto out;
> > > +	}
> > > +	if ((addr & 0x3) || len != 4)
> > > +		goto out;
> > > +
> > > +	idx = msix_get_enabled_idx(adev, addr, len);
> > > +	if (idx < 0) {
> > > +		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > +		if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > +				PCI_MSIX_ENTRY_VECTOR_CTRL)
> > > +			*(unsigned long *)val =
> > > +				test_bit(idx, adev->msix_mask_bitmap) ?
> > > +				PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> > 
> > I suspect this is wrong for big endian platforms: PCI returns
> > all values in little endian format.
> 
> Let big endian platform take care of it later... There is only x86 has 
> __KVM_HAVE_MSIX now.

Well it's easier to make the code correct when you write it than later.

> > 
> > > +		else
> > > +			r = -EOPNOTSUPP;
> > > +		goto out;
> > > +	}
> > > +
> > > +	r = kvm_get_irq_routing_entry(adev->kvm,
> > > +			adev->guest_msix_entries[idx].vector, &e);
> > > +	if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > > +		printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> > > +			"idx %d, addr 0x%llx, len %d\n", idx, addr, len);
> > 
> > Can a malicious userspace trigger this intentionally?
> 
> No. 

Here's a scenario: bind an MSIX routing entry to device, then later
change its type to regular interrupt. Fills the log with warnings,
doesn't it?

> > 
> > > +		r = -EOPNOTSUPP;
> > > +		goto out;
> > > +	}
> > > +	entry[0] = e.msi.address_lo;
> > > +	entry[1] = e.msi.address_hi;
> > > +	entry[2] = e.msi.data;
> > > +	entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
> > > +			adev->msix_mask_bitmap);
> > 
> > I suspect this is wrong for big endian platforms: PCI returns
> > all values in little endian format.
> > 
> > > +	memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / 4
> > 
> > 4 is sizeof *entry?
> 
> 4 elements.
> In fact "PCI_MSIX_ENTRY_SIZE / sizeof u32".

Are you sure?  This does not make sense to me.  address in bytes, we
take offset from start of entry in bytes, to figure which dword we need
we divide by size of dword. No?

> > 
> > ], len);
> > 
> > > +
> > > +out:
> > > +	mutex_unlock(&adev->kvm->lock);
> > > +	return r;
> > > +}
> > > +
> > > +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int
> > > len, +			   const void *val)
> > > +{
> > > +	struct kvm_assigned_dev_kernel *adev =
> > > +			container_of(this, struct kvm_assigned_dev_kernel,
> > > +				     msix_mmio_dev);
> > > +	int idx, r = 0;
> > > +	unsigned long new_val = *(unsigned long *)val;
> > 
> > I suspect this is wrong for big endian platforms: PCI returns
> > all values in little endian format.
> > 
> > > +
> > > +	mutex_lock(&adev->kvm->lock);
> > > +	if (!msix_mmio_in_range(adev, addr, len)) {
> > > +		r = -EOPNOTSUPP;
> > > +		goto out;
> > > +	}
> > > +	if ((addr & 0x3) || len != 4)
> > > +		goto out;
> > > +
> > > +	idx = msix_get_enabled_idx(adev, addr, len);
> > > +	if (idx < 0) {
> > > +		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > +		if (((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > +				PCI_MSIX_ENTRY_VECTOR_CTRL)) {
> > > +			if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > +				goto out;
> > > +			if (new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > +				set_bit(idx, adev->msix_mask_bitmap);
> > > +			else
> > > +				clear_bit(idx, adev->msix_mask_bitmap);
> > > +		} else
> > > +			/* Userspace would handle other MMIO writing */
> > > +			r = -EOPNOTSUPP;
> > > +		goto out;
> > > +	}
> > > +	if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> > > +		r = -EOPNOTSUPP;
> > > +		goto out;
> > > +	}
> > > +	if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > +		goto out;
> > > +	update_msix_mask(adev, idx, !!(new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT));
> > > +out:
> > > +	mutex_unlock(&adev->kvm->lock);
> > > +
> > > +	return r;
> > > +}
> > > +
> > > +static const struct kvm_io_device_ops msix_mmio_ops = {
> > > +	.read     = msix_mmio_read,
> > > +	.write    = msix_mmio_write,
> > > +};
> > > +
> > > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> > > +				struct kvm_assigned_msix_mmio *msix_mmio)
> > > +{
> > > +	int r = 0;
> > > +	struct kvm_assigned_dev_kernel *adev;
> > > +
> > > +	mutex_lock(&kvm->lock);
> > > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > +				      msix_mmio->assigned_dev_id);
> > > +	if (!adev) {
> > > +		r = -EINVAL;
> > > +		goto out;
> > > +	}
> > > +	if (msix_mmio->base_addr == 0) {
> > > +		r = -EINVAL;
> > > +		goto out;
> > > +	}
> > > +
> > > +	mutex_lock(&kvm->slots_lock);
> > > +	if (adev->msix_mmio_base == 0) {
> > 
> > What does this do? Handles case we are already registered?
> > Also ! adev->msix_mmio_base
> 
> We don't need to register again. Just update the address.
> 
> Well, msix_mmio_base is an address. I think ==0 is more properly.

Yes, there's a test above that validates for 0 and returns EINVAL.
But using a separate flag that userspace can not control
to mean 'already registered' will make the code clearer.

> 
> --
> regards
> Yang, Sheng
> 
> > 
> > > +		kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> > > +		r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > > +				&adev->msix_mmio_dev);
> > > +		if (r)
> > > +			goto out2;
> > > +	}
> > > +
> > > +	adev->msix_mmio_base = msix_mmio->base_addr;
> > > +out2:
> > > +	mutex_unlock(&kvm->slots_lock);
> > > +out:
> > > +	mutex_unlock(&kvm->lock);
> > > +
> > > +	return r;
> > > +}
> > > 
> > >  #endif
> > >  
> > >  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > > 
> > > @@ -812,6 +1065,36 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm,
> > > unsigned ioctl,
> > > 
> > >  			goto out;
> > >  		
> > >  		break;
> > >  	
> > >  	}
> > > 
> > > +	case KVM_ASSIGN_GET_MSIX_ENTRY: {
> > > +		struct kvm_assigned_msix_entry entry;
> > > +		r = -EFAULT;
> > > +		if (copy_from_user(&entry, argp, sizeof entry))
> > > +			goto out;
> > > +		r = kvm_vm_ioctl_get_msix_entry(kvm, &entry);
> > > +		if (r)
> > > +			goto out;
> > > +		r = -EFAULT;
> > > +		if (copy_to_user(argp, &entry, sizeof entry))
> > > +			goto out;
> > > +		r = 0;
> > > +		break;
> > > +	}
> > > +	case KVM_ASSIGN_REG_MSIX_MMIO: {
> > > +		struct kvm_assigned_msix_mmio msix_mmio;
> > > +
> > > +		r = -EFAULT;
> > > +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > > +			goto out;
> > > +
> > > +		r = -EINVAL;
> > > +		if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> > > +			goto out;
> > > +
> > > +		r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> > > +		if (r)
> > > +			goto out;
> > > +		break;
> > > +	}
> > > 
> > >  #endif
> > >  
> > >  	}
> > >  
> > >  out:
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html