Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support

Sheng Yang <sheng@xxxxxxxxxxxxxxx> · Fri, 5 Nov 2010 18:53:50 +0800

On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote:
> On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote:
> > > > +};
> > > > +
> > > > +This ioctl would enable in-kernel MSI-X emulation, which would
> > > > handle MSI-X +mask bit in the kernel.
> > > 
> > > What happens on repeated calls when it's already enabled?
> > > How does one disable after it's enabled?
> > 
> > Suppose this should only be called once. But again would update the MMIO
> > base.
> 
> So maybe add this all to documentation.
> 
> > It
> > would be disabled along with device deassignment.
> 
> So what are you going to do when guest enables and later disables MSIX?
> Disable assignment completely?

This device goes with PCI resources allocation, rather than IRQ assignment.
> 
> > We enable it explicitly because
> > not all devices have MSI-X capability.
> > 
> > > > +
> > > 
> > > This is very specialized for assigned devices.  I would like an
> > > interface not tied to assigned devices explicitly
> > > (even if the implementation at the moment only works
> > > for assigned devices, I can patch that in later). E.g. vhost-net would
> > > benefit from such an extension too.  Maybe tie this to a special
> > > GSI and this GSI can be an assigned device or not?
> > 
> > You're talking about KVM_ASSIGN_GET_MSIX_ENTRY?
> 
> Yes but also, and more importantly, about KVM_ASSIGN_REG_MSIX_MMIO
> and kvm_assigned_msix_mmio really.
> 
> > We can't tie it to GSI, because we
> > should also cover entries without GSI. The entry number should be fine
> > for all kinds of devices using MSI-X.
> 
> I don't completely understand: entries without GSI
> never need to inject an interrupt. Why do we care
> about them? Let's pass such accesses to userspace.
>
> In any case, I think MSIX GSIs are inexpensive (we never search them
> all), so we could create GSIs for all entries if we wanted to.

All mask bit accessing controlled by kernel, in order to keep consistent.
> 
> > I am not sure about what PV one would looks like.
> > I use kvm_assigned_msix_entry
> > just because it can be easily reused.
> 
> Not only PV, emulated devices too. OK let's try to figure out:
> 
> What happens with userspace is, we inject interrupts either through
> ioctls or eventfds. Avoiding an exit to userspace on MSIX
> access is beneficial in exactly the same way.
> 
> We could have an ioctl to pass in the table addresses, and mapping
> table to relevant GSIs. For example, add kvm_msix_mmio with table
> start/end, also specify which GSIs are covered.
> Maybe ask that userspace allocate all GSIs for a table consequitively?
> Or add kvm_irq_routing_msix which is same as msi but
> also has the mmio addresses? Bonus points for avoiding the need
> to scan all table on each write. For example, don't scan at all,
> instead just set a bit in KVM, and set irq entry on the next
> interrupt only: we have KVM_MAX_MSIX_PER_DEV as 256 so maybe a small 32
> byte bitmap would be enough for this, avoding a linear scan.
> 
> When we pass in kvm_msix_mmio, kvm would start registering masked state.
> 
> > > > +4.48 KVM_ASSIGN_GET_MSIX_ENTRY
> > > > +
> > > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > > +Architectures: x86
> > > > +Type: vm ioctl
> > > > +Parameters: struct kvm_assigned_msix_entry (in and out)
> > > > +Returns: 0 on success, !0 on error
> > > > +
> > > > +struct kvm_assigned_msix_entry {
> > > > +	/* Assigned device's ID */
> > > > +	__u32 assigned_dev_id;
> > > > +	/* Ignored */
> > > > +	__u32 gsi;
> > > > +	/* The index of entry in the MSI-X table */
> > > > +	__u16 entry;
> > > > +	/* Querying flags and returning status */
> > > > +	__u16 flags;
> > > > +	/* Must be 0 */
> > > > +	__u16 padding[2];
> > > > +};
> > > > +
> > > > +This ioctl would allow userspace to get the status of one specific
> > > > MSI-X +entry. Currently we support mask bit status querying.
> > > > +
> > > > 
> > > >  5. The kvm_run structure
> > > >  
> > > >  Application code obtains a pointer to the kvm_run structure by
> > > > 
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index f3f86b2..8fd5121 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > 
> > > >  	case KVM_CAP_DEBUGREGS:
> > > >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > > 
> > > >  	case KVM_CAP_XSAVE:
> > > > +	case KVM_CAP_DEVICE_MSIX_MASK:
> > > >  		r = 1;
> > > >  		break;
> > > >  	
> > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > index 919ae53..eafafb1 100644
> > > > --- a/include/linux/kvm.h
> > > > +++ b/include/linux/kvm.h
> > > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > > > 
> > > >  #endif
> > > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > 
> > > > +#ifdef __KVM_HAVE_MSIX
> > > > +#define KVM_CAP_DEVICE_MSIX_MASK 59
> > > > +#endif
> > > 
> > > We seem to have these HAVE macros all over.
> > > Avi, what's the idea? Let's drop them?
> > 
> > Um? This kind of marcos are used here to avoid vendor specific string in
> > the header file.
> 
> Am I the only one that dislikes the ifdefs we have all over this code then?
> Rest of linux tries to keep ifdefs local by stubbing functionality out.
> 
> > > > +	BUG_ON(adev->msix_mmio_base == 0);
> > > 
> > > Below I see
> > > 
> > > 	adev->msix_mmio_base = msix_mmio->base_addr;
> > > 
> > > which comes from userspace. BUG_ON is an unfriendly way to
> > > tell user about a bug in qemu.
> > > 
> > > Anyway, I don't think we should special-case 0 gpa.
> > > It's up to the user where to base the table.
> > > Use a separate flag to signal that table was initialized.
> > 
> > Base_addr can't be 0 in any case. Only the unimplemented BAR are
> > hardwired to zero.
> 
> No, the PCI spec does not say so. Check it out: the unimplemented BAR
> is hardwired to zero but implemented can be set to 0 as well.

OK, OK, OK, I would add one more flag.
>
> > Also if we get here with base_addr = 0, it is surely a bug.
> > 
> > > > +
> > > > +	return -EINVAL;
> > > > +}
> > > > +
> > > > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr,
> > > > int len, +			  void *val)
> > > > +{
> > > > +	struct kvm_assigned_dev_kernel *adev =
> > > > +			container_of(this, struct kvm_assigned_dev_kernel,
> > > > +				     msix_mmio_dev);
> > > > +	int idx, r = 0;
> > > > +	u32 entry[4];
> > > > +	struct kvm_kernel_irq_routing_entry e;
> > > > +
> > > > +	mutex_lock(&adev->kvm->lock);
> > > > +	if (!msix_mmio_in_range(adev, addr, len)) {
> > > > +		r = -EOPNOTSUPP;
> > > > +		goto out;
> > > > +	}
> > > > +	if ((addr & 0x3) || len != 4)
> > > > +		goto out;
> > > > +
> > > > +	idx = msix_get_enabled_idx(adev, addr, len);
> > > > +	if (idx < 0) {
> > > > +		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > +		if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > > +				PCI_MSIX_ENTRY_VECTOR_CTRL)
> > > > +			*(unsigned long *)val =
> > > > +				test_bit(idx, adev->msix_mask_bitmap) ?
> > > > +				PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> > > 
> > > I suspect this is wrong for big endian platforms: PCI returns
> > > all values in little endian format.
> > 
> > Let big endian platform take care of it later... There is only x86 has
> > __KVM_HAVE_MSIX now.
> 
> Well it's easier to make the code correct when you write it than later.

Go back to me if someone want to implement MSI device assignment on big-endian 
machine.

> > > > +		else
> > > > +			r = -EOPNOTSUPP;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	r = kvm_get_irq_routing_entry(adev->kvm,
> > > > +			adev->guest_msix_entries[idx].vector, &e);
> > > > +	if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > > > +		printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> > > > +			"idx %d, addr 0x%llx, len %d\n", idx, addr, len);
> > > 
> > > Can a malicious userspace trigger this intentionally?
> > 
> > No.
> 
> Here's a scenario: bind an MSIX routing entry to device, then later
> change its type to regular interrupt. Fills the log with warnings,
> doesn't it?

Does QEmu suppose to discard the old routing?

> > > > +		r = -EOPNOTSUPP;
> > > > +		goto out;
> > > > +	}
> > > > +	entry[0] = e.msi.address_lo;
> > > > +	entry[1] = e.msi.address_hi;
> > > > +	entry[2] = e.msi.data;
> > > > +	entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
> > > > +			adev->msix_mask_bitmap);
> > > 
> > > I suspect this is wrong for big endian platforms: PCI returns
> > > all values in little endian format.
> > > 
> > > > +	memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / 4
> > > 
> > > 4 is sizeof *entry?
> > 
> > 4 elements.
> > In fact "PCI_MSIX_ENTRY_SIZE / sizeof u32".
> 
> Are you sure?  This does not make sense to me.  address in bytes, we
> take offset from start of entry in bytes, to figure which dword we need
> we divide by size of dword. No?

I've corrected this one.

--
regards
Yang, Sheng

> 
> > > ], len);
> > > 
> > > > +
> > > > +out:
> > > > +	mutex_unlock(&adev->kvm->lock);
> > > > +	return r;
> > > > +}
> > > > +
> > > > +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr,
> > > > int len, +			   const void *val)
> > > > +{
> > > > +	struct kvm_assigned_dev_kernel *adev =
> > > > +			container_of(this, struct kvm_assigned_dev_kernel,
> > > > +				     msix_mmio_dev);
> > > > +	int idx, r = 0;
> > > > +	unsigned long new_val = *(unsigned long *)val;
> > > 
> > > I suspect this is wrong for big endian platforms: PCI returns
> > > all values in little endian format.
> > > 
> > > > +
> > > > +	mutex_lock(&adev->kvm->lock);
> > > > +	if (!msix_mmio_in_range(adev, addr, len)) {
> > > > +		r = -EOPNOTSUPP;
> > > > +		goto out;
> > > > +	}
> > > > +	if ((addr & 0x3) || len != 4)
> > > > +		goto out;
> > > > +
> > > > +	idx = msix_get_enabled_idx(adev, addr, len);
> > > > +	if (idx < 0) {
> > > > +		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > +		if (((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > > +				PCI_MSIX_ENTRY_VECTOR_CTRL)) {
> > > > +			if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > > +				goto out;
> > > > +			if (new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > > +				set_bit(idx, adev->msix_mask_bitmap);
> > > > +			else
> > > > +				clear_bit(idx, adev->msix_mask_bitmap);
> > > > +		} else
> > > > +			/* Userspace would handle other MMIO writing */
> > > > +			r = -EOPNOTSUPP;
> > > > +		goto out;
> > > > +	}
> > > > +	if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> > > > +		r = -EOPNOTSUPP;
> > > > +		goto out;
> > > > +	}
> > > > +	if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > > +		goto out;
> > > > +	update_msix_mask(adev, idx, !!(new_val &
> > > > PCI_MSIX_ENTRY_CTRL_MASKBIT)); +out:
> > > > +	mutex_unlock(&adev->kvm->lock);
> > > > +
> > > > +	return r;
> > > > +}
> > > > +
> > > > +static const struct kvm_io_device_ops msix_mmio_ops = {
> > > > +	.read     = msix_mmio_read,
> > > > +	.write    = msix_mmio_write,
> > > > +};
> > > > +
> > > > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> > > > +				struct kvm_assigned_msix_mmio *msix_mmio)
> > > > +{
> > > > +	int r = 0;
> > > > +	struct kvm_assigned_dev_kernel *adev;
> > > > +
> > > > +	mutex_lock(&kvm->lock);
> > > > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > > +				      msix_mmio->assigned_dev_id);
> > > > +	if (!adev) {
> > > > +		r = -EINVAL;
> > > > +		goto out;
> > > > +	}
> > > > +	if (msix_mmio->base_addr == 0) {
> > > > +		r = -EINVAL;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	mutex_lock(&kvm->slots_lock);
> > > > +	if (adev->msix_mmio_base == 0) {
> > > 
> > > What does this do? Handles case we are already registered?
> > > Also ! adev->msix_mmio_base
> > 
> > We don't need to register again. Just update the address.
> > 
> > Well, msix_mmio_base is an address. I think ==0 is more properly.
> 
> Yes, there's a test above that validates for 0 and returns EINVAL.
> But using a separate flag that userspace can not control
> to mean 'already registered' will make the code clearer.
> 
> > --
> > regards
> > Yang, Sheng
> > 
> > > > +		kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> > > > +		r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > > > +				&adev->msix_mmio_dev);
> > > > +		if (r)
> > > > +			goto out2;
> > > > +	}
> > > > +
> > > > +	adev->msix_mmio_base = msix_mmio->base_addr;
> > > > +out2:
> > > > +	mutex_unlock(&kvm->slots_lock);
> > > > +out:
> > > > +	mutex_unlock(&kvm->lock);
> > > > +
> > > > +	return r;
> > > > +}
> > > > 
> > > >  #endif
> > > >  
> > > >  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > > > 
> > > > @@ -812,6 +1065,36 @@ long kvm_vm_ioctl_assigned_device(struct kvm
> > > > *kvm, unsigned ioctl,
> > > > 
> > > >  			goto out;
> > > >  		
> > > >  		break;
> > > >  	
> > > >  	}
> > > > 
> > > > +	case KVM_ASSIGN_GET_MSIX_ENTRY: {
> > > > +		struct kvm_assigned_msix_entry entry;
> > > > +		r = -EFAULT;
> > > > +		if (copy_from_user(&entry, argp, sizeof entry))
> > > > +			goto out;
> > > > +		r = kvm_vm_ioctl_get_msix_entry(kvm, &entry);
> > > > +		if (r)
> > > > +			goto out;
> > > > +		r = -EFAULT;
> > > > +		if (copy_to_user(argp, &entry, sizeof entry))
> > > > +			goto out;
> > > > +		r = 0;
> > > > +		break;
> > > > +	}
> > > > +	case KVM_ASSIGN_REG_MSIX_MMIO: {
> > > > +		struct kvm_assigned_msix_mmio msix_mmio;
> > > > +
> > > > +		r = -EFAULT;
> > > > +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > > > +			goto out;
> > > > +
> > > > +		r = -EINVAL;
> > > > +		if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> > > > +			goto out;
> > > > +
> > > > +		r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> > > > +		if (r)
> > > > +			goto out;
> > > > +		break;
> > > > +	}
> > > > 
> > > >  #endif
> > > >  
> > > >  	}
> > > >  
> > > >  out:
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html