Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support

Sheng Yang <sheng@xxxxxxxxxxxxxxx> · Fri, 5 Nov 2010 20:01:53 +0800

On Friday 05 November 2010 18:53:50 Sheng Yang wrote:
> On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote:
> > On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote:
> > > > > +};
> > > > > +
> > > > > +This ioctl would enable in-kernel MSI-X emulation, which would
> > > > > handle MSI-X +mask bit in the kernel.
> > > > 
> > > > What happens on repeated calls when it's already enabled?
> > > > How does one disable after it's enabled?
> > > 
> > > Suppose this should only be called once. But again would update the
> > > MMIO base.
> > 
> > So maybe add this all to documentation.
> > 
> > > It
> > > would be disabled along with device deassignment.
> > 
> > So what are you going to do when guest enables and later disables MSIX?
> > Disable assignment completely?
> 
> This device goes with PCI resources allocation, rather than IRQ assignment.
> 
> > > We enable it explicitly because
> > > not all devices have MSI-X capability.
> > > 
> > > > > +
> > > > 
> > > > This is very specialized for assigned devices.  I would like an
> > > > interface not tied to assigned devices explicitly
> > > > (even if the implementation at the moment only works
> > > > for assigned devices, I can patch that in later). E.g. vhost-net
> > > > would benefit from such an extension too.  Maybe tie this to a
> > > > special GSI and this GSI can be an assigned device or not?
> > > 
> > > You're talking about KVM_ASSIGN_GET_MSIX_ENTRY?
> > 
> > Yes but also, and more importantly, about KVM_ASSIGN_REG_MSIX_MMIO
> > and kvm_assigned_msix_mmio really.
> > 
> > > We can't tie it to GSI, because we
> > > should also cover entries without GSI. The entry number should be fine
> > > for all kinds of devices using MSI-X.
> > 
> > I don't completely understand: entries without GSI
> > never need to inject an interrupt. Why do we care
> > about them? Let's pass such accesses to userspace.
> > 
> > In any case, I think MSIX GSIs are inexpensive (we never search them
> > all), so we could create GSIs for all entries if we wanted to.
> 
> All mask bit accessing controlled by kernel, in order to keep consistent.
> 
> > > I am not sure about what PV one would looks like.
> > > I use kvm_assigned_msix_entry
> > > just because it can be easily reused.
> > 
> > Not only PV, emulated devices too. OK let's try to figure out:
> > 
> > What happens with userspace is, we inject interrupts either through
> > ioctls or eventfds. Avoiding an exit to userspace on MSIX
> > access is beneficial in exactly the same way.
> > 
> > We could have an ioctl to pass in the table addresses, and mapping
> > table to relevant GSIs. For example, add kvm_msix_mmio with table
> > start/end, also specify which GSIs are covered.
> > Maybe ask that userspace allocate all GSIs for a table consequitively?
> > Or add kvm_irq_routing_msix which is same as msi but
> > also has the mmio addresses? Bonus points for avoiding the need
> > to scan all table on each write. For example, don't scan at all,
> > instead just set a bit in KVM, and set irq entry on the next
> > interrupt only: we have KVM_MAX_MSIX_PER_DEV as 256 so maybe a small 32
> > byte bitmap would be enough for this, avoding a linear scan.
> > 
> > When we pass in kvm_msix_mmio, kvm would start registering masked state.
> > 
> > > > > +4.48 KVM_ASSIGN_GET_MSIX_ENTRY
> > > > > +
> > > > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > > > +Architectures: x86
> > > > > +Type: vm ioctl
> > > > > +Parameters: struct kvm_assigned_msix_entry (in and out)
> > > > > +Returns: 0 on success, !0 on error
> > > > > +
> > > > > +struct kvm_assigned_msix_entry {
> > > > > +	/* Assigned device's ID */
> > > > > +	__u32 assigned_dev_id;
> > > > > +	/* Ignored */
> > > > > +	__u32 gsi;
> > > > > +	/* The index of entry in the MSI-X table */
> > > > > +	__u16 entry;
> > > > > +	/* Querying flags and returning status */
> > > > > +	__u16 flags;
> > > > > +	/* Must be 0 */
> > > > > +	__u16 padding[2];
> > > > > +};
> > > > > +
> > > > > +This ioctl would allow userspace to get the status of one specific
> > > > > MSI-X +entry. Currently we support mask bit status querying.
> > > > > +
> > > > > 
> > > > >  5. The kvm_run structure
> > > > >  
> > > > >  Application code obtains a pointer to the kvm_run structure by
> > > > > 
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index f3f86b2..8fd5121 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > > 
> > > > >  	case KVM_CAP_DEBUGREGS:
> > > > >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > > > 
> > > > >  	case KVM_CAP_XSAVE:
> > > > > +	case KVM_CAP_DEVICE_MSIX_MASK:
> > > > >  		r = 1;
> > > > >  		break;
> > > > >  	
> > > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > > index 919ae53..eafafb1 100644
> > > > > --- a/include/linux/kvm.h
> > > > > +++ b/include/linux/kvm.h
> > > > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > > > > 
> > > > >  #endif
> > > > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > > > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > > 
> > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > +#define KVM_CAP_DEVICE_MSIX_MASK 59
> > > > > +#endif
> > > > 
> > > > We seem to have these HAVE macros all over.
> > > > Avi, what's the idea? Let's drop them?
> > > 
> > > Um? This kind of marcos are used here to avoid vendor specific string
> > > in the header file.
> > 
> > Am I the only one that dislikes the ifdefs we have all over this code
> > then? Rest of linux tries to keep ifdefs local by stubbing functionality
> > out.
> > 
> > > > > +	BUG_ON(adev->msix_mmio_base == 0);
> > > > 
> > > > Below I see
> > > > 
> > > > 	adev->msix_mmio_base = msix_mmio->base_addr;
> > > > 
> > > > which comes from userspace. BUG_ON is an unfriendly way to
> > > > tell user about a bug in qemu.
> > > > 
> > > > Anyway, I don't think we should special-case 0 gpa.
> > > > It's up to the user where to base the table.
> > > > Use a separate flag to signal that table was initialized.
> > > 
> > > Base_addr can't be 0 in any case. Only the unimplemented BAR are
> > > hardwired to zero.
> > 
> > No, the PCI spec does not say so. Check it out: the unimplemented BAR
> > is hardwired to zero but implemented can be set to 0 as well.
> 
> OK, OK, OK, I would add one more flag.
> 
> > > Also if we get here with base_addr = 0, it is surely a bug.
> > > 
> > > > > +
> > > > > +	return -EINVAL;
> > > > > +}
> > > > > +
> > > > > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr,
> > > > > int len, +			  void *val)
> > > > > +{
> > > > > +	struct kvm_assigned_dev_kernel *adev =
> > > > > +			container_of(this, struct kvm_assigned_dev_kernel,
> > > > > +				     msix_mmio_dev);
> > > > > +	int idx, r = 0;
> > > > > +	u32 entry[4];
> > > > > +	struct kvm_kernel_irq_routing_entry e;
> > > > > +
> > > > > +	mutex_lock(&adev->kvm->lock);
> > > > > +	if (!msix_mmio_in_range(adev, addr, len)) {
> > > > > +		r = -EOPNOTSUPP;
> > > > > +		goto out;
> > > > > +	}
> > > > > +	if ((addr & 0x3) || len != 4)
> > > > > +		goto out;
> > > > > +
> > > > > +	idx = msix_get_enabled_idx(adev, addr, len);
> > > > > +	if (idx < 0) {
> > > > > +		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > > +		if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > > > +				PCI_MSIX_ENTRY_VECTOR_CTRL)
> > > > > +			*(unsigned long *)val =
> > > > > +				test_bit(idx, adev->msix_mask_bitmap) ?
> > > > > +				PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> > > > 
> > > > I suspect this is wrong for big endian platforms: PCI returns
> > > > all values in little endian format.
> > > 
> > > Let big endian platform take care of it later... There is only x86 has
> > > __KVM_HAVE_MSIX now.
> > 
> > Well it's easier to make the code correct when you write it than later.
> 
> Go back to me if someone want to implement MSI device assignment on
> big-endian machine.

Sorry, just realized it's very likely that I don't have an big endian machine to 
test it even at that time...

I think it's really should be done by someone would use and test it.

--
regards
Yang, Sheng

> 
> > > > > +		else
> > > > > +			r = -EOPNOTSUPP;
> > > > > +		goto out;
> > > > > +	}
> > > > > +
> > > > > +	r = kvm_get_irq_routing_entry(adev->kvm,
> > > > > +			adev->guest_msix_entries[idx].vector, &e);
> > > > > +	if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > > > > +		printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> > > > > +			"idx %d, addr 0x%llx, len %d\n", idx, addr, len);
> > > > 
> > > > Can a malicious userspace trigger this intentionally?
> > > 
> > > No.
> > 
> > Here's a scenario: bind an MSIX routing entry to device, then later
> > change its type to regular interrupt. Fills the log with warnings,
> > doesn't it?
> 
> Does QEmu suppose to discard the old routing?
> 
> > > > > +		r = -EOPNOTSUPP;
> > > > > +		goto out;
> > > > > +	}
> > > > > +	entry[0] = e.msi.address_lo;
> > > > > +	entry[1] = e.msi.address_hi;
> > > > > +	entry[2] = e.msi.data;
> > > > > +	entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
> > > > > +			adev->msix_mask_bitmap);
> > > > 
> > > > I suspect this is wrong for big endian platforms: PCI returns
> > > > all values in little endian format.
> > > > 
> > > > > +	memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / 4
> > > > 
> > > > 4 is sizeof *entry?
> > > 
> > > 4 elements.
> > > In fact "PCI_MSIX_ENTRY_SIZE / sizeof u32".
> > 
> > Are you sure?  This does not make sense to me.  address in bytes, we
> > take offset from start of entry in bytes, to figure which dword we need
> > we divide by size of dword. No?
> 
> I've corrected this one.
> 
> --
> regards
> Yang, Sheng
> 
> > > > ], len);
> > > > 
> > > > > +
> > > > > +out:
> > > > > +	mutex_unlock(&adev->kvm->lock);
> > > > > +	return r;
> > > > > +}
> > > > > +
> > > > > +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr,
> > > > > int len, +			   const void *val)
> > > > > +{
> > > > > +	struct kvm_assigned_dev_kernel *adev =
> > > > > +			container_of(this, struct kvm_assigned_dev_kernel,
> > > > > +				     msix_mmio_dev);
> > > > > +	int idx, r = 0;
> > > > > +	unsigned long new_val = *(unsigned long *)val;
> > > > 
> > > > I suspect this is wrong for big endian platforms: PCI returns
> > > > all values in little endian format.
> > > > 
> > > > > +
> > > > > +	mutex_lock(&adev->kvm->lock);
> > > > > +	if (!msix_mmio_in_range(adev, addr, len)) {
> > > > > +		r = -EOPNOTSUPP;
> > > > > +		goto out;
> > > > > +	}
> > > > > +	if ((addr & 0x3) || len != 4)
> > > > > +		goto out;
> > > > > +
> > > > > +	idx = msix_get_enabled_idx(adev, addr, len);
> > > > > +	if (idx < 0) {
> > > > > +		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > > +		if (((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > > > +				PCI_MSIX_ENTRY_VECTOR_CTRL)) {
> > > > > +			if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > > > +				goto out;
> > > > > +			if (new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > > > +				set_bit(idx, adev->msix_mask_bitmap);
> > > > > +			else
> > > > > +				clear_bit(idx, adev->msix_mask_bitmap);
> > > > > +		} else
> > > > > +			/* Userspace would handle other MMIO writing */
> > > > > +			r = -EOPNOTSUPP;
> > > > > +		goto out;
> > > > > +	}
> > > > > +	if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> > > > > +		r = -EOPNOTSUPP;
> > > > > +		goto out;
> > > > > +	}
> > > > > +	if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > > > +		goto out;
> > > > > +	update_msix_mask(adev, idx, !!(new_val &
> > > > > PCI_MSIX_ENTRY_CTRL_MASKBIT)); +out:
> > > > > +	mutex_unlock(&adev->kvm->lock);
> > > > > +
> > > > > +	return r;
> > > > > +}
> > > > > +
> > > > > +static const struct kvm_io_device_ops msix_mmio_ops = {
> > > > > +	.read     = msix_mmio_read,
> > > > > +	.write    = msix_mmio_write,
> > > > > +};
> > > > > +
> > > > > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> > > > > +				struct kvm_assigned_msix_mmio *msix_mmio)
> > > > > +{
> > > > > +	int r = 0;
> > > > > +	struct kvm_assigned_dev_kernel *adev;
> > > > > +
> > > > > +	mutex_lock(&kvm->lock);
> > > > > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > > > +				      msix_mmio->assigned_dev_id);
> > > > > +	if (!adev) {
> > > > > +		r = -EINVAL;
> > > > > +		goto out;
> > > > > +	}
> > > > > +	if (msix_mmio->base_addr == 0) {
> > > > > +		r = -EINVAL;
> > > > > +		goto out;
> > > > > +	}
> > > > > +
> > > > > +	mutex_lock(&kvm->slots_lock);
> > > > > +	if (adev->msix_mmio_base == 0) {
> > > > 
> > > > What does this do? Handles case we are already registered?
> > > > Also ! adev->msix_mmio_base
> > > 
> > > We don't need to register again. Just update the address.
> > > 
> > > Well, msix_mmio_base is an address. I think ==0 is more properly.
> > 
> > Yes, there's a test above that validates for 0 and returns EINVAL.
> > But using a separate flag that userspace can not control
> > to mean 'already registered' will make the code clearer.
> > 
> > > --
> > > regards
> > > Yang, Sheng
> > > 
> > > > > +		kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> > > > > +		r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > > > > +				&adev->msix_mmio_dev);
> > > > > +		if (r)
> > > > > +			goto out2;
> > > > > +	}
> > > > > +
> > > > > +	adev->msix_mmio_base = msix_mmio->base_addr;
> > > > > +out2:
> > > > > +	mutex_unlock(&kvm->slots_lock);
> > > > > +out:
> > > > > +	mutex_unlock(&kvm->lock);
> > > > > +
> > > > > +	return r;
> > > > > +}
> > > > > 
> > > > >  #endif
> > > > >  
> > > > >  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > > > > 
> > > > > @@ -812,6 +1065,36 @@ long kvm_vm_ioctl_assigned_device(struct kvm
> > > > > *kvm, unsigned ioctl,
> > > > > 
> > > > >  			goto out;
> > > > >  		
> > > > >  		break;
> > > > >  	
> > > > >  	}
> > > > > 
> > > > > +	case KVM_ASSIGN_GET_MSIX_ENTRY: {
> > > > > +		struct kvm_assigned_msix_entry entry;
> > > > > +		r = -EFAULT;
> > > > > +		if (copy_from_user(&entry, argp, sizeof entry))
> > > > > +			goto out;
> > > > > +		r = kvm_vm_ioctl_get_msix_entry(kvm, &entry);
> > > > > +		if (r)
> > > > > +			goto out;
> > > > > +		r = -EFAULT;
> > > > > +		if (copy_to_user(argp, &entry, sizeof entry))
> > > > > +			goto out;
> > > > > +		r = 0;
> > > > > +		break;
> > > > > +	}
> > > > > +	case KVM_ASSIGN_REG_MSIX_MMIO: {
> > > > > +		struct kvm_assigned_msix_mmio msix_mmio;
> > > > > +
> > > > > +		r = -EFAULT;
> > > > > +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > > > > +			goto out;
> > > > > +
> > > > > +		r = -EINVAL;
> > > > > +		if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> > > > > +			goto out;
> > > > > +
> > > > > +		r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> > > > > +		if (r)
> > > > > +			goto out;
> > > > > +		break;
> > > > > +	}
> > > > > 
> > > > >  #endif
> > > > >  
> > > > >  	}
> > > > >  
> > > > >  out:
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html