On Fri, Oct 22, 2010 at 09:30:09PM +0800, Sheng Yang wrote: > On Friday 22 October 2010 18:17:05 Michael S. Tsirkin wrote: > > On Fri, Oct 22, 2010 at 12:42:43PM +0800, Sheng Yang wrote: > > > On Thursday 21 October 2010 16:39:07 Michael S. Tsirkin wrote: > > > > On Thu, Oct 21, 2010 at 04:30:02PM +0800, Sheng Yang wrote: > > > > > On Wednesday 20 October 2010 16:26:32 Sheng Yang wrote: > > > > > > This patch enables per-vector masking for assigned devices using MSI-X. > > > > > > > > > > The basic division of responsibilities between the kernel and QEmu is: > > > > > > > > > > 1. Because QEmu owns the irq routing table, changes to the table > > > > > should still go to QEmu, like we did in msix_mmio_write(). > > > > > > > > > > 2. The other things can be done in the kernel, for performance. Here > > > > > we covered reads (the entry converted from the routing table and the mask > > > > > bit state of enabled MSI-X entries), and writes to the mask bit of > > > > > enabled MSI-X entries. Originally we only had the mask bit handled in the > > > > > kernel, but later we found that the Linux kernel would read MSI-X mmio > > > > > just after every write to the mask bit, in order to flush the write. > > > > > So we added reading MSI data/addr as well. > > > > > > > > > > 3. Accesses to a disabled entry's mask bit would go to QEmu, because they > > > > > may result in disabling/enabling MSI-X. Explained later. > > > > > > > > > > 4. Only QEmu has knowledge of PCI configuration space, so it's QEmu > > > > > that decides whether to enable/disable MSI-X for the device. > > > > > > > > Config space yes, but it's a simple global yes/no after all. > > > > > > > > > 5. There is a distinction between enabled entries and disabled entries > > > > > of the MSI-X table. > > > > > > > > That's my point. There's no such thing as 'enabled entries' > > > > in the spec. There are only masked and unmasked entries. > > > > > > > > The current interface deals with gsi numbers so qemu had to work around > > > > this.
The hack used there is removing the gsi for a masked vector which has 0 > > > > address and data. It works because this is what linux and windows > > > > guests happen to do, but it is out of spec: the vector/data values for a > > > > masked entry have no meaning. > > > Well, I just realized something unnatural about the entry containing 0 > > > data/address. So I checked the spec again, and found the mask bit > > > should be set after reset. So after fixing this, I think an unmasked 0 > > > address/data entry shouldn't be there anymore. > > You are right that this 0 check is completely out of spec. But see > > below. The issue is the spec does not require you to mask an entry if > > the device does not use it. And we do not want to waste host vectors. > > One can imagine logic where we would detect that an interrupt > > has not been used in a long while, mask it and give up > > a host vector. Then check the pending bit once in a while to > > see whether the device started using it. > I don't think introducing this complex and speculative logic makes sense. I > haven't seen any such scenario yet. What's the issue with the current implementation? The main issue I think is that the current implementation assumes that when MSI-X is enabled, all vectors that are masked or have a 0 value will not be used anymore. The spec says no such thing. > > > > Since you are building a new interface, you can design it without > > > > constraints... > > > A constraint is pci_enable_msix(). > > It's an internal kernel API. No one prevents us from changing it. > You can say no one. But I'm afraid if we want to overhaul this kind of core PCI > function, it may take months to get it checked in upstream - also assuming we > can persuade them this overhaul is absolutely needed (I hope I'm wrong on this). > This may also mean we have to find another user for this kind of change. No, usually a single user is enough to add internal APIs :) This is because we do not promise backwards compatibility there.
> The > key issue is I don't know what we can gain for certain from it. The current > disable/enable mechanism still works well. Heh, given that things seem to work without mask support at all, I am not sure what this means. > I don't know why we need to spend a lot of > effort on this just because the spec doesn't say there are "enabled/disabled entries". I would like to see an implementation using 100% architectural features. It's enough that we need to worry about device quirks. Relying on guests to only use them in a specific way means for each bug we'll have to worry whether the guest is doing something we did not expect. If something is just too hard to implement, a temporary workaround would be to still build an interface to allow this, and then add a trace in the kernel making this easy to detect. Then we can fix this fully in the next version without affecting userspace. > Yes it's not that elegant, but we need to carefully think about whether the effort is worth it. What I think we should be careful about is making sure we avoid making the kernel/userspace API depend on an internal kernel one. Supporting such an API long term when the internal one changes will be much more painful. > > > > > We have to use it to allocate an irq for each > > > entry, as well as program the entry in the real hardware. > > > pci_enable_msix() is only a yes/no choice. We can't add newly enabled > > > entries after pci_enable_msix(), > > With the current APIs. > > > and we can only enable/disable/mask/unmask one IRQ through the kernel API, > > not the entry in the MSI-X table. And we still have to allocate a new IRQ > > for a new entry. When the guest unmasks a "disabled entry", we have to disable > > and enable MSI-X again in order to use the new entry. That's why > > the "enabled/disabled entry" concept exists. > > > So even if the guest only unmasked one entry, it's totally different work for > > KVM underneath. This logic won't change no matter where the MMIO > > handling is.
And in fact I don't like this kind of tricky thing in the > > > kernel... > > A more fundamental problem is that host vectors are a limited resource, > > we don't want to waste them on entries that will end up unused. > > One can imagine some kind of logic where we check the pending > > bit on a masked entry and after a while give up the host vector. > For the data/address != 0 entries, I haven't seen any of them left unused in the > end. So I don't understand your point here. I'll try to give another example. Imagine a guest driver loaded and using N vectors. It gets unloaded, then another driver is loaded using N-1 vectors. Vector N is unused but it is != 0 so we will think it is used and will allocate an entry for it. > > > > This is what I said: making it spec compliant is harder as it will > > need core kernel changes. Still, it seems silly to design a > > kernel/userspace API around an internal API limitation ... > I still don't think we violate the spec here - in which case do we fail to comply with the > spec? If some device wants to send a 0 message to a 0 address on x86, it fails to > comply with the x86 spec. The guest can write a non-0 value there and use the vector afterwards. If we allocate vectors upfront we won't be able to support this. In other words, it works but it's just a heuristic. It's nice that it usually works, on x86, but we are just lucky and I suspect it will fail in strange scenarios such as a driver change, because it is not architectural in the spec. > And I know the current implementation is not elegant due to the internal API > limitation, but just saying "change" is not enough. A function to enable a separate > entry is of course good to have, but I still think we don't have enough reason for > doing it. Implementation is one thing, the kernel/user API is another one. I think we should try to make the second one future-proof, but the implementation can be partial and have some TODO items - that's okay, and it's even a good idea to build things step by step.
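To make the masked-vs-zero distinction concrete, here is a minimal C sketch (mine, not from the patch; the constants mirror the MSI-X table entry layout defined in the PCI spec) contrasting the architectural per-vector mask check with the addr/data == 0 heuristic discussed above:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* MSI-X table layout per the PCI spec: 16 bytes per entry; the
 * per-vector mask is bit 0 of the Vector Control dword at offset 12. */
#define MSIX_ENTRY_SIZE          16
#define MSIX_ENTRY_VECTOR_CTRL   12
#define MSIX_ENTRY_CTRL_MASKBIT  0x1

/* Architectural test: an entry is unusable only while its mask bit is set. */
static bool msix_entry_masked(const uint8_t *table, unsigned int nr)
{
    uint32_t ctrl;
    memcpy(&ctrl, table + nr * MSIX_ENTRY_SIZE + MSIX_ENTRY_VECTOR_CTRL,
           sizeof(ctrl));
    return ctrl & MSIX_ENTRY_CTRL_MASKBIT;
}

/* The heuristic under discussion: treat an all-zero addr/data entry as
 * unused.  A stale non-zero entry left behind by an unloaded driver
 * defeats this check. */
static bool msix_entry_zero(const uint8_t *table, unsigned int nr)
{
    uint64_t addr;
    uint32_t data;
    memcpy(&addr, table + nr * MSIX_ENTRY_SIZE, sizeof(addr));
    memcpy(&data, table + nr * MSIX_ENTRY_SIZE + 8, sizeof(data));
    return addr == 0 && data == 0;
}
```

The driver-reload example above is exactly where msix_entry_zero() misclassifies: vector N keeps its old non-zero addr/data, so the heuristic reports it as in use even though the new driver never touches it, while the mask bit remains the architecturally correct signal.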
> -- > regards > Yang, Sheng > > > > > > > The entries we had used for pci_enable_msix() (not necessarily in > > > > > sequence) are already enabled, the others are disabled. When the > > > > > device's MSI-X is enabled and the guest wants to enable a disabled > > > > > entry, we would go back to QEmu because this vector didn't exist in > > > > > the routing table. Also, pci_enable_msix() in the kernel doesn't > > > > > allow us to enable vectors one by one, only all at once. So we have > > > > > to disable MSI-X first, then enable it with new entries, which > > > > > contain the new vector the guest wants to use. This situation only > > > > > happens when the device is being initialized. After that, the kernel can know > > > > > and handle the mask bit of the enabled entry. > > > > > > > > > > I've also considered handling all MMIO operations in the kernel, and > > > > > changing irq routing in the kernel directly. But as long as irq routing > > > > > is owned by QEmu, I think it's better to leave it to QEmu... > > > > Yes, this is my suggestion, except we don't need no routing :) > > > > To inject MSI you just need address/data pairs. > > > > Look at kvm_set_msi: given address/data you can just inject > > > > the interrupt. No need for table lookups. > > > You still need to look up the data/address pair in the guest MSI-X table. The > > > routing table used here is just a replacement for that table, because we > > > can construct the entry according to the routing table. Two choices: > > > use the routing table, or create a new MSI-X table. > > > Still, the key is who owns the routing/MSI-X table. If the kernel owns > > > it, it would be straightforward to intercept all the MMIO in the kernel; > > > but if it's QEmu, we still need to go back to QEmu for it. > > Looks cleaner to do it in the kernel... > > > > > Notice the mask/unmask bits must be handled together, either in > > > > > kernel or in userspace.
Because if the kernel has handled the enabled > > > > > vector's mask bit directly, it would be out of sync with QEmu's records. > > > > > It doesn't matter as long as QEmu doesn't access the related record. And the > > > > > only place QEmu wants to consult its enabled entries' mask bit state > > > > > is when writing to MSI addr/data. The write should be discarded if the > > > > > entry is unmasked. This check has already been done by the kernel in > > > > > this patchset, so we are fine here. > > > > > > > > > > If we want to access the enabled entries' mask bits in the future, we > > > > > can directly access the device's MMIO. > > > > We really must implement this for correctness, btw. If you do not pass > > > > reads to the device, messages intended for the masked entry > > > > might still be in flight. > > > Oh, yes, the kernel would mask the device as well. I will take this > > > into consideration. > > > -- > > > regards > > > Yang, Sheng > > > > > That's the reason why I have followed Michael's advice to use > > > > > mask/unmask directly. > > > > > Hope this makes the patches more clear. I meant to add comments > > > > > for this changeset, but missed it later. > > > > > -- > > > > > regards > > > > > Yang, Sheng > > > > > > Signed-off-by: Sheng Yang <sheng@xxxxxxxxxxxxxxx> > > > > > > --- > > > > > > > > > > > > Documentation/kvm/api.txt | 22 ++++++++++++++++ > > > > > > arch/x86/kvm/x86.c | 6 ++++ > > > > > > include/linux/kvm.h | 8 +++++- > > > > > > virt/kvm/assigned-dev.c | 60 > > > > > > > > > > > > +++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, > > > > > > 95 insertions(+), 1 deletions(-) > > > > > > > > > > > > diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt > > > > > > index d82d637..f324a50 100644 > > > > > > --- a/Documentation/kvm/api.txt > > > > > > +++ b/Documentation/kvm/api.txt > > > > > > @@ -1087,6 +1087,28 @@ of 4 instructions that make up a hypercall.
> > > > > > > > > > > > If any additional field gets added to this structure later on, a > > > > > > bit for > > > > > > > > > > > > that additional piece of information will be set in the flags > > > > > > bitmap. > > > > > > > > > > > > +4.47 KVM_ASSIGN_REG_MSIX_MMIO > > > > > > + > > > > > > +Capability: KVM_CAP_DEVICE_MSIX_MASK > > > > > > +Architectures: x86 > > > > > > +Type: vm ioctl > > > > > > +Parameters: struct kvm_assigned_msix_mmio (in) > > > > > > +Returns: 0 on success, !0 on error > > > > > > + > > > > > > +struct kvm_assigned_msix_mmio { > > > > > > + /* Assigned device's ID */ > > > > > > + __u32 assigned_dev_id; > > > > > > + /* MSI-X table MMIO address */ > > > > > > + __u64 base_addr; > > > > > > + /* Must be 0 */ > > > > > > + __u32 flags; > > > > > > + /* Must be 0, reserved for future use */ > > > > > > + __u64 reserved; > > > > > > +}; > > > > > > + > > > > > > +This ioctl would enable in-kernel MSI-X emulation, which would > > > > > > handle MSI-X +mask bit in the kernel. > > > > > > + > > > > > > > > > > > > 5. 
The kvm_run structure > > > > > > > > > > > > Application code obtains a pointer to the kvm_run structure by > > > > > > > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > > > > index fc62546..ba07a2f 100644 > > > > > > --- a/arch/x86/kvm/x86.c > > > > > > +++ b/arch/x86/kvm/x86.c > > > > > > @@ -1927,6 +1927,8 @@ int kvm_dev_ioctl_check_extension(long ext) > > > > > > > > > > > > case KVM_CAP_X86_ROBUST_SINGLESTEP: > > > > > > case KVM_CAP_XSAVE: > > > > > > > > > > > > case KVM_CAP_ENABLE_CAP: > > > > > > + case KVM_CAP_DEVICE_MSIX_EXT: > > > > > > > > > > > > + case KVM_CAP_DEVICE_MSIX_MASK: > > > > > > r = 1; > > > > > > break; > > > > > > > > > > > > case KVM_CAP_COALESCED_MMIO: > > > > > > @@ -2717,6 +2719,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct > > > > > > kvm_vcpu *vcpu, return -EINVAL; > > > > > > > > > > > > switch (cap->cap) { > > > > > > > > > > > > + case KVM_CAP_DEVICE_MSIX_EXT: > > > > > > + vcpu->kvm->arch.msix_flags_enabled = true; > > > > > > + r = 0; > > > > > > + break; > > > > > > > > > > > > default: > > > > > > r = -EINVAL; > > > > > > break; > > > > > > > > > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h > > > > > > index 0a7bd34..1494ed0 100644 > > > > > > --- a/include/linux/kvm.h > > > > > > +++ b/include/linux/kvm.h > > > > > > @@ -540,6 +540,10 @@ struct kvm_ppc_pvinfo { > > > > > > > > > > > > #endif > > > > > > #define KVM_CAP_PPC_GET_PVINFO 57 > > > > > > #define KVM_CAP_PPC_IRQ_LEVEL 58 > > > > > > > > > > > > +#ifdef __KVM_HAVE_MSIX > > > > > > +#define KVM_CAP_DEVICE_MSIX_EXT 59 > > > > > > +#define KVM_CAP_DEVICE_MSIX_MASK 60 > > > > > > +#endif > > > > > > > > > > > > #ifdef KVM_CAP_IRQ_ROUTING > > > > > > > > > > > > @@ -671,6 +675,8 @@ struct kvm_clock_data { > > > > > > > > > > > > #define KVM_XEN_HVM_CONFIG _IOW(KVMIO, 0x7a, struct > > > > > > > > > > > > kvm_xen_hvm_config) #define KVM_SET_CLOCK _IOW(KVMIO, > > > > > > 0x7b, struct kvm_clock_data) #define KVM_GET_CLOCK > > > > > 
> _IOR(KVMIO, 0x7c, struct kvm_clock_data) +#define > > > > > > KVM_ASSIGN_REG_MSIX_MMIO _IOW(KVMIO, 0x7d, \ > > > > > > + struct kvm_assigned_msix_mmio) > > > > > > > > > > > > /* Available with KVM_CAP_PIT_STATE2 */ > > > > > > #define KVM_GET_PIT2 _IOR(KVMIO, 0x9f, struct > > > > > > > > > > > > kvm_pit_state2) #define KVM_SET_PIT2 _IOW(KVMIO, > > > > > > 0xa0, struct kvm_pit_state2) @@ -802,7 +808,7 @@ struct > > > > > > kvm_assigned_msix_mmio { > > > > > > > > > > > > __u32 assigned_dev_id; > > > > > > __u64 base_addr; > > > > > > __u32 flags; > > > > > > > > > > > > - __u32 reserved[2]; > > > > > > + __u64 reserved; > > > > > > > > > > > > }; > > > > > > > > > > > > #endif /* __LINUX_KVM_H */ > > > > > > > > > > > > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c > > > > > > index 5d2adc4..9573194 100644 > > > > > > --- a/virt/kvm/assigned-dev.c > > > > > > +++ b/virt/kvm/assigned-dev.c > > > > > > @@ -17,6 +17,8 @@ > > > > > > > > > > > > #include <linux/pci.h> > > > > > > #include <linux/interrupt.h> > > > > > > #include <linux/slab.h> > > > > > > > > > > > > +#include <linux/irqnr.h> > > > > > > + > > > > > > > > > > > > #include "irq.h" > > > > > > > > > > > > static struct kvm_assigned_dev_kernel > > > > > > *kvm_find_assigned_dev(struct > > > > > > > > > > > > list_head *head, @@ -169,6 +171,14 @@ static void > > > > > > deassign_host_irq(struct kvm *kvm, */ > > > > > > > > > > > > if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) { > > > > > > > > > > > > int i; > > > > > > > > > > > > +#ifdef __KVM_HAVE_MSIX > > > > > > + if (assigned_dev->msix_mmio_base) { > > > > > > + mutex_lock(&kvm->slots_lock); > > > > > > + kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS, > > > > > > + &assigned_dev->msix_mmio_dev); > > > > > > + mutex_unlock(&kvm->slots_lock); > > > > > > + } > > > > > > +#endif > > > > > > > > > > > > for (i = 0; i < assigned_dev->entries_nr; i++) > > > > > > > > > > > > disable_irq_nosync(assigned_dev-> > > > > > > > > 
> > > > host_msix_entries[i].vector); > > > > > > > > > > > > @@ -318,6 +328,15 @@ static int > > > > > > assigned_device_enable_host_msix(struct kvm *kvm, goto err; > > > > > > > > > > > > } > > > > > > > > > > > > + if (dev->msix_mmio_base) { > > > > > > + mutex_lock(&kvm->slots_lock); > > > > > > + r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, > > > > > > + &dev->msix_mmio_dev); > > > > > > + mutex_unlock(&kvm->slots_lock); > > > > > > + if (r) > > > > > > + goto err; > > > > > > + } > > > > > > + > > > > > > > > > > > > return 0; > > > > > > > > > > > > err: > > > > > > for (i -= 1; i >= 0; i--) > > > > > > > > > > > > @@ -870,6 +889,31 @@ static const struct kvm_io_device_ops > > > > > > msix_mmio_ops = { .write = msix_mmio_write, > > > > > > > > > > > > }; > > > > > > > > > > > > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm, > > > > > > + struct kvm_assigned_msix_mmio *msix_mmio) > > > > > > +{ > > > > > > + int r = 0; > > > > > > + struct kvm_assigned_dev_kernel *adev; > > > > > > + > > > > > > + mutex_lock(&kvm->lock); > > > > > > + adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head, > > > > > > + msix_mmio->assigned_dev_id); > > > > > > + if (!adev) { > > > > > > + r = -EINVAL; > > > > > > + goto out; > > > > > > + } > > > > > > + if (msix_mmio->base_addr == 0) { > > > > > > + r = -EINVAL; > > > > > > + goto out; > > > > > > + } > > > > > > + adev->msix_mmio_base = msix_mmio->base_addr; > > > > > > + > > > > > > + kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops); > > > > > > +out: > > > > > > + mutex_unlock(&kvm->lock); > > > > > > + > > > > > > + return r; > > > > > > +} > > > > > > > > > > > > #endif > > > > > > > > > > > > long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl, > > > > > > > > > > > > @@ -982,6 +1026,22 @@ long kvm_vm_ioctl_assigned_device(struct kvm > > > > > > *kvm, unsigned ioctl, goto out; > > > > > > > > > > > > break; > > > > > > > > > > > > } > > > > > > > > > > > > + case 
KVM_ASSIGN_REG_MSIX_MMIO: { > > > > > > + struct kvm_assigned_msix_mmio msix_mmio; > > > > > > + > > > > > > + r = -EFAULT; > > > > > > + if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio))) > > > > > > + goto out; > > > > > > + > > > > > > + r = -EINVAL; > > > > > > + if (msix_mmio.flags != 0 || msix_mmio.reserved != 0) > > > > > > + goto out; > > > > > > + > > > > > > + r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio); > > > > > > + if (r) > > > > > > + goto out; > > > > > > + break; > > > > > > + } > > > > > > > > > > > > #endif > > > > > > } > > > > > > out:
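For reference, here is a small self-contained sketch of the argument validation the new ioctl performs (the struct layout is copied from the patch; check_msix_mmio() is a hypothetical stand-in combining the checks done in kvm_vm_ioctl_assigned_device() and kvm_vm_ioctl_register_msix_mmio(), not a real kernel function):

```c
#include <errno.h>
#include <stdint.h>

/* Mirrors the struct the patch adds to include/linux/kvm.h. */
struct kvm_assigned_msix_mmio {
    uint32_t assigned_dev_id;  /* Assigned device's ID */
    uint64_t base_addr;        /* MSI-X table MMIO address */
    uint32_t flags;            /* Must be 0 */
    uint64_t reserved;         /* Must be 0, reserved for future use */
};

/* Hypothetical stand-in for the kernel-side sanity checks applied
 * before the MSI-X MMIO device is registered. */
static int check_msix_mmio(const struct kvm_assigned_msix_mmio *m)
{
    if (m->flags != 0 || m->reserved != 0)
        return -EINVAL;        /* reserved fields must stay zero */
    if (m->base_addr == 0)
        return -EINVAL;        /* MSI-X table MMIO base must be set */
    return 0;
}
```

Userspace would fill this struct and issue ioctl(vm_fd, KVM_ASSIGN_REG_MSIX_MMIO, &m); rejecting any nonzero flags or reserved bits up front is what lets those fields acquire meaning in later kernel versions without breaking old userspace.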