On Mon, Nov 08, 2010 at 11:18:34AM +0800, Sheng Yang wrote: > On Friday 05 November 2010 21:27:42 Michael S. Tsirkin wrote: > > On Fri, Nov 05, 2010 at 06:53:50PM +0800, Sheng Yang wrote: > > > On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote: > > > > On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote: > > > > > > > +}; > > > > > > > + > > > > > > > +This ioctl would enable in-kernel MSI-X emulation, which would > > > > > > > handle MSI-X +mask bit in the kernel. > > > > > > > > > > > > What happens on repeated calls when it's already enabled? > > > > > > How does one disable after it's enabled? > > > > > > > > > > Suppose this should only be called once. But again would update the > > > > > MMIO base. > > > > > > > > So maybe add this all to documentation. > > > > > > > > > It > > > > > would be disabled along with device deassignment. > > > > > > > > So what are you going to do when guest enables and later disables MSIX? > > > > Disable assignment completely? > > > > > > This device goes with PCI resources allocation, rather than IRQ > > > assignment. > > > > I see. I guess we can live with it, but it seems tied to a specific > > mode of use. Would be better to support enabling together with msix too, > > this requires an ability to disable. > > > > > > > We enable it explicitly because > > > > > not all devices have MSI-X capability. > > > > > > > > > > > > + > > > > > > > > > > > > This is very specialized for assigned devices. I would like an > > > > > > interface not tied to assigned devices explicitly > > > > > > (even if the implementation at the moment only works > > > > > > for assigned devices, I can patch that in later). E.g. vhost-net > > > > > > would benefit from such an extension too. Maybe tie this to a > > > > > > special GSI and this GSI can be an assigned device or not? > > > > > > > > > > You're talking about KVM_ASSIGN_GET_MSIX_ENTRY? > > > > > > > > Yes but also, and more importantly, about KVM_ASSIGN_REG_MSIX_MMIO > > > > and kvm_assigned_msix_mmio really. > > > > > > > > > We can't tie it to GSI, because we > > > > > should also cover entries without GSI. The entry number should be > > > > > fine for all kinds of devices using MSI-X. > > > > > > > > I don't completely understand: entries without GSI > > > > never need to inject an interrupt. Why do we care > > > > about them? Let's pass such accesses to userspace. > > > > > > > > In any case, I think MSIX GSIs are inexpensive (we never search them > > > > all), so we could create GSIs for all entries if we wanted to. > > > > > > All mask bit accessing controlled by kernel, in order to keep consistent. > > > > Well I think passing whatever we can't handle to userspace makes > > complete sense. We do this for address/data writes after all. > > If we wanted to do it *all* in kernel we would handle > > address/data there too. I think Avi suggested this at some point. > > I think Avi's point was we handle mask bit half in the kernel and half in the > userspace, which makes API complex. I agree with this so I moved all mask bit > handling to kernel. > > Data/address is another issue, due to QEmu own the routing table. We can change it > in the future, but this is not related to this patch. > > > > The main point is msix mmio BAR handling is completely unrelated > > to the assigned device. Emulated devices have an msix BAR as well. > > So tying it to an assigned devices is suboptimal, we'll need to > > grow yet another interface for these. > > If you only means the name, we can change that. I mean it shouldn't use the assigned device id. Use some other index: I suggested adding a new GSI type for this, but can be something else if you prefer. > But I think all MSI-X entries > would have entry number, regardless of if it got GSI number. Keep entry number as > parameter makes more sense to me. > > And still, the patch's only current user is assigned device. I am glad if it can > cover the other future case along with it, but that's not the primary target of > this patch. It's a problem I think. Stop thinking about the patch for a moment and think about kvm as a whole. We know we will want such an interface for emulated and PV devices. What then? Add another interface and a bunch of code to support it? Point being, since userspace interfaces need to be supported forever, they should be generic if possible. > > > > > > > I am not sure about what PV one would looks like. > > > > > I use kvm_assigned_msix_entry > > > > > just because it can be easily reused. > > > > > > > > Not only PV, emulated devices too. OK let's try to figure out: > > > > > > > > What happens with userspace is, we inject interrupts either through > > > > ioctls or eventfds. Avoiding an exit to userspace on MSIX > > > > access is beneficial in exactly the same way. > > > > > > > > We could have an ioctl to pass in the table addresses, and mapping > > > > table to relevant GSIs. For example, add kvm_msix_mmio with table > > > > start/end, also specify which GSIs are covered. > > > > Maybe ask that userspace allocate all GSIs for a table consequitively? > > > > Or add kvm_irq_routing_msix which is same as msi but > > > > also has the mmio addresses? Bonus points for avoiding the need > > > > to scan all table on each write. For example, don't scan at all, > > > > instead just set a bit in KVM, and set irq entry on the next > > > > interrupt only: we have KVM_MAX_MSIX_PER_DEV as 256 so maybe a small 32 > > > > byte bitmap would be enough for this, avoding a linear scan. > > > > > > > > When we pass in kvm_msix_mmio, kvm would start registering masked > > > > state. > > > > > > > > > > > +4.48 KVM_ASSIGN_GET_MSIX_ENTRY > > > > > > > + > > > > > > > +Capability: KVM_CAP_DEVICE_MSIX_MASK > > > > > > > +Architectures: x86 > > > > > > > +Type: vm ioctl > > > > > > > +Parameters: struct kvm_assigned_msix_entry (in and out) > > > > > > > +Returns: 0 on success, !0 on error > > > > > > > + > > > > > > > +struct kvm_assigned_msix_entry { > > > > > > > + /* Assigned device's ID */ > > > > > > > + __u32 assigned_dev_id; > > > > > > > + /* Ignored */ > > > > > > > + __u32 gsi; > > > > > > > + /* The index of entry in the MSI-X table */ > > > > > > > + __u16 entry; > > > > > > > + /* Querying flags and returning status */ > > > > > > > + __u16 flags; > > > > > > > + /* Must be 0 */ > > > > > > > + __u16 padding[2]; > > > > > > > +}; > > > > > > > + > > > > > > > +This ioctl would allow userspace to get the status of one > > > > > > > specific MSI-X +entry. Currently we support mask bit status > > > > > > > querying. + > > > > > > > > > > > > > > 5. The kvm_run structure > > > > > > > > > > > > > > Application code obtains a pointer to the kvm_run structure by > > > > > > > > > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > > > > > index f3f86b2..8fd5121 100644 > > > > > > > --- a/arch/x86/kvm/x86.c > > > > > > > +++ b/arch/x86/kvm/x86.c > > > > > > > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext) > > > > > > > > > > > > > > case KVM_CAP_DEBUGREGS: > > > > > > > case KVM_CAP_X86_ROBUST_SINGLESTEP: > > > > > > > > > > > > > > case KVM_CAP_XSAVE: > > > > > > > + case KVM_CAP_DEVICE_MSIX_MASK: > > > > > > > r = 1; > > > > > > > break; > > > > > > > > > > > > > > case KVM_CAP_COALESCED_MMIO: > > > > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h > > > > > > > index 919ae53..eafafb1 100644 > > > > > > > --- a/include/linux/kvm.h > > > > > > > +++ b/include/linux/kvm.h > > > > > > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo { > > > > > > > > > > > > > > #endif > > > > > > > #define KVM_CAP_PPC_GET_PVINFO 57 > > > > > > > #define KVM_CAP_PPC_IRQ_LEVEL 58 > > > > > > > > > > > > > > +#ifdef __KVM_HAVE_MSIX > > > > > > > +#define KVM_CAP_DEVICE_MSIX_MASK 59 > > > > > > > +#endif > > > > > > > > > > > > We seem to have these HAVE macros all over. > > > > > > Avi, what's the idea? Let's drop them? > > > > > > > > > > Um? This kind of marcos are used here to avoid vendor specific string > > > > > in the header file. > > > > > > > > Am I the only one that dislikes the ifdefs we have all over this code > > > > then? Rest of linux tries to keep ifdefs local by stubbing > > > > functionality out. > > > > > > > > > > > + BUG_ON(adev->msix_mmio_base == 0); > > > > > > > > > > > > Below I see > > > > > > > > > > > > adev->msix_mmio_base = msix_mmio->base_addr; > > > > > > > > > > > > which comes from userspace. BUG_ON is an unfriendly way to > > > > > > tell user about a bug in qemu. > > > > > > > > > > > > Anyway, I don't think we should special-case 0 gpa. > > > > > > It's up to the user where to base the table. > > > > > > Use a separate flag to signal that table was initialized. > > > > > > > > > > Base_addr can't be 0 in any case. Only the unimplemented BAR are > > > > > hardwired to zero. > > > > > > > > No, the PCI spec does not say so. Check it out: the unimplemented BAR > > > > is hardwired to zero but implemented can be set to 0 as well. > > > > > > OK, OK, OK, I would add one more flag. > > > > > > > > Also if we get here with base_addr = 0, it is surely a bug. > > > > > > > > > > > > + > > > > > > > + return -EINVAL; > > > > > > > +} > > > > > > > + > > > > > > > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t > > > > > > > addr, int len, + void *val) > > > > > > > +{ > > > > > > > + struct kvm_assigned_dev_kernel *adev = > > > > > > > + container_of(this, struct kvm_assigned_dev_kernel, > > > > > > > + msix_mmio_dev); > > > > > > > + int idx, r = 0; > > > > > > > + u32 entry[4]; > > > > > > > + struct kvm_kernel_irq_routing_entry e; > > > > > > > + > > > > > > > + mutex_lock(&adev->kvm->lock); > > > > > > > + if (!msix_mmio_in_range(adev, addr, len)) { > > > > > > > + r = -EOPNOTSUPP; > > > > > > > + goto out; > > > > > > > + } > > > > > > > + if ((addr & 0x3) || len != 4) > > > > > > > + goto out; > > > > > > > + > > > > > > > + idx = msix_get_enabled_idx(adev, addr, len); > > > > > > > + if (idx < 0) { > > > > > > > + idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE; > > > > > > > + if ((addr % PCI_MSIX_ENTRY_SIZE) == > > > > > > > + PCI_MSIX_ENTRY_VECTOR_CTRL) > > > > > > > + *(unsigned long *)val = > > > > > > > + test_bit(idx, adev->msix_mask_bitmap) ? > > > > > > > + PCI_MSIX_ENTRY_CTRL_MASKBIT : 0; > > > > > > > > > > > > I suspect this is wrong for big endian platforms: PCI returns > > > > > > all values in little endian format. > > > > > > > > > > Let big endian platform take care of it later... There is only x86 > > > > > has __KVM_HAVE_MSIX now. > > > > > > > > Well it's easier to make the code correct when you write it than later. > > > > > > Go back to me if someone want to implement MSI device assignment on > > > big-endian machine. > > > > I think it's more elegant to make the code correct the first time around. > > What are we arguing about? A couple of cpu_to_le macros? > > Sorry, I am really not familiar with big endian machine and related macros. Also > lacks of way to test it. Would give it a try. Actually, there is a way to do some limited check on a nx86 pc: just tag the entry appropriately: __le32 entry[4]; (Hmm, isn't there some macro instead of 4?) And then make sparse not complain about code (make C=1). > > > > > > > > > + else > > > > > > > + r = -EOPNOTSUPP; > > > > > > > + goto out; > > > > > > > + } > > > > > > > + > > > > > > > + r = kvm_get_irq_routing_entry(adev->kvm, > > > > > > > + adev->guest_msix_entries[idx].vector, &e); > > > > > > > + if (r || e.type != KVM_IRQ_ROUTING_MSI) { > > > > > > > + printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! " > > > > > > > + "idx %d, addr 0x%llx, len %d\n", idx, addr, len); > > > > > > > > > > > > Can a malicious userspace trigger this intentionally? > > > > > > > > > > No. > > > > > > > > Here's a scenario: bind an MSIX routing entry to device, then later > > > > change its type to regular interrupt. Fills the log with warnings, > > > > doesn't it? > > > > > > Does QEmu suppose to discard the old routing? > > > > It does not matter. We can't trust userspace to do the right thing. > > OK, let's eliminate this. Yea. debug or something. Also, since it's a read, we can pass it up to userspace which will maybe make it easier to debug. > -- > regards > Yang, Sheng -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html