On Mon, Jul 30, 2012 at 06:26:31PM -0600, Alex Williamson wrote: > On Tue, 2012-07-31 at 03:01 +0300, Michael S. Tsirkin wrote: > > On Mon, Jul 30, 2012 at 10:22:10AM -0600, Alex Williamson wrote: > > > On Sun, 2012-07-29 at 17:54 +0300, Michael S. Tsirkin wrote: > > > > On Tue, Jul 24, 2012 at 02:43:22PM -0600, Alex Williamson wrote: > > > > > This new ioctl enables an eventfd to be triggered when an EOI is > > > > > written for a specified irqchip pin. The first user of this will > > > > > be external device assignment through VFIO, using a level irqfd > > > > > for asserting a PCI INTx interrupt and this interface for de-assert > > > > > and notification once the interrupt is serviced. > > > > > > > > > > Here we make use of the reference counting of the _irq_source > > > > > object allowing us to share it with an irqfd and cleanup regardless > > > > > of the release order. > > > > > > > > > > Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx> > > > > > > > > > --- > > > > > > > > > > Documentation/virtual/kvm/api.txt | 21 ++ > > > > > arch/x86/kvm/x86.c | 2 > > > > > include/linux/kvm.h | 15 ++ > > > > > include/linux/kvm_host.h | 13 + > > > > > virt/kvm/eventfd.c | 336 +++++++++++++++++++++++++++++++++++++ > > > > > virt/kvm/kvm_main.c | 11 + > > > > > 6 files changed, 398 insertions(+) > > > > > > > > > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt > > > > > index 3911e62..8cd6b36 100644 > > > > > --- a/Documentation/virtual/kvm/api.txt > > > > > +++ b/Documentation/virtual/kvm/api.txt > > > > > @@ -1989,6 +1989,27 @@ return the hash table order in the parameter. (If the guest is using > > > > > the virtualized real-mode area (VRMA) facility, the kernel will > > > > > re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.) 
> > > > > > > > > > +4.77 KVM_EOIFD > > > > > + > > > > > +Capability: KVM_CAP_EOIFD > > > > > +Architectures: x86 > > > > > +Type: vm ioctl > > > > > +Parameters: struct kvm_eoifd (in) > > > > > +Returns: 0 on success, < 0 on error > > > > > + > > > > > +KVM_EOIFD allows userspace to receive interrupt EOI notification > > > > > +through an eventfd. > > > > I thought about it some more, and I think it should be renamed to an > > > > interrupt ack notification rather than an EOI notification. > > > > For example, consider userspace that uses threaded interrupts. > > > > Currently what will happen is each interrupt will be injected > > > > twice, since on EOI the device is still asserting it. > > > I don't follow, why is userspace writing an EOI to the ioapic if it > > > hasn't handled the interrupt? > > It has handled it - it disabled the hardware interrupt. > So it's not injected twice, it's held pending at the ioapic the second > time, just like hardware. It is not like hardware at all. In hardware there is no overhead; here you interrupt the guest to run the handler in the host. > Maybe there's a future optimization there, > but I don't think it's appropriate at this time. Yes. But to make it *possible* in the future we must remove the requirement to signal the fd immediately on EOI. So rename it ackfd. > > > and why wouldn't the same happen on bare > > > metal? > > On bare metal the level does not matter as long as the interrupt > > is disabled. > > > > One fix would be to delay the event until the interrupt is re-enabled. > > > > Now I am not asking you to fix this immediately, > > > > but I think we should make the interface generic by > > > > saying we report an ack to userspace and not specifically EOI. > > > Using the word "delay" in the context of interrupt delivery raises all > > > sorts of red flags for me, but I really don't understand your argument. > > I am saying it's an "ack" of the interrupt that userspace cares about. 
> > The fact it is done by EOI is an implementation detail. > > The implementation is how an EOI is generated on an ioapic, not that an > EOI exists. How do I read a hardware spec and figure out what "ack of > interrupt" means? It just means it will be called after the guest has completed handling the interrupt. How we detect that is our problem. > > > > > kvm_eoifd.fd specifies the eventfd used for > > > > > +notification. KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd > > > > > +once assigned. KVM_EOIFD also requires additional bits set in > > > > > +kvm_eoifd.flags to bind to the proper interrupt line. The > > > > > +KVM_EOIFD_FLAG_LEVEL_IRQFD indicates that kvm_eoifd.key is provided > > > > > +and is a key from a level triggered interrupt (configured from > > > > > +KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL). The EOI notification is bound > > > > > +to the same GSI and irqchip input as the irqfd. Both kvm_eoifd.key > > > > > +and KVM_EOIFD_FLAG_LEVEL_IRQFD must be specified on assignment and > > > > > +de-assignment of KVM_EOIFD. A level irqfd may only be bound to a > > > > > +single eoifd. KVM_CAP_EOIFD_LEVEL_IRQFD indicates support of > > > > > +KVM_EOIFD_FLAG_LEVEL_IRQFD. > > > > Hmm, returning the key means we'll need to keep refcounting for source > > > > IDs around forever. I liked passing the fd better: make the implementation > > > > match the interface and not the other way around. > > > False, a source ID has a finite lifecycle. The fd approach was broken. > > > Holding the irqfd context imposed too many dependencies between eoifd > > > and irqfd, necessitating things like one interface disabling another. I > > > thoroughly disagree with that approach. > > You keep saying this but it is still true: once the irqfd > > is closed the eoifd does not get any more interrupts. > How does that matter? Well, if it does not get events it is disabled, so you have one interface disabling another anyway. > > > > > 5. 
The kvm_run structure > > > > > ------------------------ > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > > > index 9ded39d..8f3164e 100644 > > > > > --- a/arch/x86/kvm/x86.c > > > > > +++ b/arch/x86/kvm/x86.c > > > > > @@ -2171,6 +2171,8 @@ int kvm_dev_ioctl_check_extension(long ext) > > > > > case KVM_CAP_PCI_2_3: > > > > > case KVM_CAP_KVMCLOCK_CTRL: > > > > > case KVM_CAP_IRQFD_LEVEL: > > > > > + case KVM_CAP_EOIFD: > > > > > + case KVM_CAP_EOIFD_LEVEL_IRQFD: > > > > > r = 1; > > > > > break; > > > > > case KVM_CAP_COALESCED_MMIO: > > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h > > > > > index b2e6e4f..effb916 100644 > > > > > --- a/include/linux/kvm.h > > > > > +++ b/include/linux/kvm.h > > > > > @@ -619,6 +619,8 @@ struct kvm_ppc_smmu_info { > > > > > #define KVM_CAP_S390_COW 79 > > > > > #define KVM_CAP_PPC_ALLOC_HTAB 80 > > > > > #define KVM_CAP_IRQFD_LEVEL 81 > > > > > +#define KVM_CAP_EOIFD 82 > > > > > +#define KVM_CAP_EOIFD_LEVEL_IRQFD 83 > > > > > > > > > > #ifdef KVM_CAP_IRQ_ROUTING > > > > > > > > > > @@ -694,6 +696,17 @@ struct kvm_irqfd { > > > > > __u8 pad[20]; > > > > > }; > > > > > > > > > > +#define KVM_EOIFD_FLAG_DEASSIGN (1 << 0) > > > > > +/* Available with KVM_CAP_EOIFD_LEVEL_IRQFD */ > > > > > +#define KVM_EOIFD_FLAG_LEVEL_IRQFD (1 << 1) > > > > > + > > > > > +struct kvm_eoifd { > > > > > + __u32 fd; > > > > > + __u32 flags; > > > > > + __u32 key; > > > > > + __u8 pad[20]; > > > > > +}; > > > > > + > > > > > struct kvm_clock_data { > > > > > __u64 clock; > > > > > __u32 flags; > > > > > @@ -834,6 +847,8 @@ struct kvm_s390_ucas_mapping { > > > > > #define KVM_PPC_GET_SMMU_INFO _IOR(KVMIO, 0xa6, struct kvm_ppc_smmu_info) > > > > > /* Available with KVM_CAP_PPC_ALLOC_HTAB */ > > > > > #define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32) > > > > > +/* Available with KVM_CAP_EOIFD */ > > > > > +#define KVM_EOIFD _IOW(KVMIO, 0xa8, struct kvm_eoifd) > > > > > > > > > > /* > > > > > * ioctls for vcpu fds > > 
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > > > > index c73f071..01e72a6 100644 > > > > > --- a/include/linux/kvm_host.h > > > > > +++ b/include/linux/kvm_host.h > > > > > @@ -289,6 +289,10 @@ struct kvm { > > > > > struct mutex lock; > > > > > struct list_head items; > > > > > } irqsources; > > > > > + struct { > > > > > + spinlock_t lock; > > > > > + struct list_head items; > > > > > + } eoifds; > > > > > #endif > > > > > struct kvm_vm_stat stat; > > > > > struct kvm_arch arch; > > > > > @@ -832,6 +836,8 @@ int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args); > > > > > void kvm_irqfd_release(struct kvm *kvm); > > > > > void kvm_irq_routing_update(struct kvm *, struct kvm_irq_routing_table *); > > > > > int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args); > > > > > +int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args); > > > > > +void kvm_eoifd_release(struct kvm *kvm); > > > > > > > > > > #else > > > > > > > > > > @@ -857,6 +863,13 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args) > > > > > return -ENOSYS; > > > > > } > > > > > > > > > > +static inline int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args) > > > > > +{ > > > > > + return -ENOSYS; > > > > > +} > > > > > + > > > > > +static inline void kvm_eoifd_release(struct kvm *kvm) {} > > > > > + > > > > > #endif /* CONFIG_HAVE_KVM_EVENTFD */ > > > > > > > > > > #ifdef CONFIG_KVM_APIC_ARCHITECTURE > > > > > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c > > > > > index 878cb52..3aa2d62 100644 > > > > > --- a/virt/kvm/eventfd.c > > > > > +++ b/virt/kvm/eventfd.c > > > > > @@ -95,6 +95,25 @@ static struct _irq_source *_irq_source_alloc(struct kvm *kvm, int gsi) > > > > > return source; > > > > > } > > > > > > > > > > +static struct _irq_source *_irq_source_get_from_key(struct kvm *kvm, int key) > > > > > +{ > > > > > + struct _irq_source *tmp, *source = ERR_PTR(-ENOENT); > > > > > + > > > > > + mutex_lock(&kvm->irqsources.lock); > 
> > > > + > > > > > + list_for_each_entry(tmp, &kvm->irqsources.items, list) { > > > > > + if (tmp->id == key) { > > > > > + source = tmp; > > > > > + kref_get(&source->kref); > > > > > + break; > > > > > + } > > > > > + } > > > > > + > > > > > + mutex_unlock(&kvm->irqsources.lock); > > > > > + > > > > > + return source; > > > > > +} > > > > > + > > > > > /* > > > > > * -------------------------------------------------------------------- > > > > > * irqfd: Allows an fd to be used to inject an interrupt to the guest > > > > > @@ -406,6 +425,8 @@ kvm_eventfd_init(struct kvm *kvm) > > > > > INIT_LIST_HEAD(&kvm->ioeventfds); > > > > > mutex_init(&kvm->irqsources.lock); > > > > > INIT_LIST_HEAD(&kvm->irqsources.items); > > > > > + spin_lock_init(&kvm->eoifds.lock); > > > > > + INIT_LIST_HEAD(&kvm->eoifds.items); > > > > > } > > > > > > > > > > /* > > > > > @@ -772,3 +793,318 @@ kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args) > > > > > > > > > > return kvm_assign_ioeventfd(kvm, args); > > > > > } > > > > > + > > > > > +/* > > > > > + * -------------------------------------------------------------------- > > > > > + * eoifd: Translate KVM APIC/IOAPIC EOI into eventfd signal. > > > > > + * > > > > > + * userspace can register with an eventfd for receiving > > > > > + * notification when an EOI occurs. 
> > > > > + * -------------------------------------------------------------------- > > > > > + */ > > > > > + > > > > > +struct _eoifd { > > > > > + /* eventfd triggered on EOI */ > > > > > + struct eventfd_ctx *eventfd; > > > > > + /* irq source ID de-asserted on EOI */ > > > > > + struct _irq_source *source; > > > > > + wait_queue_t wait; > > > > > + /* EOI notification from KVM */ > > > > > + struct kvm_irq_ack_notifier notifier; > > > > > + struct list_head list; > > > > > + poll_table pt; > > > > > + struct work_struct shutdown; > > > > > +}; > > > > > + > > > > > +/* Called under eoifds.lock */ > > > > > +static void eoifd_shutdown(struct work_struct *work) > > > > > +{ > > > > > + struct _eoifd *eoifd = container_of(work, struct _eoifd, shutdown); > > > > > + struct kvm *kvm = eoifd->source->kvm; > > > > > + u64 cnt; > > > > > + > > > > > + /* > > > > > + * Stop EOI signaling > > > > > + */ > > > > > + kvm_unregister_irq_ack_notifier(kvm, &eoifd->notifier); > > > > > + > > > > > + /* > > > > > + * Synchronize with the wait-queue and unhook ourselves to prevent > > > > > + * further events. > > > > > + */ > > > > > + eventfd_ctx_remove_wait_queue(eoifd->eventfd, &eoifd->wait, &cnt); > > > > > + > > > > > + /* > > > > > + * Release resources > > > > > + */ > > > > > + eventfd_ctx_put(eoifd->eventfd); > > > > > + _irq_source_put(eoifd->source); > > > > > + kfree(eoifd); > > > > > +} > > > > > + > > > > > +/* assumes kvm->eoifds.lock is held */ > > > > > +static bool eoifd_is_active(struct _eoifd *eoifd) > > > > > +{ > > > > > + return list_empty(&eoifd->list) ? 
false : true; > > > > > +} > > > > > + > > > > > +/* > > > > > + * Mark the eoifd as inactive and schedule it for removal > > > > > + * > > > > > + * assumes kvm->eoifds.lock is held > > > > > + */ > > > > > +static void eoifd_deactivate(struct _eoifd *eoifd) > > > > > +{ > > > > > + BUG_ON(!eoifd_is_active(eoifd)); > > > > > + > > > > > + list_del_init(&eoifd->list); > > > > > + > > > > > + queue_work(irqfd_cleanup_wq, &eoifd->shutdown); > > > > > +} > > > > > + > > > > > +/* > > > > > + * Called with wqh->lock held and interrupts disabled > > > > > + */ > > > > > +static int eoifd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) > > > > > +{ > > > > > + unsigned long flags = (unsigned long)key; > > > > > + > > > > > + if (unlikely(flags & POLLHUP)) { > > > > > + /* The eventfd is closing, detach from KVM */ > > > > > + struct _eoifd *eoifd = container_of(wait, struct _eoifd, wait); > > > > > + struct kvm *kvm = eoifd->source->kvm; > > > > > + unsigned long flags; > > > > > + > > > > > + spin_lock_irqsave(&kvm->eoifds.lock, flags); > > > > > + > > > > > + /* > > > > > + * We must check if someone deactivated the eoifd before > > > > > + * we could acquire the eoifds.lock since the item is > > > > > + * deactivated from the KVM side before it is unhooked from > > > > > + * the wait-queue. If it is already deactivated, we can > > > > > + * simply return knowing the other side will cleanup for us. 
> > > > > + * We cannot race against the eoifd going away since the > > > > > + * other side is required to acquire wqh->lock, which we hold > > > > > + */ > > > > > + if (eoifd_is_active(eoifd)) > > > > > + eoifd_deactivate(eoifd); > > > > > + > > > > > + spin_unlock_irqrestore(&kvm->eoifds.lock, flags); > > > > > + } > > > > > + > > > > > + return 0; > > > > > +} > > > > Looks like there is a bug here: if I close the irqfd, then close the eoifd, > > > > the key is not immediately released, so an attempt to create > > > > an irqfd can fail to get the source id. > > > Both irqfd and eoifd use the same workqueue for releasing objects and > > > both flush on assign. > > > > Maybe we should simply document that userspace should deassign > > > > the eoifd before closing it? This is what we do for ioeventfd. > > > > If we do this, the whole polling code can go away completely. > > > You're again ignoring the close problem. We cannot document around an > > > impossible requirement that fds are always deassigned before close. > > Well, userspace can easily call a deassign ioctl. Why is it so important > > that deassign is not required? > Because everything allocated through a file descriptor, specific to that > file descriptor, should be freed when the file descriptor is closed. > That's what people expect. That's what documentation is for. > > > IMHO ioeventfd is broken here and I don't wish to emulate its behavior. > > So fix ioeventfd first. Making eoifd and ioeventfd behave differently does not > > make sense; they are very similar. > One at a time. eoifd and ioeventfd have different requirements. > ioeventfd is just wasting memory, eoifd can potentially exhaust irq > source IDs. Besides, you still defend ioeventfd as correct, same as eoifd. 
> > > > > + > > > > > +static void eoifd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh, > > > > > + poll_table *pt) > > > > > +{ > > > > > + struct _eoifd *eoifd = container_of(pt, struct _eoifd, pt); > > > > > + add_wait_queue(wqh, &eoifd->wait); > > > > > +} > > > > > + > > > > > +/* > > > > > + * This function is called as the kvm VM fd is being released. Shutdown all > > > > > + * eoifds that still remain open > > > > > + */ > > > > > +void kvm_eoifd_release(struct kvm *kvm) > > > > > +{ > > > > > + struct _eoifd *tmp, *eoifd; > > > > > + > > > > > + spin_lock_irq(&kvm->eoifds.lock); > > > > > + > > > > > + list_for_each_entry_safe(eoifd, tmp, &kvm->eoifds.items, list) > > > > > + eoifd_deactivate(eoifd); > > > > > + > > > > > + spin_unlock_irq(&kvm->eoifds.lock); > > > > > + > > > > > + flush_workqueue(irqfd_cleanup_wq); > > > > > +} > > > > > + > > > > > +static void eoifd_event(struct kvm_irq_ack_notifier *notifier) > > > > > +{ > > > > > + struct _eoifd *eoifd; > > > > > + > > > > > + eoifd = container_of(notifier, struct _eoifd, notifier); > > > > > + > > > > > + if (unlikely(!eoifd->source)) > > > > > + return; > > > > > + > > > > > + /* > > > > > + * De-assert and send EOI, user needs to re-assert if > > > > > + * device still requires service. > > > > > + */ > > > > I'm not sure why you dropped filtering by source id. > > > > This means userspace gets events even if it did not send an interrupt. > > > > So: > > > > 1. It should be documented that you can get spurious events. > > > > 2. When an interrupt is shared with an emulated device, > > > > and said device uses EOI, this will not > > > > perform well, as we will wake up userspace on each EOI. > > > > 3. Just sharing an interrupt with virtio means we are polling the > > > > assigned device on each virtio interrupt. > > > Didn't we just agree after v5 that filtering requires a spinlock > > > around calling kvm_irq_set? This is already the case with your patchset. 
To avoid this, I am working on caching for interrupts; when ready, you should probably rebase on top of that. > or at least a new interface to setting irqs > > > that allows us to see the current assertion state, and that neither of > > > those seems to be worth the effort for level irqs? That's why I dropped > > > it. Interrupts always have to support spurious events. The comment > > > immediately above indicates this. Legacy interrupts, especially shared > > > legacy interrupts, should not be our primary performance path. VFIO has > > > a very efficient path for handling spurious EOIs. > > But it will not help that vfio does this efficiently if userspace > > is woken up. You need to make it efficient for userspace consumers. > > Otherwise it's a vfio-specific interface. > Does this affect the design of this interface or is this a potential > future optimization? > Not the interface, the implementation. We just need to make it fast for all users, not just in-kernel ones. -- MST -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html