On Mon, Jun 22, 2009 at 01:31:29PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Mon, Jun 22, 2009 at 12:05:57PM -0400, Gregory Haskins wrote:
> >
> >> This patch fixes all known races in irqfd, and paves the way to restore
> >> DEASSIGN support.  For details of the eventfd races, please see the patch
> >> presumably commited immediately prior to this one.
> >>
> >> In a nutshell, we use eventfd_kref_get/put() to properly manage the
> >> lifetime of the underlying eventfd.  We also use careful coordination
> >> with our workqueue to ensure that all irqfd objects have terminated
> >> before we allow kvm to shutdown.  The logic used for shutdown walks
> >> all open irqfds and releases them.  This logic can be generalized in
> >> the future to allow a subset of irqfds to be released, thus allowing
> >> DEASSIGN support.
> >>
> >> Signed-off-by: Gregory Haskins <ghaskins@xxxxxxxxxx>
> >>
> >
> > I think this patch is a shade too tricky. Some explanation why below.
> >
> > But I think irqfd_pop is a good idea.
> >
>
> Yeah, next we can add something like "irqfd_remove(gsi)" in a similar
> way to do DEASSIGN.
>
> > Here's an alternative design sketch: add a list of irqfds to be shutdown
> > in kvm, and create a single-threaded workqueue. To kill an irqfd, move
> > it from list of live irqfds to list of dead irqfds, then schedule work
> > on a workqueue that walks this list and kills irqfds.
> >
>
> Yeah, I actually thought of that too, and I think that will work.  But
> then I realized flush_schedule_work does the same thing and its much
> less code.  Perhaps it is also much less clear, too ;)  At the very
> least, you have made me realize I need to comment better.

Not really, it's impossible to document all races one has thought about
and avoided.

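To make the alternative design quoted above a bit more concrete, here is a
minimal userspace model of the live-list/dead-list idea. It is only a sketch:
every name in it (struct node, struct model, model_kill, model_worker) is
hypothetical and none of it comes from the patch. Locking and the actual
workqueue are left out; in the kernel the list moves would happen under a
lock and model_worker() would be the body of a work item queued on a
single-threaded workqueue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an irqfd; only the gsi is modelled. */
struct node {
	struct node *next;
	int gsi;
};

struct model {
	struct node *live;	/* irqfds still delivering interrupts */
	struct node *dead;	/* irqfds waiting to be torn down */
	int freed;		/* number of irqfds the worker has released */
};

/* Register a new irqfd on the live list. */
static void model_add(struct model *m, int gsi)
{
	struct node *n = malloc(sizeof(*n));

	assert(n != NULL);
	n->gsi = gsi;
	n->next = m->live;
	m->live = n;
}

/*
 * "Kill" an irqfd: unlink it from the live list and push it onto the
 * dead list.  Returns 1 if the entry was found, 0 otherwise, so only
 * one caller can ever hand a given entry to the worker.
 */
static int model_kill(struct model *m, int gsi)
{
	struct node **pp;

	for (pp = &m->live; *pp; pp = &(*pp)->next) {
		if ((*pp)->gsi == gsi) {
			struct node *n = *pp;

			*pp = n->next;
			n->next = m->dead;
			m->dead = n;
			return 1;
		}
	}
	return 0;
}

/* The single worker: drain the dead list and free every entry. */
static void model_worker(struct model *m)
{
	while (m->dead) {
		struct node *n = m->dead;

		m->dead = n->next;
		free(n);
		m->freed++;
	}
}

/* Count the entries on a list. */
static int model_count(const struct node *n)
{
	int c = 0;

	for (; n; n = n->next)
		c++;
	return c;
}
```

The point of the split is that membership of the live list decides who owns
the teardown: whoever moves the entry to the dead list wins, and everything
else is done by the one worker, in order.
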
> >
> >> ---
> >>
> >>  virt/kvm/eventfd.c |  144 ++++++++++++++++++++++++++++++++++++++++------------
> >>  1 files changed, 110 insertions(+), 34 deletions(-)
> >>
> >> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> >> index 9656027..67985cd 100644
> >> --- a/virt/kvm/eventfd.c
> >> +++ b/virt/kvm/eventfd.c
> >> @@ -28,6 +28,7 @@
> >>  #include <linux/file.h>
> >>  #include <linux/list.h>
> >>  #include <linux/eventfd.h>
> >> +#include <linux/kref.h>
> >>
> >>  /*
> >>   * --------------------------------------------------------------------
> >> @@ -36,26 +37,68 @@
> >>   * Credit goes to Avi Kivity for the original idea.
> >>   * --------------------------------------------------------------------
> >>   */
> >> +
> >> +enum {
> >> +	irqfd_flags_shutdown,
> >> +};
> >> +
> >>  struct _irqfd {
> >>  	struct kvm               *kvm;
> >> +	struct kref              *eventfd;
> >>
> >
> >
> > Yay, kref.
> >
> >
> >>  	int                       gsi;
> >>  	struct list_head          list;
> >>  	poll_table                pt;
> >>  	wait_queue_head_t        *wqh;
> >>  	wait_queue_t              wait;
> >> -	struct work_struct        inject;
> >> +	struct work_struct        work;
> >> +	unsigned long             flags;
> >>
> >
> > Just make it "int shutdown"?
> >
>
> Yep, that is probably fine but we will have to use an explicit wmb in
> lieu of a set_bit operation.  NBD.
>
> >
> >>  };
> >>
> >>  static void
> >> -irqfd_inject(struct work_struct *work)
> >> +irqfd_release(struct _irqfd *irqfd)
> >> +{
> >> +	eventfd_kref_put(irqfd->eventfd);
> >> +	kfree(irqfd);
> >> +}
> >> +
> >> +static void
> >> +irqfd_work(struct work_struct *work)
> >>  {
> >> -	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> >> +	struct _irqfd *irqfd = container_of(work, struct _irqfd, work);
> >>  	struct kvm *kvm = irqfd->kvm;
> >>
> >> -	mutex_lock(&kvm->irq_lock);
> >> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
> >> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> >> -	mutex_unlock(&kvm->irq_lock);
> >> +	if (!test_bit(irqfd_flags_shutdown, &irqfd->flags)) {
> >>
> >
> > Why is it safe to test this bit outside of any lock?
> >
> Because the ordering is guaranteed to set_bit(), schedule_work().  All
> we need to do is make sure that the work-queue runs at least one more
> time after the flag has been set.  (Of course, I could have screwed up
> too, but that was my rationale).
>
> >
> >
> >> +		/* Inject an interrupt */
> >> +		mutex_lock(&kvm->irq_lock);
> >> +		kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
> >> +		kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> >> +		mutex_unlock(&kvm->irq_lock);
> >> +	} else {
> >>
> >
> >
> > Not much shared code here - create a separate showdown work struct?
> > They are cheap ...
> >
>
> We can't because we need to ensure that all inject-jobs complete before
> release-jobs.  Reading the work-queue code, it would be a deadlock for
> the release-job to do a flush_work(inject-job).  Therefore, both
> workloads are encapsulated into a single job, and we ensure that the job
> is launched at least one more time after the flag has been set.

AFAIK schedule_work does not give you in-order guarantees - it's
multithreaded. You will have to create a single-threaded workqueue if
you want in-order execution.

> Of course, now that I wrote that, I realize it was clear-as-mud in the
> code and needs some commenting ;)
>
> >
> >> +		/* shutdown the irqfd */
> >> +		struct _irqfd *_irqfd = NULL;
> >> +
> >> +		mutex_lock(&kvm->lock);
> >> +
> >> +		if (!list_empty(&irqfd->list))
> >> +			_irqfd = irqfd;
> >> +
> >> +		if (_irqfd)
> >> +			list_del(&_irqfd->list);
> >> +
> >> +		mutex_unlock(&kvm->lock);
> >> +
> >> +		/*
> >> +		 * If the item is not currently on the irqfds list, we know
> >> +		 * we are running concurrently with the KVM side trying to
> >> +		 * remove this item as well.
> >>
> >
> > We do? How? As far as I can see list is only empty after it has been
> > created. Generally, it would be better to either use a flag or use
> > list_empty as an indication of going down, but not both.
> >
>
> I think you are mis-reading that.  list_empty(&irqfd->list) is the
> individual irqfd list-item, not the kvm->irqfds list itself.  This
> conditional is telling us whether the irqfd in question is on or off the
> list (its effectively an irqfd-specific flag), not whether the global
> list is empty.  Again, poor commenting on my part.

Yes, but you do INIT_LIST_HEAD in a single place. Once you add
irqfd->list to a list, it won't be empty until you init it again.

> >
> >> Since the KVM side should be
> >> +		 * holding the reference now, and will block behind a
> >> +		 * flush_work(), lets just let them do the release() for us
> >> +		 */
> >> +		if (!_irqfd)
> >> +			return;
> >> +
> >> +		irqfd_release(_irqfd);
> >> +	}
> >>  }
> >>
> >>  static int
> >> @@ -65,25 +108,20 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> >>  	unsigned long flags = (unsigned long)key;
> >>
> >>  	/*
> >> -	 * Assume we will be called with interrupts disabled
> >> +	 * called with interrupts disabled
> >>  	 */
> >> -	if (flags & POLLIN)
> >> -		/*
> >> -		 * Defer the IRQ injection until later since we need to
> >> -		 * acquire the kvm->lock to do so.
> >> -		 */
> >> -		schedule_work(&irqfd->inject);
> >> -
> >>  	if (flags & POLLHUP) {
> >>  		/*
> >> -		 * for now, just remove ourselves from the list and let
> >> -		 * the rest dangle.  We will fix this up later once
> >> -		 * the races in eventfd are fixed
> >> +		 * ordering is important: shutdown flag must be visible
> >> +		 * before we schedule
> >>  		 */
> >>  		__remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >> -		irqfd->wqh = NULL;
> >> +		set_bit(irqfd_flags_shutdown, &irqfd->flags);
> >>
> >
> > So what happens if a previously scheduled work runs on irqfd
> > and sees this flag?
> My original thought was "thats ok", but now that you mention it I am not
> so sure.  Ill give it some more thought because maybe you are on to
> something.
>
> >  And note that multiple works can run on irqfd
> > in parallel.
> >
>
> They can?  I thought work-queue items were guaranteed to only schedule
> once?  If what you say is true, its broken, I agree, and Ill need to
> revisit.  Let me get back to you.
> >
> >>  	}
> >>
> >> +	if (flags & (POLLHUP | POLLIN))
> >> +		schedule_work(&irqfd->work);
> >> +
> >>  	return 0;
> >>  }
> >>
> >> @@ -102,6 +140,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> >>  {
> >>  	struct _irqfd *irqfd;
> >>  	struct file *file = NULL;
> >> +	struct kref *kref = NULL;
> >>  	int ret;
> >>  	unsigned int events;
> >>
> >> @@ -112,7 +151,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> >>  	irqfd->kvm = kvm;
> >>  	irqfd->gsi = gsi;
> >>  	INIT_LIST_HEAD(&irqfd->list);
> >> -	INIT_WORK(&irqfd->inject, irqfd_inject);
> >> +	INIT_WORK(&irqfd->work, irqfd_work);
> >>
> >>  	file = eventfd_fget(fd);
> >>  	if (IS_ERR(file)) {
> >> @@ -133,11 +172,13 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> >>  	list_add_tail(&irqfd->list, &kvm->irqfds);
> >>  	mutex_unlock(&kvm->lock);
> >>
> >> -	/*
> >> -	 * Check if there was an event already queued
> >> -	 */
> >> -	if (events & POLLIN)
> >> -		schedule_work(&irqfd->inject);
> >> +	kref = eventfd_kref_get(file);
> >> +	if (IS_ERR(file)) {
> >> +		ret = PTR_ERR(file);
> >> +		goto fail;
> >> +	}
> >> +
> >> +	irqfd->eventfd = kref;
> >>
> >>  	/*
> >>  	 * do not drop the file until the irqfd is fully initialized, otherwise
> >> @@ -145,9 +186,18 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> >>  	 */
> >>  	fput(file);
> >>
> >> +	/*
> >> +	 * Check if there was an event already queued
> >> +	 */
> >>
> >
> > This comment seems to confuse more that it clarifies:
> > queued where? eventfd only counts... Just kill the comment?
> >
>
> non-zero values in eventfd are "queued" as a signal.  This test just
> checks if an interrupt was already injected before we registered.

Having understood the code, I see what you mean, but the comment wasn't
helpful and is better left out.

> >> +	if (events & POLLIN)
> >> +		schedule_work(&irqfd->work);
> >> +
> >>  	return 0;
> >>
> >>  fail:
> >> +	if (kref && !IS_ERR(kref))
> >> +		eventfd_kref_put(kref);
> >> +
> >>  	if (file && !IS_ERR(file))
> >>  		fput(file);
> >>
> >
> > let's add a couple more labels and avoid the kref/file check
> > and the initialization above?
> >
>
> I think that just makes it more confusing, personally.  But I will give
> it some thought.
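
For reference, the "couple more labels" suggestion could look roughly like
the userspace sketch below. It is an illustration under stated assumptions,
not the patch: get_file()/put_file() and get_kref()/put_kref() are
hypothetical stand-ins for eventfd_fget()/fput() and
eventfd_kref_get()/eventfd_kref_put(), the counters exist only so the
balance can be checked, and the fail_at parameter is made up to force each
error path.

```c
#include <assert.h>

/* Outstanding references, tracked so a caller can verify they balance. */
static int files_held;
static int krefs_held;

/* Hypothetical stand-in for eventfd_fget(): fails when asked to. */
static int get_file(int fail)
{
	if (fail)
		return -1;
	files_held++;
	return 0;
}

static void put_file(void)
{
	assert(files_held > 0);
	files_held--;
}

/* Hypothetical stand-in for eventfd_kref_get(): fails when asked to. */
static int get_kref(int fail)
{
	if (fail)
		return -1;
	krefs_held++;
	return 0;
}

static void put_kref(void)
{
	assert(krefs_held > 0);
	krefs_held--;
}

/*
 * fail_at: 0 = success, 1 = file lookup fails, 2 = kref lookup fails,
 * 3 = a later setup step fails.  Each label undoes exactly the steps
 * that succeeded before the jump, so no pointer needs to be
 * NULL-initialized or re-checked on the way out.
 */
static int assign(int fail_at)
{
	int ret;

	ret = get_file(fail_at == 1);
	if (ret)
		goto out;

	ret = get_kref(fail_at == 2);
	if (ret)
		goto err_put_file;

	if (fail_at == 3) {
		ret = -1;
		goto err_put_kref;
	}

	/* Success: keep the kref, drop the temporary file reference. */
	put_file();
	return 0;

err_put_kref:
	put_kref();
err_put_file:
	put_file();
out:
	return ret;
}
```

Whether this reads better than a single fail: label with checks is a matter
of taste; the ordered labels simply make the unwind order explicit.
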
>
> >
> >>
> >> @@ -161,21 +211,47 @@ kvm_irqfd_init(struct kvm *kvm)
> >>  	INIT_LIST_HEAD(&kvm->irqfds);
> >>  }
> >>
> >> +static struct _irqfd *
> >> +irqfd_pop(struct kvm *kvm)
> >> +{
> >> +	struct _irqfd *irqfd = NULL;
> >> +
> >> +	mutex_lock(&kvm->lock);
> >> +
> >> +	if (!list_empty(&kvm->irqfds)) {
> >> +		irqfd = list_first_entry(&kvm->irqfds, struct _irqfd, list);
> >> +		list_del(&irqfd->list);
> >> +	}
> >> +
> >> +	mutex_unlock(&kvm->lock);
> >> +
> >> +	return irqfd;
> >> +}
> >> +
> >>  void
> >>  kvm_irqfd_release(struct kvm *kvm)
> >>  {
> >> -	struct _irqfd *irqfd, *tmp;
> >> +	struct _irqfd *irqfd;
> >>
> >> -	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
> >> -		if (irqfd->wqh)
> >> -			remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >> +	while ((irqfd = irqfd_pop(kvm))) {
> >>
> >> -		flush_work(&irqfd->inject);
> >> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >>
> >> -		mutex_lock(&kvm->lock);
> >> -		list_del(&irqfd->list);
> >> -		mutex_unlock(&kvm->lock);
> >> +		/*
> >> +		 * We guarantee there will be no more notifications after
> >> +		 * the remove_wait_queue returns.  Now lets make sure we
> >> +		 * synchronize behind any outstanding work items before
> >> +		 * releasing the resources
> >> +		 */
> >> +		flush_work(&irqfd->work);
> >>
> >> -		kfree(irqfd);
> >> +		irqfd_release(irqfd);
> >>  	}
> >> +
> >> +	/*
> >> +	 * We need to wait in case there are any outstanding work-items
> >> +	 * in flight that had already removed themselves from the list
> >> +	 * prior to entry to this function
> >> +	 */
> >>
> >
> > Looks scary. Why doesn't the flush above cover all cases?
> >
>
> The path inside the while() is for when KVM wins the race and finds the
> item in the list.  It atomically removes it, and is responsible for
> freeing it in a coordinated way.  In this case, we must block with the
> flush_work() before we can irqfd_release() so that we do not yank the
> memory out from under a running work-item.
>
> The flush_scheduled_work() is for when eventfd wins the race and has
> already removed itself from the list in the "shutdown" path in the
> work-item.  We want to make sure that kvm_irqfd_release() cannot return
> until all work-items have exited to prevent something like the kvm.ko
> module unloading while the work-item is still in flight.
>
> Thanks Michael,
> -Greg
>
> >
> >> +	flush_scheduled_work();
> >>  }
> >>
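
On the list_empty(&irqfd->list) question raised earlier in this thread, the
behaviour can be shown with a small userspace re-implementation of the list
primitives. This is a sketch, not the kernel's <linux/list.h>: the helper
list_unlink() is a made-up name, and NULL is used where the kernel's
list_del() writes its LIST_POISON values. The conclusion is the same either
way: after a plain list_del() the entry does not point back at itself, so
list_empty() on it stays false; only list_del_init() (or a fresh
INIT_LIST_HEAD) makes it read as empty again.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Make an entry point at itself, which is what "empty" means. */
static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Hypothetical helper: splice an entry out of whatever list it is on. */
static void list_unlink(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* Like the kernel's list_del(): the entry is left in a poisoned state. */
static void list_del(struct list_head *e)
{
	list_unlink(e);
	e->next = NULL;	/* the kernel stores LIST_POISON1 here */
	e->prev = NULL;	/* the kernel stores LIST_POISON2 here */
}

/* Like the kernel's list_del_init(): the entry is usable as a flag again. */
static void list_del_init(struct list_head *e)
{
	list_unlink(e);
	init_list_head(e);
}
```

So if list_empty() on the item is to serve as the "already removed" flag,
every removal path would need list_del_init() rather than list_del().
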