On Wed, Jan 06, 2010 at 11:25:40PM -0800, Davide Libenzi wrote:
> On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
>
> > OK. What I think we need is a way to remove ourselves from the eventfd
> > wait queue and clear the counter atomically.
> >
> > We currently do
> > 	remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > where wqh saves the eventfd wait queue head.
>
> You do a remove_wait_queue() from inside a callback wakeup on the same
> wait queue head?

No, not from a callback, in ioctl context.

> > If we do this before the proposed eventfd_read_ctx, we can lose events.
> > If we do this after, we can get spurious events.
> >
> > An unlocked read is one way to fix this.
>
> You posted one line of code and a two lines analysis of the issue. Can you
> be a little bit more verbose and show me more code, so that I can actually
> see what is going on?
>
>
> - Davide

Sure, I was trying to be as brief as possible; here's a detailed
summary.

Description of the system (MSI emulation in KVM):

KVM supports an ioctl to assign/deassign an eventfd file to an
interrupt message in the guest OS.  When this eventfd is signalled, the
interrupt message is sent.  This assignment is done from the qemu
system emulator.  The eventfd is signalled from device emulation in
another thread in userspace, or from the kernel, which talks with the
guest OS through another eventfd and shared memory (the possibility of
an out-of-process device was discussed but has not been implemented
yet).

Note: it's okay to delay messages from a correctness point of view, but
generally this is a latency-sensitive path.  If multiple identical
messages are requested, it's okay to send a single last message, but
missing a message altogether causes deadlocks.  Sending a message when
none were requested might in theory cause crashes; in practice doing
this causes performance degradation.

Another KVM feature is interrupt masking: the guest OS requests that we
stop sending some interrupt message, possibly modifies the mapping, and
then re-enables this message.
This needs to be done without involving the device, which might keep
requesting events: while masked, the message is marked "pending", and
the guest might test the pending status.

We can implement masking in the system emulator in userspace, by using
the assign/deassign ioctls: when a message is masked, we simply
deassign all its eventfds, and when it is unmasked, we assign them
back.

Here's some code to illustrate how this all works.  The assign/deassign
code in the kernel looks like the following.

This is called to unmask an interrupt:

static int kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
{
	struct _irqfd *irqfd, *tmp;
	struct file *file = NULL;
	struct eventfd_ctx *eventfd = NULL;
	int ret;
	unsigned int events;

	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
	...
	file = eventfd_fget(fd);
	if (IS_ERR(file)) {
		ret = PTR_ERR(file);
		goto fail;
	}

	eventfd = eventfd_ctx_fileget(file);
	if (IS_ERR(eventfd)) {
		ret = PTR_ERR(eventfd);
		goto fail;
	}

	irqfd->eventfd = eventfd;

	/*
	 * Install our own custom wake-up handling so we are notified via
	 * a callback whenever someone signals the underlying eventfd
	 */
	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);

	spin_lock_irq(&kvm->irqfds.lock);
	events = file->f_op->poll(file, &irqfd->pt);
	list_add_tail(&irqfd->list, &kvm->irqfds.items);
	spin_unlock_irq(&kvm->irqfds.lock);

A.
	/*
	 * Check if there was an event already pending on the eventfd
	 * before we registered, and trigger it as if we didn't miss it.
	 */
	if (events & POLLIN)
		schedule_work(&irqfd->inject);

	/*
	 * do not drop the file until the irqfd is fully initialized,
	 * otherwise we might race against the POLLHUP
	 */
	fput(file);

	return 0;

fail:
	...
}

This is called to mask an interrupt:

/*
 * shutdown any irqfd's that match fd+gsi
 */
static int kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
{
	struct _irqfd *irqfd, *tmp;
	struct eventfd_ctx *eventfd;

	eventfd = eventfd_ctx_fdget(fd);
	if (IS_ERR(eventfd))
		return PTR_ERR(eventfd);

	spin_lock_irq(&kvm->irqfds.lock);

	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
		if (irqfd->eventfd == eventfd && irqfd->gsi == gsi)
			irqfd_deactivate(irqfd);
	}

	spin_unlock_irq(&kvm->irqfds.lock);
	eventfd_ctx_put(eventfd);

	/*
	 * Block until we know all outstanding shutdown jobs have completed
	 * so that we guarantee there will not be any more interrupts on this
	 * gsi once this deassign function returns.
	 */
	flush_workqueue(irqfd_cleanup_wq);

	return 0;
}

And deactivation deep down does this (from the irqfd_cleanup_wq
workqueue, so this is not under the spinlock):

	/*
	 * Synchronize with the wait-queue and unhook ourselves to prevent
	 * further events.
	 */
B.
	remove_wait_queue(irqfd->wqh, &irqfd->wait);

	...

	/*
	 * It is now safe to release the object's resources
	 */
	eventfd_ctx_put(irqfd->eventfd);
	kfree(irqfd);

The problems (really the same bug) in KVM that I am trying to fix:

1. Because of A above, if an event was requested while the message was
   masked, we will not miss a message.  However, because we never clear
   the counter, we currently get a spurious message each time we
   unmask.  We should clear the counter either each time we deliver a
   message, or at least when the message is masked.

2. We currently always return pending status 0 to guests, but for
   strict spec compliance we must do it correctly and report a pending
   message if one was requested while it was masked.  An obvious way to
   do this is a nonblocking read on the eventfd.  Since this clears the
   counter, we would have to write the value back.  But again, this
   requires that all messages sent before the mask are cleared from the
   counter, otherwise we will always report pending messages.

So, how should I clear the counter?
I started with the idea of clearing it in irqfd_wakeup.  However, as
you point out, this will cause another list scan, so it's messy.  It
currently uses a workqueue to work around a KVM locking issue, but that
issue is fixed, so we will stop doing this soon, and I can not rely on
that.

Thus I came up with the idea to do this on unmask: this is a slow path
and can always be done from workqueue or ioctl context.  So I would add
eventfd_read_ctx in the code at point B:

	eventfd_read_ctx(irqfd->eventfd, &ucnt);
->
	remove_wait_queue(irqfd->wqh, &irqfd->wait);

Now, if the device signals the eventfd at the point marked by ->, the
message will be sent but the counter not cleared, so we will get a
spurious message when we unmask.

Or if I do it the other way:

	remove_wait_queue(irqfd->wqh, &irqfd->wait);
->
	eventfd_read_ctx(irqfd->eventfd, &ucnt);

Now, if the device signals the eventfd at the point marked by ->, the
message will not be sent but the counter will be cleared, so we will
lose a message.

There does not appear to be a way to avoid this race unless we both
remove ourselves from the wait queue and clear the counter under the
wait queue lock, just like eventfd_read currently does.

I hope this clarifies.

--
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
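To make the conclusion concrete: the two operations would have to
happen as one step under eventfd's own wait queue lock, the same lock
eventfd_read() takes when it clears ctx->count.  A hypothetical
kernel-style pseudocode sketch, not a real eventfd API (it would have
to live in fs/eventfd.c, since eventfd_ctx is opaque elsewhere):

```
/*
 * Hypothetical sketch only: remove a waiter and clear the counter as
 * one atomic step under the eventfd wait queue lock, closing both race
 * windows described above.
 */
static void eventfd_clear_and_unhook(struct eventfd_ctx *ctx,
				     wait_queue_t *wait, __u64 *cnt)
{
	unsigned long flags;

	spin_lock_irqsave(&ctx->wqh.lock, flags);
	*cnt = ctx->count;	/* report what was pending... */
	ctx->count = 0;		/* ...and clear it */
	__remove_wait_queue(&ctx->wqh, wait);
	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
}
```

With the lock held, a device write must either complete before this
(and is reflected in *cnt) or after it (and finds no waiter and a zero
counter), so no message is lost and none is spuriously repeated.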