On Thu, Jan 07, 2010 at 04:26:24PM -0800, Davide Libenzi wrote:
> On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
> 
> > Sure, I was trying to be as brief as possible; here's a detailed summary.
> > 
> > Description of the system (MSI emulation in KVM):
> > 
> > KVM supports an ioctl to assign/deassign an eventfd file to an interrupt
> > message in the guest OS.  When this eventfd is signalled, the interrupt
> > message is sent.  The assignment is done from the qemu system emulator.
> > 
> > The eventfd is signalled from device emulation, either in another thread
> > in userspace or in the kernel, which talks to the guest OS through
> > another eventfd and shared memory (an out-of-process implementation was
> > discussed but has not been done yet).
> > 
> > Note: delaying messages is okay from a correctness point of view, but
> > this is generally a latency-sensitive path.  If multiple identical
> > messages are requested, it is okay to send only the last one, but
> > missing a message altogether causes deadlocks.  Sending a message when
> > none was requested might in theory cause crashes; in practice it causes
> > performance degradation.
> > 
> > Another KVM feature is interrupt masking: the guest OS requests that we
> > stop sending some interrupt message, possibly modifies the mapping, and
> > then re-enables the message.  This needs to be done without involving
> > the device, which might keep requesting events: while masked, the
> > message is marked "pending", and the guest might test the pending
> > status.
> > 
> > We can implement masking in the system emulator in userspace by using
> > the assign/deassign ioctls: when a message is masked, we simply deassign
> > all eventfds, and when it is unmasked, we assign them back.
> > 
> > Here's some code to illustrate how this all works.  The assign/deassign
> > code in the kernel looks like the following.
> > 
> > This is called to unmask an interrupt:
> > 
> > static int
> > kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> > {
> > 	struct _irqfd *irqfd, *tmp;
> > 	struct file *file = NULL;
> > 	struct eventfd_ctx *eventfd = NULL;
> > 	int ret;
> > 	unsigned int events;
> > 
> > 	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> > 
> > 	...
> > 
> > 	file = eventfd_fget(fd);
> > 	if (IS_ERR(file)) {
> > 		ret = PTR_ERR(file);
> > 		goto fail;
> > 	}
> > 
> > 	eventfd = eventfd_ctx_fileget(file);
> > 	if (IS_ERR(eventfd)) {
> > 		ret = PTR_ERR(eventfd);
> > 		goto fail;
> > 	}
> > 
> > 	irqfd->eventfd = eventfd;
> > 
> > 	/*
> > 	 * Install our own custom wake-up handling so we are notified via
> > 	 * a callback whenever someone signals the underlying eventfd
> > 	 */
> > 	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
> > 	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
> > 
> > 	spin_lock_irq(&kvm->irqfds.lock);
> > 
> > 	events = file->f_op->poll(file, &irqfd->pt);
> > 
> > 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
> > 	spin_unlock_irq(&kvm->irqfds.lock);
> > 
> > A.
> > 	/*
> > 	 * Check if there was an event already pending on the eventfd
> > 	 * before we registered, and trigger it as if we didn't miss it.
> > 	 */
> > 	if (events & POLLIN)
> > 		schedule_work(&irqfd->inject);
> > 
> > 	/*
> > 	 * do not drop the file until the irqfd is fully initialized,
> > 	 * otherwise we might race against the POLLHUP
> > 	 */
> > 	fput(file);
> > 
> > 	return 0;
> > 
> > fail:
> > 	...
> > }
> 
> What if you do (under proper irqfd locking) something like:
> 
> 	eventfd_ctx_read(ctx, 1, &cnt);
> 	if (irqfd->cnt != cnt) {
> 		irqfd->cnt = cnt;
> 		schedule_work(&irqfd->inject);
> 	}
> 
> > And deactivation deep down does this (from the irqfd_cleanup_wq
> > workqueue, so this is not under the spinlock):
> > 
> > 	/*
> > 	 * Synchronize with the wait-queue and unhook ourselves to
> > 	 * prevent further events.
> > 	 */
> > B.
> > 	remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > 
> > 	....
> > 
> > 	/*
> > 	 * It is now safe to release the object's resources
> > 	 */
> > 	eventfd_ctx_put(irqfd->eventfd);
> > 	kfree(irqfd);
> 
> And:
> 
> 	eventfd_ctx_read(ctx, 1, &irqfd->cnt);
->
> 	remove_wait_queue(irqfd->wqh, &irqfd->wait);
> 
> - Davide

Yes, this is exactly what I wanted to do.  So, here's the issue: if an
event is signalled at the point marked "->" above, i.e. after
eventfd_ctx_read but before remove_wait_queue, then we inject the
interrupt, but the counter is left non-zero, and when we later unmask we
inject another, spurious interrupt.

This is why I wanted to have eventfd_ctx_read not take the wait queue
head lock itself: then I could do:

	spin_lock_irqsave(&ctx->wqh.lock, flags);
	eventfd_ctx_read(ctx, 1, &irqfd->cnt);
	__remove_wait_queue(irqfd->wqh, &irqfd->wait);
	spin_unlock_irqrestore(&ctx->wqh.lock, flags);

-- 
MST
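
P.S. For illustration only, here is a rough sketch of what a combined
"read and unhook" helper could look like.  The name
eventfd_ctx_read_and_unhook is made up, it would have to live in
fs/eventfd.c (struct eventfd_ctx is private to that file), and it
ignores the EFD_SEMAPHORE case to keep the sketch short.

	/*
	 * Hypothetical helper, for illustration only: consume the pending
	 * count and unhook a waiter while ctx->wqh.lock is held the whole
	 * time, so no eventfd_signal() can slip in between the read and
	 * the dequeue.
	 */
	static void eventfd_ctx_read_and_unhook(struct eventfd_ctx *ctx,
						wait_queue_t *wait, __u64 *cnt)
	{
		unsigned long flags;

		spin_lock_irqsave(&ctx->wqh.lock, flags);
		/* non-blocking read semantics: take everything pending */
		*cnt = ctx->count;
		ctx->count = 0;
		/* let blocked writers make progress, as eventfd_ctx_read does */
		if (waitqueue_active(&ctx->wqh))
			wake_up_locked_poll(&ctx->wqh, POLLOUT);
		/* the waiter is gone before the lock drops: no race window */
		__remove_wait_queue(&ctx->wqh, wait);
		spin_unlock_irqrestore(&ctx->wqh.lock, flags);
	}

The deactivation path could then replace the read/remove pair above with
a single call, e.g. eventfd_ctx_read_and_unhook(irqfd->eventfd,
&irqfd->wait, &irqfd->cnt), leaving no window for a signal to land
between the two steps.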
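
P.P.S. For the mask/unmask-by-deassign scheme described at the top of
this mail, the userspace (qemu) side could look roughly like the sketch
below.  It assumes the KVM_IRQFD ioctl and struct kvm_irqfd from
<linux/kvm.h>; the helper names are made up and error handling is
trimmed.

	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* assign or deassign one eventfd to a guest interrupt (gsi) */
	static int irqfd_set(int vmfd, int efd, int gsi, int deassign)
	{
		struct kvm_irqfd data;

		memset(&data, 0, sizeof(data));
		data.fd = efd;
		data.gsi = gsi;
		data.flags = deassign ? KVM_IRQFD_FLAG_DEASSIGN : 0;
		return ioctl(vmfd, KVM_IRQFD, &data);
	}

	/* mask: stop delivery without telling the device to stop signalling */
	static int msi_mask(int vmfd, int efd, int gsi)
	{
		return irqfd_set(vmfd, efd, gsi, 1);
	}

	/* unmask: hook the eventfd back up; a pending event gets replayed
	 * by the POLLIN check at point A. in kvm_irqfd_assign above */
	static int msi_unmask(int vmfd, int efd, int gsi)
	{
		return irqfd_set(vmfd, efd, gsi, 0);
	}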