On Thu, Jan 07, 2010 at 04:26:24PM -0800, Davide Libenzi wrote:
> On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
> 
> > Sure, I was trying to be as brief as possible; here's a detailed summary.
> > 
> > Description of the system (MSI emulation in KVM):
> > 
> > KVM supports an ioctl to assign/deassign an eventfd file to an interrupt
> > message in the guest OS.  When this eventfd is signalled, the interrupt
> > message is sent.  The assignment is done from the qemu system emulator.
> > 
> > The eventfd is signalled from device emulation, either in another thread
> > in userspace or in the kernel, which talks to the guest OS through
> > another eventfd and shared memory (an out-of-process implementation was
> > discussed but has not been done yet).
> > 
> > Note: delaying messages is okay from a correctness point of view, but
> > this is generally a latency-sensitive path.  If multiple identical
> > messages are requested, it is okay to send only the last one, but
> > missing a message altogether causes deadlocks.  Sending a message when
> > none was requested might in theory cause crashes; in practice it causes
> > performance degradation.
> > 
> > Another KVM feature is interrupt masking: the guest OS requests that we
> > stop sending some interrupt message, possibly modifies the mapping, and
> > then re-enables the message.  This needs to be done without involving
> > the device, which might keep requesting events: while masked, the
> > message is marked "pending", and the guest might test the pending
> > status.
> > 
> > We can implement masking in the system emulator in userspace by using
> > the assign/deassign ioctls: when a message is masked, we simply deassign
> > all eventfds, and when it is unmasked, we assign them back.
> > 
> > Here's some code to illustrate how this all works.  The assign/deassign
> > code in the kernel looks like the following.
> > 
> > This is called to unmask an interrupt:
> > 
> > static int
> > kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> > {
> > 	struct _irqfd *irqfd, *tmp;
> > 	struct file *file = NULL;
> > 	struct eventfd_ctx *eventfd = NULL;
> > 	int ret;
> > 	unsigned int events;
> > 
> > 	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> > 
> > 	...
> > 
> > 	file = eventfd_fget(fd);
> > 	if (IS_ERR(file)) {
> > 		ret = PTR_ERR(file);
> > 		goto fail;
> > 	}
> > 
> > 	eventfd = eventfd_ctx_fileget(file);
> > 	if (IS_ERR(eventfd)) {
> > 		ret = PTR_ERR(eventfd);
> > 		goto fail;
> > 	}
> > 
> > 	irqfd->eventfd = eventfd;
> > 
> > 	/*
> > 	 * Install our own custom wake-up handling so we are notified via
> > 	 * a callback whenever someone signals the underlying eventfd
> > 	 */
> > 	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
> > 	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
> > 
> > 	spin_lock_irq(&kvm->irqfds.lock);
> > 
> > 	events = file->f_op->poll(file, &irqfd->pt);
> > 
> > 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
> > 	spin_unlock_irq(&kvm->irqfds.lock);
> > 
> > A.
> > 	/*
> > 	 * Check if there was an event already pending on the eventfd
> > 	 * before we registered, and trigger it as if we didn't miss it.
> > 	 */
> > 	if (events & POLLIN)
> > 		schedule_work(&irqfd->inject);
> > 
> > 	/*
> > 	 * do not drop the file until the irqfd is fully initialized,
> > 	 * otherwise we might race against the POLLHUP
> > 	 */
> > 	fput(file);
> > 
> > 	return 0;
> > 
> > fail:
> > 	...
> > }
> 
> What if you do (under proper irqfd locking) something like:
> 
> 	eventfd_ctx_read(ctx, 1, &cnt);
> 	if (irqfd->cnt != cnt) {
> 		irqfd->cnt = cnt;
> 		schedule_work(&irqfd->inject);
> 	}
> 
> > And deactivation deep down does this (from the irqfd_cleanup_wq
> > workqueue, so this is not under the spinlock):
> > 
> > 	/*
> > 	 * Synchronize with the wait-queue and unhook ourselves to
> > 	 * prevent further events.
> > 	 */
> > B.
> > 	remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > 
> > 	....
> > 
> > 	/*
> > 	 * It is now safe to release the object's resources
> > 	 */
> > 	eventfd_ctx_put(irqfd->eventfd);
> > 	kfree(irqfd);
> 
> And:
> 
> 	eventfd_ctx_read(ctx, 1, &irqfd->cnt);
->
> 	remove_wait_queue(irqfd->wqh, &irqfd->wait);
> 
> - Davide

Yes, this is exactly what I wanted to do.  So, here's the issue: if an
event is signalled at the point marked "->" above, i.e. after
eventfd_ctx_read but before remove_wait_queue, then we inject the
interrupt, but the counter is left non-zero, and when we later unmask we
inject another, spurious interrupt.

This is why I wanted to have eventfd_ctx_read not take the wait queue
head lock itself: then I could do:

	spin_lock_irqsave(&ctx->wqh.lock, flags);
	eventfd_ctx_read(ctx, 1, &irqfd->cnt);
	__remove_wait_queue(irqfd->wqh, &irqfd->wait);
	spin_unlock_irqrestore(&ctx->wqh.lock, flags);

-- 
MST
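
P.S. For illustration only, here is a rough sketch of what a combined
"read and unhook" helper could look like.  The name
eventfd_ctx_read_and_unhook is made up, it would have to live in
fs/eventfd.c (struct eventfd_ctx is private to that file), and it
ignores the EFD_SEMAPHORE case to keep the sketch short.

	/*
	 * Hypothetical helper, for illustration only: consume the pending
	 * count and unhook a waiter while ctx->wqh.lock is held the whole
	 * time, so no eventfd_signal() can slip in between the read and
	 * the dequeue.
	 */
	static void eventfd_ctx_read_and_unhook(struct eventfd_ctx *ctx,
						wait_queue_t *wait, __u64 *cnt)
	{
		unsigned long flags;

		spin_lock_irqsave(&ctx->wqh.lock, flags);
		/* non-blocking read semantics: take everything pending */
		*cnt = ctx->count;
		ctx->count = 0;
		/* let blocked writers make progress, as eventfd_ctx_read does */
		if (waitqueue_active(&ctx->wqh))
			wake_up_locked_poll(&ctx->wqh, POLLOUT);
		/* the waiter is gone before the lock drops: no race window */
		__remove_wait_queue(&ctx->wqh, wait);
		spin_unlock_irqrestore(&ctx->wqh.lock, flags);
	}

The deactivation path could then replace the read/remove pair above with
a single call, e.g. eventfd_ctx_read_and_unhook(irqfd->eventfd,
&irqfd->wait, &irqfd->cnt), leaving no window for a signal to land
between the two steps.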
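
P.P.S. For the mask/unmask-by-deassign scheme described at the top of
this mail, the userspace (qemu) side could look roughly like the sketch
below.  It assumes the KVM_IRQFD ioctl and struct kvm_irqfd from
<linux/kvm.h>; the helper names are made up and error handling is
trimmed.

	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* assign or deassign one eventfd to a guest interrupt (gsi) */
	static int irqfd_set(int vmfd, int efd, int gsi, int deassign)
	{
		struct kvm_irqfd data;

		memset(&data, 0, sizeof(data));
		data.fd = efd;
		data.gsi = gsi;
		data.flags = deassign ? KVM_IRQFD_FLAG_DEASSIGN : 0;
		return ioctl(vmfd, KVM_IRQFD, &data);
	}

	/* mask: stop delivery without telling the device to stop signalling */
	static int msi_mask(int vmfd, int efd, int gsi)
	{
		return irqfd_set(vmfd, efd, gsi, 1);
	}

	/* unmask: hook the eventfd back up; a pending event gets replayed
	 * by the POLLIN check at point A. in kvm_irqfd_assign above */
	static int msi_unmask(int vmfd, int efd, int gsi)
	{
		return irqfd_set(vmfd, efd, gsi, 0);
	}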