On Wed, Jan 06, 2010 at 11:25:40PM -0800, Davide Libenzi wrote:
> On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
>
> > OK. What I think we need is a way to remove ourselves from the eventfd
> > wait queue and clear the counter atomically.
> >
> > We currently do
> > 	remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > where wqh saves the eventfd wait queue head.
>
> You do a remove_wait_queue() from inside a callback wakeup on the same
> wait queue head?

No, not from a callback, in ioctl context.

> > If we do this before the proposed eventfd_read_ctx, we can lose events.
> > If we do this after, we can get spurious events.
> >
> > An unlocked read is one way to fix this.
>
> You posted one line of code and a two lines analysis of the issue. Can you
> be a little bit more verbose and show me more code, so that I can actually
> see what is going on?
>
>
> - Davide

Sure, I was trying to be as brief as possible; here's a detailed
summary.

Description of the system (MSI emulation in KVM):

KVM supports an ioctl to assign/deassign an eventfd file to an
interrupt message in the guest OS.  When this eventfd is signalled, the
interrupt message is sent.  This assignment is done from the qemu
system emulator.  The eventfd is signalled from device emulation in
another thread in userspace, or from the kernel, which talks with the
guest OS through another eventfd and shared memory (the possibility of
an out-of-process device was discussed but has not been implemented
yet).

Note: it's okay to delay messages from a correctness point of view, but
generally this is a latency-sensitive path.  If multiple identical
messages are requested, it's okay to send a single last message, but
missing a message altogether causes deadlocks.  Sending a message when
none were requested might in theory cause crashes; in practice doing
this causes performance degradation.

Another KVM feature is interrupt masking: the guest OS requests that we
stop sending some interrupt message, possibly modifies the mapping, and
then re-enables this message.
This needs to be done without involving the device, which might keep
requesting events: while masked, the message is marked "pending", and
the guest might test the pending status.

We can implement masking in the system emulator in userspace, by using
the assign/deassign ioctls: when a message is masked, we simply
deassign all its eventfds, and when it is unmasked, we assign them
back.

Here's some code to illustrate how this all works.  The assign/deassign
code in the kernel looks like the following.

This is called to unmask an interrupt:

static int kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
{
	struct _irqfd *irqfd, *tmp;
	struct file *file = NULL;
	struct eventfd_ctx *eventfd = NULL;
	int ret;
	unsigned int events;

	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
	...
	file = eventfd_fget(fd);
	if (IS_ERR(file)) {
		ret = PTR_ERR(file);
		goto fail;
	}

	eventfd = eventfd_ctx_fileget(file);
	if (IS_ERR(eventfd)) {
		ret = PTR_ERR(eventfd);
		goto fail;
	}

	irqfd->eventfd = eventfd;

	/*
	 * Install our own custom wake-up handling so we are notified via
	 * a callback whenever someone signals the underlying eventfd
	 */
	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);

	spin_lock_irq(&kvm->irqfds.lock);
	events = file->f_op->poll(file, &irqfd->pt);
	list_add_tail(&irqfd->list, &kvm->irqfds.items);
	spin_unlock_irq(&kvm->irqfds.lock);

A.
	/*
	 * Check if there was an event already pending on the eventfd
	 * before we registered, and trigger it as if we didn't miss it.
	 */
	if (events & POLLIN)
		schedule_work(&irqfd->inject);

	/*
	 * do not drop the file until the irqfd is fully initialized,
	 * otherwise we might race against the POLLHUP
	 */
	fput(file);

	return 0;

fail:
	...
}

This is called to mask an interrupt:

/*
 * shutdown any irqfd's that match fd+gsi
 */
static int kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
{
	struct _irqfd *irqfd, *tmp;
	struct eventfd_ctx *eventfd;

	eventfd = eventfd_ctx_fdget(fd);
	if (IS_ERR(eventfd))
		return PTR_ERR(eventfd);

	spin_lock_irq(&kvm->irqfds.lock);

	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
		if (irqfd->eventfd == eventfd && irqfd->gsi == gsi)
			irqfd_deactivate(irqfd);
	}

	spin_unlock_irq(&kvm->irqfds.lock);
	eventfd_ctx_put(eventfd);

	/*
	 * Block until we know all outstanding shutdown jobs have completed
	 * so that we guarantee there will not be any more interrupts on this
	 * gsi once this deassign function returns.
	 */
	flush_workqueue(irqfd_cleanup_wq);

	return 0;
}

And deactivation deep down does this (from the irqfd_cleanup_wq
workqueue, so this is not under the spinlock):

	/*
	 * Synchronize with the wait-queue and unhook ourselves to prevent
	 * further events.
	 */
B.
	remove_wait_queue(irqfd->wqh, &irqfd->wait);

	...

	/*
	 * It is now safe to release the object's resources
	 */
	eventfd_ctx_put(irqfd->eventfd);
	kfree(irqfd);

The problems (really the same bug) in KVM that I am trying to fix:

1. Because of A above, if an event was requested while the message was
   masked, we will not miss a message.  However, because we never clear
   the counter, we currently get a spurious message each time we
   unmask.  We should clear the counter either each time we deliver a
   message, or at least when the message is masked.

2. We currently always return pending status 0 to guests, but for
   strict spec compliance we must do it correctly and report a pending
   message if one was requested while it was masked.  An obvious way to
   do this is a nonblocking read on the eventfd.  Since this clears the
   counter, we would have to write the value back.  But again, this
   requires that all messages sent before the mask are cleared from the
   counter, otherwise we will always report pending messages.

So, how should I clear the counter?
I started with the idea of clearing it in irqfd_wakeup.  However, as
you point out, this will cause another list scan, so it's messy.  It
currently uses a workqueue to work around a KVM locking issue, but that
issue is fixed, so we will stop doing this soon, and I can not rely on
that.

Thus I came up with the idea to do this on unmask: this is a slow path
and can always be done from workqueue or ioctl context.  So I would add
eventfd_read_ctx in the code at point B:

	eventfd_read_ctx(irqfd->eventfd, &ucnt);
->
	remove_wait_queue(irqfd->wqh, &irqfd->wait);

Now, if the device signals the eventfd at the point marked by ->, the
message will be sent but the counter not cleared, so we will get a
spurious message when we unmask.

Or if I do it the other way:

	remove_wait_queue(irqfd->wqh, &irqfd->wait);
->
	eventfd_read_ctx(irqfd->eventfd, &ucnt);

Now, if the device signals the eventfd at the point marked by ->, the
message will not be sent but the counter will be cleared, so we will
lose a message.

There does not appear to be a way to avoid this race unless we both
remove ourselves from the wait queue and clear the counter under the
wait queue lock, just like eventfd_read currently does.

I hope this clarifies.

--
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
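To make the conclusion concrete: the two operations would have to
happen as one step under eventfd's own wait queue lock, the same lock
eventfd_read() takes when it clears ctx->count.  A hypothetical
kernel-style pseudocode sketch, not a real eventfd API (it would have
to live in fs/eventfd.c, since eventfd_ctx is opaque elsewhere):

```
/*
 * Hypothetical sketch only: remove a waiter and clear the counter as
 * one atomic step under the eventfd wait queue lock, closing both race
 * windows described above.
 */
static void eventfd_clear_and_unhook(struct eventfd_ctx *ctx,
				     wait_queue_t *wait, __u64 *cnt)
{
	unsigned long flags;

	spin_lock_irqsave(&ctx->wqh.lock, flags);
	*cnt = ctx->count;	/* report what was pending... */
	ctx->count = 0;		/* ...and clear it */
	__remove_wait_queue(&ctx->wqh, wait);
	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
}
```

With the lock held, a device write must either complete before this
(and is reflected in *cnt) or after it (and finds no waiter and a zero
counter), so no message is lost and none is spuriously repeated.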