Re: [PATCH 5/7] userfaultfd: switch to exclusive wakeup for blocking reads

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Tue, 16 Jun 2015 00:19:46 +0200

On Mon, Jun 15, 2015 at 08:19:07AM -1000, Linus Torvalds wrote:
> What if the process doing the polling never doors anything with the end
> result? Maybe it meant to, but it got killed before it could? Are you going
> to leave everybody else blocked, even though there are pending events?

Yes, it would leave the other blocked, how is it different from having
just 1 reader and it gets killed?

If any qemu thread gets killed the thing is going to be noticeable,
there's no fault-tolerance-double-thread for anything. If one wants to
use more threads for fault tolerance of this scenario with
userfaultfd, one just needs to add a feature flag to the
uffdio_api.features to request it and change the behavior to wakeall
but by default if we can do wakeone I think we should.

> The same us try of read() too. What if the reader only reads party of the
> message? The wake didn't wake anybody else, so now people are (again)
> blocked despite there being data.

I totally agree that for a normal read that would be a concern, but
the wakeone only applies to the uffd. I'm not even trying to change
other read methods.

The uffd can't short-read. Lengths not multiple of sizeof(struct
uffd_msg) immediately return -EINVAL. read will return one or more
events, sizeof(struct uffd_msg). Signal interruptions only are
reported if it's about to block and it found nothing.

> So no, exclusive waiting is never "simple". You have to 100% guarantee that
> you will consume all the data that caused the wake event (or perhaps wake
> the next person up if you don't).

I don't see where it goes wrong.

Now if __wake_up_common didn't check the retval of
default_wake_function->try_to_wake_up before decrements and checking
nr_exclusive I would where the problem about the next guy is, but it
does this:

		if (curr->func(curr, mode, wake_flags, key) &&
				(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;

Every new userfault blocking (and at max 1 event to read is generated
for each new blocking userfaults) wakes one more reader, and each
reader is guaranteed to be blocked only if the pending (pending as not
read yet) waitqueue is truly empty. Where does it misbehave?

Yes each reader is required then to handle whatever userfault event it
got from read (or to pass it to another thread before quitting), but
this is a must anyway. This is because after the userfault is read it
is moved from pending fault queue to normal fault queue, so it won't
ever be read again, if it wasn't the case read would infinite loop and
it couldn't block (the same applies to poll, poll blocks after the
pending event has been read).

The testsuite can reproduce the bug fixed in 4/7 in like 3 seconds,
and it's 100% reproducible. And the window for such a bug is really
small: exactly in between list_del(); list_add the two
waitqueue_active must run in the other CPU. So it's hard to imagine if
this had some major issue, the testsuite wouldn't show it. In fact the
load seems to scale more evenly across all uffd threads too without no
apparent downside.

qemu uses just one reader, and it's even using poll, so this is not
needed for the short term production code, and it's totally fine to
defer this patch.

I'm not saying doing wakeone is easy and it's enough to flip a switch
everywhere to get it everywhere, and perhaps there's something wrong
still, I just I don't see where the actual bug is and how it should
work better without this patch but it's certainly fine to drop the
patch anyway (at least for now).

Thanks,
Andrea
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html