On Wed, 21 Jul 2021 12:59:39 +0200 Paolo Bonzini wrote:
>On 21/07/21 12:11, Hillf Danton wrote:
>> On Wed, 21 Jul 2021 09:25:32 +0200 Thomas Gleixner wrote:
>>> On Wed, Jul 21 2021 at 15:04, Hillf Danton wrote:
>>>>
>>>> But the preempting waker can not make sense without the waiter who is bloody
>>>> special. Why is it so in the first place? Or it is not at all but the race
>>>> existing from Monday to Friday.
>>>
>>> See the large comment in eventfd_poll().
>>
>> Is it likely for a reader to make eventfd_poll() return 0?
>>
>>      read                 *      poll                               write
>>      ----                 *      -----------------                  ------------
>>                           *      count = ctx->count (INVALID!)
>>                           *                                         lock ctx->qwh.lock
>>                           *                                         ctx->count += n
>>                           *                                         **waitqueue_active is false**
>>                           *                                         **no wake_up_locked_poll!**
>>                           *                                         unlock ctx->qwh.lock
>>
>>   lock ctx->qwh.lock
>>   *cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
>>   ctx->count -= *cnt;
>>   **waitqueue_active is false**
>>   unlock ctx->qwh.lock
>>
>>                           *      lock ctx->wqh.lock (in poll_wait)
>>                           *      __add_wait_queue
>>                           *      unlock ctx->wqh.lock
>>                           *      eventfd_poll returns 0
>>       */
>>      count = READ_ONCE(ctx->count);
>>
>
>No, it's simply impossible.  The same comment explains why: "count =
>ctx->count" cannot move above poll_wait's locking of ctx->wqh.lock.

Detect a concurrent reader or writer by reading the event counter before and
after poll_wait(), and take the case of an unstable counter into account when
deciding what poll mask to report.

Cut the big comment, as the added barriers speak for themselves.

+++ x/fs/eventfd.c
@@ -131,49 +131,20 @@ static __poll_t eventfd_poll(struct file
 {
 	struct eventfd_ctx *ctx = file->private_data;
 	__poll_t events = 0;
-	u64 count;
+	u64 c0, count;
+
+	c0 = ctx->count;
+	smp_rmb();
 
 	poll_wait(file, &ctx->wqh, wait);
 
-	/*
-	 * All writes to ctx->count occur within ctx->wqh.lock.  This read
-	 * can be done outside ctx->wqh.lock because we know that poll_wait
-	 * takes that lock (through add_wait_queue) if our caller will sleep.
-	 *
-	 * The read _can_ therefore seep into add_wait_queue's critical
-	 * section, but cannot move above it!  add_wait_queue's spin_lock acts
-	 * as an acquire barrier and ensures that the read be ordered properly
-	 * against the writes.  The following CAN happen and is safe:
-	 *
-	 *     poll                               write
-	 *     -----------------                  ------------
-	 *     lock ctx->wqh.lock (in poll_wait)
-	 *     count = ctx->count
-	 *     __add_wait_queue
-	 *     unlock ctx->wqh.lock
-	 *                                        lock ctx->qwh.lock
-	 *                                        ctx->count += n
-	 *                                        if (waitqueue_active)
-	 *                                          wake_up_locked_poll
-	 *                                        unlock ctx->qwh.lock
-	 *                                        eventfd_poll returns 0
-	 *
-	 * but the following, which would miss a wakeup, cannot happen:
-	 *
-	 *     poll                               write
-	 *     -----------------                  ------------
-	 *     count = ctx->count (INVALID!)
-	 *                                        lock ctx->qwh.lock
-	 *                                        ctx->count += n
-	 *                                        **waitqueue_active is false**
-	 *                                        **no wake_up_locked_poll!**
-	 *                                        unlock ctx->qwh.lock
-	 *                                        lock ctx->wqh.lock (in poll_wait)
-	 *                                        __add_wait_queue
-	 *                                        unlock ctx->wqh.lock
-	 *                                        eventfd_poll returns 0
-	 */
-	count = READ_ONCE(ctx->count);
+	smp_rmb();
+	count = ctx->count;
+
+	if (c0 < count)
+		return EPOLLIN;
+	if (c0 > count)
+		return EPOLLOUT;
 
 	if (count > 0)
 		events |= EPOLLIN;
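
For readability, here is a sketch of how eventfd_poll() would look with the
hunk above applied. The tail below the two new checks is not part of the
change; it is reproduced from my reading of mainline fs/eventfd.c
(EPOLLIN/EPOLLERR/EPOLLOUT reporting) and is only meant to show where the
before/after counter samples fit, not to be authoritative for your tree.

static __poll_t eventfd_poll(struct file *file, poll_table *wait)
{
	struct eventfd_ctx *ctx = file->private_data;
	__poll_t events = 0;
	u64 c0, count;

	/* first sample of the counter, taken before poll_wait() */
	c0 = ctx->count;
	smp_rmb();

	poll_wait(file, &ctx->wqh, wait);

	/* second sample, taken after poll_wait() */
	smp_rmb();
	count = ctx->count;

	if (c0 < count)		/* a writer ran concurrently: readable */
		return EPOLLIN;
	if (c0 > count)		/* a reader ran concurrently: writable */
		return EPOLLOUT;

	/* counter was stable across poll_wait(): report its current state */
	if (count > 0)
		events |= EPOLLIN;
	if (count == ULLONG_MAX)
		events |= EPOLLERR;
	if (ULLONG_MAX - 1 > count)
		events |= EPOLLOUT;

	return events;
}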