On Mon, Dec 22, 2014 at 07:16:25PM -0500, Chris Mason wrote:
> The 3.19 merge window brought in a great new warning to catch someone
> calling might_sleep with their state != TASK_RUNNING. The idea was to
> find buggy code locking mutexes after calling prepare_to_wait(), kind
> of like this:

Ben just told me about this issue. IMO, the way the code is structured
now is correct; I would argue the problem is with the way wait_event()
works - the way it has to mess with the global-ish task state when
adding a wait_queue_t to a wait_queue_head (who came up with these
names?).

Bcache's closures don't have this problem; a closure being on a
waitlist has nothing to do with task state. Instead, closures keep a
counter of the number of things they're waiting on. You can add a
closure to a waitlist and then separately, later, do a closure_sync()
to wait for the closure's remaining count to hit 0.

Bcache in fact used to have a closure_wait_event() macro that was
exactly analogous to wait_event() but using a closure - I forget what
it was used for, but at some point bcache stopped using it and it got
deleted.

I just cooked up closure_sync_interruptible_hrtimeout() and the
corresponding wait_event macro, and then converted aio to use it. This
would IMO be a much cleaner solution to the original problem.

The one disadvantage I know of with the current code is that closure
waitlists are singly linked - so they can be lockless, but it means
that to wake up/remove a single closure from a waitlist you have to do
wake_up_all(), which is an obvious disadvantage w.r.t. spurious
wakeups.

If people like this approach though, I'll just make closure waitlists
doubly linked with a lock (which is something I'd been considering
doing anyway).
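To make the pattern concrete, here's a minimal sketch of a
closure-based wait loop against the API in
drivers/md/bcache/closure.h - the waitlist and the condition are made
up for illustration, so treat this as a sketch of the idea, not code
from the patch below:

	struct closure cl;

	closure_init_stack(&cl);

	while (!condition) {
		/*
		 * Puts cl on the waitlist and takes a ref on it. Unlike
		 * prepare_to_wait(), this doesn't touch current->state,
		 * so it's still fine to take mutexes or call
		 * copy_to_user() before we actually sleep.
		 */
		closure_wait(&waitlist, &cl);

		if (condition) {
			/*
			 * Raced with a wakeup; the waitlist is singly
			 * linked, so the only way to get our closure
			 * back off it is to wake the whole list.
			 */
			closure_wake_up(&waitlist);
			closure_sync(&cl);
			break;
		}

		/* Sleep until the ref taken by closure_wait() is dropped */
		closure_sync(&cl);
	}

The recheck after closure_wait() closes the missed-wakeup race that
prepare_to_wait() closes by setting the task state first - and the
closure_wake_up() in that branch is exactly the wake_up_all() cost
mentioned above.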
Here's the patch to the aio code - the rest of the series is in a
branch at:

http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix

Disclaimer: code has only been _lightly_ tested so far; the closure
hrtimer stuff was somewhat nontrivial.

commit c91f0de111da37581709f7d201793a88c6993188
Author: Kent Overstreet <kmo@xxxxxxxxxxxxx>
Date:   Wed Dec 24 17:20:32 2014 -0800

    aio: Convert to closure waitlist for aio ring buffer

    The advantage of closure waitlists is that we don't have to muck with
    the task state before we actually sleep; instead of prepare_to_wait()
    we do closure_wait(), which like prepare_to_wait() adds an object to a
    waitlist, but unlike prepare_to_wait() it's the closure that's doing
    the waiting, not the task.

    This fixes the issue with doing copy_to_user() after modifying the
    task state.

    Change-Id: Ifc75123d5bb620277d1e78dd5102e5d8bead1add

diff --git a/fs/aio.c b/fs/aio.c
index 1b7893ecc2..284c74e624 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -40,6 +40,7 @@
 #include <linux/ramfs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/mount.h>
+#include <linux/closure.h>
 
 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
@@ -136,7 +137,7 @@ struct kioctx {
 
 	struct {
 		struct mutex	ring_lock;
-		wait_queue_head_t wait;
+		struct closure_waitlist wait;
 	} ____cacheline_aligned_in_smp;
 
 	struct {
@@ -689,7 +690,6 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 	/* Protect against page migration throughout kiotx setup by keeping
 	 * the ring_lock mutex held until setup is complete. */
 	mutex_lock(&ctx->ring_lock);
-	init_waitqueue_head(&ctx->wait);
 
 	INIT_LIST_HEAD(&ctx->active_reqs);
 
@@ -772,7 +772,7 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
 	spin_unlock(&mm->ioctx_lock);
 
 	/* percpu_ref_kill() will do the necessary call_rcu() */
-	wake_up_all(&ctx->wait);
+	closure_wake_up(&ctx->wait);
 
 	/*
 	 * It'd be more correct to do this in free_ioctx(), after all
@@ -1121,8 +1121,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 	 */
 	smp_mb();
 
-	if (waitqueue_active(&ctx->wait))
-		wake_up(&ctx->wait);
+	closure_wake_up(&ctx->wait);
 
 	percpu_ref_put(&ctx->reqs);
 }
@@ -1237,26 +1236,15 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
 			return -EFAULT;
 
 		until = timespec_to_ktime(ts);
+
+		if (until.tv64)
+			until = ktime_add(ktime_get(), until);
 	}
 
-	/*
-	 * Note that aio_read_events() is being called as the conditional - i.e.
-	 * we're calling it after prepare_to_wait() has set task state to
-	 * TASK_INTERRUPTIBLE.
-	 *
-	 * But aio_read_events() can block, and if it blocks it's going to flip
-	 * the task state back to TASK_RUNNING.
-	 *
-	 * This should be ok, provided it doesn't flip the state back to
-	 * TASK_RUNNING and return 0 too much - that causes us to spin. That
-	 * will only happen if the mutex_lock() call blocks, and we then find
-	 * the ringbuffer empty. So in practice we should be ok, but it's
-	 * something to be aware of when touching this code.
-	 */
 	if (until.tv64 == 0)
 		aio_read_events(ctx, min_nr, nr, event, &ret);
 	else
-		wait_event_interruptible_hrtimeout(ctx->wait,
+		closure_wait_event_hrtimeout(&ctx->wait,
 				aio_read_events(ctx, min_nr, nr, event, &ret),
 				until);
 
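For anyone who doesn't want to dig through the branch:
closure_wait_event_hrtimeout() is, in spirit, the old
closure_wait_event() with closure_sync() swapped for the new timeout
variant. Roughly this shape - the exact signature of
closure_sync_interruptible_hrtimeout() and the return conventions are
in the branch, so take this as an approximation:

	#define closure_wait_event_hrtimeout(waitlist, condition, until)\
	({								\
		struct closure cl;					\
		long _ret = 0;						\
									\
		closure_init_stack(&cl);				\
									\
		while (!_ret && !(condition)) {				\
			closure_wait(waitlist, &cl);			\
									\
			if (condition) {				\
				/* Raced: wake the whole (singly	\
				 * linked) list to get back off it */	\
				closure_wake_up(waitlist);		\
				closure_sync(&cl);			\
				break;					\
			}						\
									\
			/* Sleep, bounded by the hrtimer; return	\
			 * conventions assumed to match			\
			 * wait_event_interruptible_hrtimeout() */	\
			_ret = closure_sync_interruptible_hrtimeout(	\
					&cl, until);			\
		}							\
		_ret;							\
	})

Note the usage in read_events() above: the condition itself
(aio_read_events()) is free to block, which was the whole problem with
the wait_event_interruptible_hrtimeout() version.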