On Mon, Dec 22, 2014 at 07:16:25PM -0500, Chris Mason wrote:
> The 3.19 merge window brought in a great new warning to catch someone
> calling might_sleep with their state != TASK_RUNNING. The idea was to
> find buggy code locking mutexes after calling prepare_to_wait(), kind
> of like this:

Ben just told me about this issue. IMO, the way the code is structured
now is correct; I would argue the problem is with the way wait_event()
works - the way it has to mess with the global-ish task state when
adding a wait_queue_t to a wait_queue_head (who came up with these
names?).

Bcache's closures don't have this problem; a closure being on a
waitlist has nothing to do with task state. Instead, closures keep a
counter of the number of things they're waiting on. You can add a
closure to a waitlist and then separately, later, do a closure_sync()
to wait for the closure's remaining count to hit 0.

Bcache in fact used to have a closure_wait_event() macro that was
exactly analogous to wait_event() but using a closure - I forget what
it was used for, but at some point bcache stopped using it and it got
deleted.

I just cooked up closure_sync_interruptible_hrtimeout() and the
corresponding wait_event macro, and then converted aio to use it. This
would IMO be a much cleaner solution to the original problem.

The one disadvantage I know of with the current code is that closure
waitlists are singly linked - so they can be lockless, but it means
that to wake up/remove a single closure from a waitlist you have to do
wake_up_all(), which is an obvious disadvantage w.r.t. spurious
wakeups.

If people like this approach though, I'll just make closure waitlists
doubly linked with a lock (which is something I'd been considering
doing anyway).
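To make the pattern concrete, here's a minimal sketch of a
closure-based wait loop against the API in
drivers/md/bcache/closure.h - the waitlist and the condition are made
up for illustration, so treat this as a sketch of the idea, not code
from the patch below:

	struct closure cl;

	closure_init_stack(&cl);

	while (!condition) {
		/*
		 * Puts cl on the waitlist and takes a ref on it. Unlike
		 * prepare_to_wait(), this doesn't touch current->state,
		 * so it's still fine to take mutexes or call
		 * copy_to_user() before we actually sleep.
		 */
		closure_wait(&waitlist, &cl);

		if (condition) {
			/*
			 * Raced with a wakeup; the waitlist is singly
			 * linked, so the only way to get our closure
			 * back off it is to wake the whole list.
			 */
			closure_wake_up(&waitlist);
			closure_sync(&cl);
			break;
		}

		/* Sleep until the ref taken by closure_wait() is dropped */
		closure_sync(&cl);
	}

The recheck after closure_wait() closes the missed-wakeup race that
prepare_to_wait() closes by setting the task state first - and the
closure_wake_up() in that branch is exactly the wake_up_all() cost
mentioned above.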
Here's the patch to the aio code - the rest of the series is in a
branch at:

http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix

Disclaimer: code has only been _lightly_ tested so far; the closure
hrtimer stuff was somewhat nontrivial.

commit c91f0de111da37581709f7d201793a88c6993188
Author: Kent Overstreet <kmo@xxxxxxxxxxxxx>
Date:   Wed Dec 24 17:20:32 2014 -0800

    aio: Convert to closure waitlist for aio ring buffer

    The advantage of closure waitlists is that we don't have to muck with
    the task state before we actually sleep; instead of prepare_to_wait()
    we do closure_wait(), which like prepare_to_wait() adds an object to a
    waitlist, but unlike prepare_to_wait() it's the closure that's doing
    the waiting, not the task.

    This fixes the issue with doing copy_to_user() after modifying the
    task state.

    Change-Id: Ifc75123d5bb620277d1e78dd5102e5d8bead1add

diff --git a/fs/aio.c b/fs/aio.c
index 1b7893ecc2..284c74e624 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -40,6 +40,7 @@
 #include <linux/ramfs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/mount.h>
+#include <linux/closure.h>
 
 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
@@ -136,7 +137,7 @@ struct kioctx {
 
 	struct {
 		struct mutex	ring_lock;
-		wait_queue_head_t wait;
+		struct closure_waitlist wait;
 	} ____cacheline_aligned_in_smp;
 
 	struct {
@@ -689,7 +690,6 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 	/* Protect against page migration throughout kiotx setup by keeping
 	 * the ring_lock mutex held until setup is complete. */
 	mutex_lock(&ctx->ring_lock);
-	init_waitqueue_head(&ctx->wait);
 
 	INIT_LIST_HEAD(&ctx->active_reqs);
 
@@ -772,7 +772,7 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
 	spin_unlock(&mm->ioctx_lock);
 
 	/* percpu_ref_kill() will do the necessary call_rcu() */
-	wake_up_all(&ctx->wait);
+	closure_wake_up(&ctx->wait);
 
 	/*
 	 * It'd be more correct to do this in free_ioctx(), after all
@@ -1121,8 +1121,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 	 */
 	smp_mb();
 
-	if (waitqueue_active(&ctx->wait))
-		wake_up(&ctx->wait);
+	closure_wake_up(&ctx->wait);
 
 	percpu_ref_put(&ctx->reqs);
 }
@@ -1237,26 +1236,15 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
 			return -EFAULT;
 
 		until = timespec_to_ktime(ts);
+
+		if (until.tv64)
+			until = ktime_add(ktime_get(), until);
 	}
 
-	/*
-	 * Note that aio_read_events() is being called as the conditional - i.e.
-	 * we're calling it after prepare_to_wait() has set task state to
-	 * TASK_INTERRUPTIBLE.
-	 *
-	 * But aio_read_events() can block, and if it blocks it's going to flip
-	 * the task state back to TASK_RUNNING.
-	 *
-	 * This should be ok, provided it doesn't flip the state back to
-	 * TASK_RUNNING and return 0 too much - that causes us to spin. That
-	 * will only happen if the mutex_lock() call blocks, and we then find
-	 * the ringbuffer empty. So in practice we should be ok, but it's
-	 * something to be aware of when touching this code.
-	 */
 	if (until.tv64 == 0)
 		aio_read_events(ctx, min_nr, nr, event, &ret);
 	else
-		wait_event_interruptible_hrtimeout(ctx->wait,
+		closure_wait_event_hrtimeout(&ctx->wait,
 				aio_read_events(ctx, min_nr, nr, event, &ret),
 				until);
 
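For anyone who doesn't want to dig through the branch:
closure_wait_event_hrtimeout() is, in spirit, the old
closure_wait_event() with closure_sync() swapped for the new timeout
variant. Roughly this shape - the exact signature of
closure_sync_interruptible_hrtimeout() and the return conventions are
in the branch, so take this as an approximation:

	#define closure_wait_event_hrtimeout(waitlist, condition, until)\
	({								\
		struct closure cl;					\
		long _ret = 0;						\
									\
		closure_init_stack(&cl);				\
									\
		while (!_ret && !(condition)) {				\
			closure_wait(waitlist, &cl);			\
									\
			if (condition) {				\
				/* Raced: wake the whole (singly	\
				 * linked) list to get back off it */	\
				closure_wake_up(waitlist);		\
				closure_sync(&cl);			\
				break;					\
			}						\
									\
			/* Sleep, bounded by the hrtimer; return	\
			 * conventions assumed to match			\
			 * wait_event_interruptible_hrtimeout() */	\
			_ret = closure_sync_interruptible_hrtimeout(	\
					&cl, until);			\
		}							\
		_ret;							\
	})

Note the usage in read_events() above: the condition itself
(aio_read_events()) is free to block, which was the whole problem with
the wait_event_interruptible_hrtimeout() version.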