Re: [PATCH RFC] fs/aio: fix sleeping while TASK_INTERRUPTIBLE

On Mon, Dec 29, 2014 at 10:08:14AM -0500, Chris Mason wrote:
> 
> 
> On Wed, Dec 24, 2014 at 9:56 PM, Kent Overstreet <kmo@xxxxxxxxxxxxx> wrote:
> >On Mon, Dec 22, 2014 at 07:16:25PM -0500, Chris Mason wrote:
> >> The 3.19 merge window brought in a great new warning to catch someone
> >> calling might_sleep with their state != TASK_RUNNING.  The idea was to
> >> find buggy code locking mutexes after calling prepare_to_wait(), kind
> >> of like this:
> >
> >Ben just told me about this issue.
> >
> >IMO, the way the code is structured now is correct; I would argue the
> >problem is with the way wait_event() works - the way they have to mess with
> >the global-ish task state when adding a wait_queue_t to a wait_queue_head
> >(who came up with these names?)
> 
> Grin, probably related to the guy who made closure_wait() not actually wait.

Hah, touche :)

> The advantage to the wait_queue_head_t setup is it's a very well understood
> mechanism for sleeping on something without missing wakeups. The locking
> overhead for the waitqueues can be a problem for lots of waiters on the same
> queue, but otherwise the overhead is low.

Well, unfortunately I'm not sure it's possible to sanely implement wake_one()
with a lockless singly linked list...

although, maybe we could do it with a lock that's only used for wakeups
(removing entries from the list), not for adding entries to the list... that
would still be nice...

> I think closures are too big a hammer for this problem, unless benchmarks
> show we need the lockless lists (I really like that part).  I do hesitate to
> make big changes here because debugging AIO hangs is horrible.  The code is
> only tested by a few workloads, and we can go a long time before problems
> are noticed.  When people do hit bugs, we only notice the ones where
> applications pile up in getevents.  Otherwise it's just strange performance
> changes that we can't explain because they are hidden in the app's AIO state
> machine.

Eh - OTOH, closures are old code that has been quite solid and has changed very
little over the past several years, and they're used very heavily in bcache (and
some other out of tree code). The company I'm at now also has a pretty extensive
nightly test suite for bcache that's by extension testing closures too.

And IMO doing this with closures makes the AIO code much cleaner and saner -
reading this version, I'm _much_ more confident the AIO code is correct than I
am with the version that has to fiddle with the task state.

> When I first looked at the warning, I didn't realize that might_sleep and
> friends were setting a preempted flag to make sure the task wasn't removed
> from the runqueue.  So I thought we'd potentially sleep forever (thanks
> Peter for details++).  The real risk here is burning CPU in the running
> state, potentially a lot of it if the mutex is highly contended. We've
> probably been hitting this for a while, but since we test AIO performance
> with fast storage, the burning just made us look faster.
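
For reference, the loop we're talking about is shaped roughly like this - a
generic sketch, not the exact fs/aio.c code; check_for_events() and struct
my_ctx are stand-ins for the real conditional, which is what ends up in
mutex_lock()/copy_to_user():

#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/wait.h>

struct my_ctx;					/* stand-in for illustration */
bool check_for_events(struct my_ctx *ctx);	/* may block internally */

static int wait_for_events(wait_queue_head_t *wq, struct my_ctx *ctx)
{
	DEFINE_WAIT(wait);
	int ret = 0;

	while (1) {
		prepare_to_wait(wq, &wait, TASK_INTERRUPTIBLE);

		/*
		 * If this blocks (contended mutex, page fault in
		 * copy_to_user()), we come back TASK_RUNNING and the
		 * TASK_INTERRUPTIBLE above is lost - schedule() below
		 * then returns immediately and we spin rechecking
		 * instead of sleeping.  CPU burn, not a deadlock.
		 */
		if (check_for_events(ctx))
			break;

		if (signal_pending(current)) {
			ret = -EINTR;
			break;
		}

		schedule();
	}
	finish_wait(wq, &wait);
	return ret;
}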

This issue btw did come up in reviews when I first wrote this code; people
definitely saw what was going on at the time (Andrew Morton, among others). The
reasoning was that anything else that goes to sleep is just going to squash our
task state setting before it sleeps, so it wasn't going to deadlock. And if the
code does sleep and set the task state back to TASK_RUNNING - well, since we
just slept we're probably going to want to recheck the conditional again anyway,
which here is pretty cheap.

So I'm still not sure it's a real issue (at least with the mutex_lock() usage;
what Ben was telling me about copy_to_user() sounded more concerning).

OTOH, we recently ran into this same kind of bug in some (out of tree) code of
our own, where the conditional passed to wait_event() was sending RPCs and was
thus _always_ setting the task state back to TASK_RUNNING - so a few weeks ago a
coworker brought back closure_wait_event() for that. That bug also made me a bit
more concerned about this type of issue.

I do also think it's worth having a generic version of wait_event() that anyone
can use with arbitrary conditional expressions, without having to worry about
the safety of things that might muck with the task state - I feel pretty
strongly that it would find good uses outside of just the AIO code. Lately I've
been able to improve a lot of code by restructuring it to use wait_event(), even
when the conditional is another large function - it's a big win to be able to
use standard infrastructure that solves a pretty subtle issue. Just think of it
as a better wait_event() :)
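
To make that concrete (reusing the made-up lw_* helpers from the sketch earlier
in this mail - this is emphatically not the actual closure_wait_event()): the
waiter parks itself on the list before evaluating the conditional, the
conditional always runs in TASK_RUNNING, and the wakeup lands in per-waiter
state rather than depending on whatever our task state happens to be at that
instant:

#define lw_wait_event(h, cond)						\
do {									\
	struct lw_waiter __w;						\
									\
	while (1) {							\
		lw_wait_add(&(h), &__w);				\
		if (cond) {						\
			/* a real version must dequeue __w here */	\
			break;						\
		}							\
									\
		/* lw_wake_one() pops us before setting ->woken, */	\
		/* so re-adding at the top of the loop is safe */	\
		while (!smp_load_acquire(&__w.woken)) {			\
			set_current_state(TASK_UNINTERRUPTIBLE);	\
			if (!smp_load_acquire(&__w.woken))		\
				schedule();				\
			__set_current_state(TASK_RUNNING);		\
		}							\
	}								\
} while (0)

The point being that the conditional can take mutexes, call copy_to_user(),
send RPCs, whatever - none of that can lose a wakeup, because the wakeup isn't
encoded in the task state.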