Re: [tip:locking/core] sched/wait: Fix signal handling in bit wait helpers

Paul Turner <pjt@xxxxxxxxxx> · Fri, 11 Dec 2015 03:30:33 -0800

On Thu, Dec 10, 2015 at 5:09 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Thu, Dec 10, 2015 at 08:30:01AM +1100, NeilBrown wrote:
>> On Wed, Dec 09 2015, Peter Zijlstra wrote:
>>
>> > On Wed, Dec 09, 2015 at 12:06:33PM +1100, NeilBrown wrote:
>> >> On Tue, Dec 08 2015, Peter Zijlstra wrote:
>> >>
>> >> >>
>> >> >
>> >> > *sigh*, so that patch was broken.. the below might fix it, but please
>> >> > someone look at it, I seem to have a less than stellar track record
>> >> > here...
>> >>
>> >> This new change seems to be more intrusive than should be needed.
>> >> Can't we just do:
>> >>
>> >>
>> >>  __sched int bit_wait(struct wait_bit_key *word)
>> >>  {
>> >> +  long state = current->state;
>> >
>> > No, current->state can already be changed by this time.
>>
>> Does that matter?
>> It can only have changed to TASK_RUNNING - right?
>> In that case signal_pending_state() will return 0 and the bit_wait() acts
>> as though the thread was woken up normally (which it was) rather than by
>> a signal (which maybe it was too, but maybe that happened just a tiny
>> bit later).
>>
>> As long as signal delivery doesn't change ->state, we should be safe.
>> We should even be safe testing ->state *after* the call to schedule().
>
> Blergh, all I've managed so far is to confuse myself further. Even
> something like the original (+- the EINTR) should work when we consider
> the looping, even when mixed with an occasional spurious wakeup.
>
>
> int bit_wait(struct wait_bit_key *word)
> {
>         if (signal_pending_state(current->state, current))
>                 return -EINTR;
>         schedule();
>         return 0;
> }
>
>
> This can go wrong against raising a signal thusly:
>
>         prepare_to_wait()
> 1:      if (signal_pending_state(current->state, current))
>                 // false, nothing pending
>         schedule();
>                                 set_tsk_thread_flag(t, TIF_SIGPENDING);
>
>                 <spurious wakeup>
>
>         prepare_to_wait()
>                                 wake_up_state(t, ...);
> 2:      if (signal_pending_state(current->state, current))
>                 // false, TASK_RUNNING
>
>         schedule(); // doesn't block because pending

Note that a quick inspection does not turn up _any_ TASK_INTERRUPTIBLE
callers.  When this occurred previously, it was most likely with a
fatal signal, which would have hidden these sins.

>
>         prepare_to_wait()
> 3:      if (signal_pending_state(current->state, current))
>                 // true, pending
>

Hugh asked me about this after seeing a crash.  Here is another
exciting way in which the current code breaks, and this one is
actually quite serious:

Consider __lock_page:

void __lock_page(struct page *page)
{
        DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
        __wait_on_bit_lock(page_waitqueue(page), &wait, bit_wait_io,
                           TASK_UNINTERRUPTIBLE);
}

With the current state of the world,

 __sched int bit_wait_io(struct wait_bit_key *word)
 {
-       if (signal_pending_state(current->state, current))
-               return 1;
        io_schedule();
+       if (signal_pending(current))
+               return -EINTR;
        return 0;
 }

Called from __wait_on_bit_lock.

Previously, signal_pending_state() was checked under
TASK_UNINTERRUPTIBLE (the state set by prepare_to_wait_exclusive()).
Now, we simply check for the presence of any signal, and only after
we have returned to the running state, e.g. after io_schedule()
returns because somebody has kicked the wait queue.

However, this now means that __wait_on_bit_lock can return -EINTR up to
__lock_page, which does not check the return code and blindly
returns.  This looks to be a pre-existing bug, but it was at least
masked by the fact that it previously required a fatal signal (and
that the page we return unlocked is likely going to be freed as the
dying process exits anyway).

Peter's proposed follow-up above looks strictly more correct.  We need
to check for a pending signal *after* we return from schedule(), but
against the state with which we _entered_ schedule().

Reviewed-by: Paul Turner <pjt@xxxxxxxxxx>

--
To unsubscribe from this list: send the line "unsubscribe linux-tip-commits" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html