On Sun, Aug 27, 2017 at 2:40 PM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > The race goes like this: > > thread1 thread2 thread3 > ---- ---- ---- > > .. CPU1 ... > __lock_page_killable > wait_on_page_bit_common() > get wq lock > __add_wait_queue_entry_tail_exclusive() > set_current_state(TASK_KILLABLE); > release wq lock > io_schedule > > ... CPU2 ... > __lock_page[_killable]() > wait_on_page_bit_common() > get wq lock > __add_wait_queue_entry_tail_exclusive() > set_current_state(TASK_KILLABLE); > release wq lock > io_schedule > > ... CPU3 ... > unlock_page() > wake_up_page_bit(page, PG_Locked) > wakes up CPU1 _only_ > > ... lethal signal for thread1 happens ... > if (unlikely(signal_pending_state(state, current))) { > ret = -EINTR; > break; > } With the race meaning that thread2 never gets woken up due to the exclusive wakeup being caught by thread1 (which doesn't actually take the lock). I think that this bug was introduced by commit 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit"), which changed the page lock from using the wait_on_bit_lock() code to its own _slightly_ different version. Because it looks like _almost_ the same thing existed in the old wait_on_bit_lock() code - and that is still used by a couple of filesystems. *Most* of the users seem to use TASK_UNINTERRUPTIBLE, which is fine. But cifs and the sunrpc XPRT_LOCKED code both use the TASK_KILLABLE form that would seem to have the exact same issue: wait_on_bit_lock() uses exclusive wait-queues, but then may return with an error without actually taking the lock. Now, it turns out that I think the wait_on_bit_lock() code is actually safe, for a subtle reason. Why? Unlike the page lock code, the wait_on_bit_lock() code always tries to get the lock bit before returning an error. So wait_on_bit_lock() will prefer a successful lock taking over EINTR, which means that if the bit really was unlocked, it would have been taken. And if something else locked the bit again under us and we didn't get it, that "something else" presumably will then wake things up when it unlocks. So the wait_on_bit_lock() code could _also_ lose the wakeup event, but it would only lose it in situations where somebody else would then re-send it. Do people agree with that analysis? So I think the old wait_on_bit_lock() code ends up being safe - despite having this same pattern of "exclusive wait but might error out without taking the lock". Whether that "wait_on_bit_lock() is safe" was just a fluke or was because people thought about it is unclear. It's old code. People probably *did* think about it. I really can't remember. But it does point to a fix for the page lock case: just move the signal_pending_state() test to after the bit checking. So the page locking code is racy because you could have this: - another cpu does the unlock_page() and wakes up the process (and uses the exclusive event) - we then get a lethal signal before we get toi the "signal_pending_state()" state. - we end up prioritizing the lethal signal, because obviously we don't care about locking the page any more. - so now the lock bit may be still clear and there's nobody who is going to wake up the remaining waiter Moving the signal_pending_state() down actually fixes the race, because we know that in order for the exclusive thing to have mattered, it *has* to actually wake us up. So the unlock_page() must happen before the lethal signal (where before is well-defined because of that "try_to_wake_up()" taking a lock and looking at the task state). The exclusive accounting is only done if the process is actually woken up, not if it was already running (see "try_to_wake_up()"). And if the unlock_page() happened before the lethal signal, then we know that test_and_set_bit_lock() will either work (in which case we're ok), or another locker successfully came in later - in which case we're _also_ ok, because that other locker will then do the unlock again, and wake up subsequent waiters that might have been blocked by our first exclusive waiter. So I propose that the fix might be as simple as this: diff --git a/mm/filemap.c b/mm/filemap.c index baba290c276b..0b41c8cbeabc 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -986,10 +986,6 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, if (likely(test_bit(bit_nr, &page->flags))) { io_schedule(); - if (unlikely(signal_pending_state(state, current))) { - ret = -EINTR; - break; - } } if (lock) { @@ -999,6 +995,11 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, if (!test_bit(bit_nr, &page->flags)) break; } + + if (unlikely(signal_pending_state(state, current))) { + ret = -EINTR; + break; + } } finish_wait(q, wait); but maybe I'm missing something. Nick, comments? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>