Re: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Sun, 27 Aug 2017 14:40:48 -0700

On Sat, Aug 26, 2017 at 11:15 AM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> So how about just this fairly trivial patch?

So I just committed that trivial patch, because I think it's right,
but more importantly because I think I found a real and non-trivial
fundamental problem.

The reason I found it is actually that I was thinking about this
patch, and how the WQ_FLAG_EXCLUSIVE ordering matters.

And I don't really think the WQ_FLAG_EXCLUSIVE ordering matters all
that much, but just *thinking* about it made me realize that the code
is broken.

In particular, this caller:

    int __lock_page_killable(struct page *__page)
    {
        struct page *page = compound_head(__page);
        wait_queue_head_t *q = page_waitqueue(page);
        return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE, true);
    }
    EXPORT_SYMBOL_GPL(__lock_page_killable);

is completely broken crap.

Why?

It's the combination of "TASK_KILLABLE" and "true" that is broken.
Always has been broken, afaik.

The "true" is that "bool lock" argument, and when it is set, we set
the WQ_FLAG_EXCLUSIVE bit.

But that bit - by definition, that's the whole point - means that the
waking side only wakes up *one* waiter.

So there's a race in anybody who uses __lock_page_killable().

The race goes like this:

  thread1       thread2         thread3
  ----          ----            ----

  .. CPU1 ...
  __lock_page_killable
    wait_on_page_bit_common()
      get wq lock
      __add_wait_queue_entry_tail_exclusive()
      set_current_state(TASK_KILLABLE);
      release wq lock
        io_schedule

                ... CPU2 ...
                __lock_page[_killable]()
                  wait_on_page_bit_common()
                    get wq lock
                    __add_wait_queue_entry_tail_exclusive()
                    set_current_state(TASK_KILLABLE);
                    release wq lock
                    io_schedule

                                ... CPU3 ...
                                unlock_page()
                                wake_up_page_bit(page, PG_Locked)
                                  wakes up CPU1 _only_

  ... lethal signal for thread1 happens ...
     if (unlikely(signal_pending_state(state, current))) {
          ret = -EINTR;
          break;
     }

End result: page is unlocked, CPU3 is waiting, nothing will wake CPU3 up.

Of course, if we have multiple threads all locking that page
concurrently, we probably will have *another* thread lock it
(successfully), and then when it unlocks it thread3 does get woken up
eventually.

But the keyword is "eventually". It could be a long while,
particularly if we don't lock the page *all* the time, just
occasionally.

So it might be a while, and it might explain how some waiters might queue up.

And who does __lock_page_killable? Page faults.

And who does a lot of page faults and page locking? That NUMA load from hell.

Does it have lethal signals, though? Probably not. That lethal signal
case really is unusual.

So I'm not saying that this is actually THE BUG. In particular,
despite that race, the page actually *is* unlocked afterwards. It's
just that one of the threads that wanted the lock didn't get notified
of it. So it doesn't really explain how non-locking waiters (ie the
people who don't do migrations, just wait for the migration entry)
would queue up.

But it sure looks like a real bug to me.

Basically, if you ask for anm exclusive wakeup, you *have* to take the
resource you are waiting for. Youl can't just say "never mind, I'll
return -EINTR".

I don't see a simple fix for this yet other than perhaps adding a
wakeup to the "ret = -EINTR; break" case.

Does anybody else see anything? Or maybe see a reason why this
wouldn't be a bug in the first place?

Anyway, I am officially starting to hate that page wait code.  I've
stared at it for days now, and I am not getting more fond of it.

                   Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>