On Thu, Jul 23, 2020 at 5:47 AM Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
>
> I still can't convince myself that I fully understand this patch but I see
> nothing really wrong after a quick glance...

I guess my comments should be extended further, then. Is there anything
in particular you think is unclear?

> > +		wake_up_state(target, mode);
>
> We can no longer use 'target'. If it was already woken up it can notice
> list_empty_careful(), return without taking q->lock, and exit.

Good point. And yes, I think using WQ_FLAG_WOKEN is the right thing to
do, and I wouldn't call it "abuse". It's exactly what it's for.

That also allows us to just use finish_wait(): since we no longer care
as deeply about the waitlist state, we can simply test WQ_FLAG_WOKEN at
the end instead.

So that actually makes the patch much more straightforward too. I really
disliked my open-coding there. Your suggestion fixes everything.

> do we need SetPageWaiters() if trylock() succeeds ?

We need to set it before the final page flag test, because otherwise we
might miss somebody just about to wake us up: we could see the bit set
while it is being cleared on another CPU, and if PageWaiters isn't set
at that point, that other CPU won't do the wakeup.

So here's a v2, now as a "real" commit with a commit message and
everything. Is there anything in particular you would like clarified, or
anything else you see in it?

Hugh - this should make your "patch 2" redundant. Is there any way to
test that in the environment that triggered it?

This v2 isn't tested, but the core of it is the same, just with the nice
cleanups from Oleg's suggestion, and an added comment about that
SetPageWaiters() ordering.

               Linus
From 71c48f2e50fcd56b6e941b46c0ea7dddd30ad146 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Thu, 23 Jul 2020 10:16:49 -0700
Subject: [PATCH] mm: rewrite wait_on_page_bit_common() logic

It turns out that wait_on_page_bit_common() had several problems,
ranging from just unfair behavior due to re-queueing at the end of the
wait queue on every retry, to an outright bug that could result in
missed wakeups (but probably never happened in practice).

This rewrites the whole logic to avoid both issues, by simply moving the
logic to check (and possibly take) the bit lock into the wakeup path
instead.

That makes everything much more straightforward, and means that we never
need to re-queue the wait entry: if we get woken up, we'll be notified
through WQ_FLAG_WOKEN, and the wait queue entry will have been removed,
and everything will have been done for us.

Link: https://lore.kernel.org/lkml/CAHk-=wjJA2Z3kUFb-5s=6+n0qbTs8ELqKFt9B3pH85a8fGD73w@xxxxxxxxxxxxxx/
Link: https://lore.kernel.org/lkml/alpine.LSU.2.11.2007221359450.1017@eggly.anvils/
Reported-by: Oleg Nesterov <oleg@xxxxxxxxxx>
Reported-by: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
---
 mm/filemap.c | 119 ++++++++++++++++++++++++++++++---------------------
 1 file changed, 71 insertions(+), 48 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 385759c4ce4b..37f642c07106 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1002,6 +1002,7 @@ struct wait_page_queue {
 static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
 {
+	int ret;
 	struct wait_page_key *key = arg;
 	struct wait_page_queue *wait_page
 		= container_of(wait, struct wait_page_queue, wait);
 
@@ -1013,18 +1014,40 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
 	if (wait_page->bit_nr != key->bit_nr)
 		return 0;
 
+	/* Stop walking if it's locked */
+	if (wait->flags & WQ_FLAG_EXCLUSIVE) {
+		if (test_and_set_bit(key->bit_nr, &key->page->flags))
+			return -1;
+	} else {
+		if (test_bit(key->bit_nr, &key->page->flags))
+			return -1;
+	}
+
 	/*
-	 * Stop walking if it's locked.
-	 * Is this safe if put_and_wait_on_page_locked() is in use?
-	 * Yes: the waker must hold a reference to this page, and if PG_locked
-	 * has now already been set by another task, that task must also hold
-	 * a reference to the *same usage* of this page; so there is no need
-	 * to walk on to wake even the put_and_wait_on_page_locked() callers.
+	 * Let the waiter know we have done the page flag
+	 * handling for it (and the return value lets the
+	 * wakeup logic count exclusive wakeup events).
 	 */
-	if (test_bit(key->bit_nr, &key->page->flags))
-		return -1;
+	ret = (wait->flags & WQ_FLAG_EXCLUSIVE) != 0;
+	wait->flags |= WQ_FLAG_WOKEN;
+	wake_up_state(wait->private, mode);
 
-	return autoremove_wake_function(wait, mode, sync, key);
+	/*
+	 * Ok, we have successfully done what we're waiting for,
+	 * and we can unconditionally remove the wait entry.
+	 *
+	 * Note that this has to be the absolute last thing we do,
+	 * since after list_del_init(&wait->entry) the wait entry
+	 * might be de-allocated and the process might even have
+	 * exited.
+	 *
+	 * We _really_ should have a "list_del_init_careful()" to
+	 * properly pair with the unlocked "list_empty_careful()"
+	 * in finish_wait().
+	 */
+	smp_mb();
+	list_del_init(&wait->entry);
+	return ret;
 }
 
 static void wake_up_page_bit(struct page *page, int bit_nr)
@@ -1103,16 +1126,22 @@ enum behavior {
 	 */
 };
 
+static inline int trylock_page_bit_common(struct page *page, int bit_nr,
+	enum behavior behavior)
+{
+	return behavior == EXCLUSIVE ?
+		!test_and_set_bit(bit_nr, &page->flags) :
+		!test_bit(bit_nr, &page->flags);
+}
+
 static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	struct page *page, int bit_nr, int state, enum behavior behavior)
 {
 	struct wait_page_queue wait_page;
 	wait_queue_entry_t *wait = &wait_page.wait;
-	bool bit_is_set;
 	bool thrashing = false;
 	bool delayacct = false;
 	unsigned long pflags;
-	int ret = 0;
 
 	if (bit_nr == PG_locked &&
 	    !PageUptodate(page) && PageWorkingset(page)) {
@@ -1130,48 +1159,42 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	wait_page.page = page;
 	wait_page.bit_nr = bit_nr;
 
+	/*
+	 * Add ourselves to the wait queue.
+	 *
+	 * NOTE! This is where we also check the page
+	 * state synchronously the last time to see that
+	 * somebody didn't just clear the bit. Do the
+	 * SetPageWaiters() before that to let anybody
+	 * we just miss know they need to wake us up.
+	 */
+	spin_lock_irq(&q->lock);
+	SetPageWaiters(page);
+	if (!trylock_page_bit_common(page, bit_nr, behavior))
+		__add_wait_queue_entry_tail(q, wait);
+	spin_unlock_irq(&q->lock);
+
+	/*
+	 * From now on, all the logic will be based on
+	 * whether the wait entry is on the queue or not,
+	 * and the page bit testing (and setting) will be
+	 * done by the wake function, not us.
+	 *
+	 * We can drop our reference to the page.
+	 */
+	if (behavior == DROP)
+		put_page(page);
+
 	for (;;) {
-		spin_lock_irq(&q->lock);
-
-		if (likely(list_empty(&wait->entry))) {
-			__add_wait_queue_entry_tail(q, wait);
-			SetPageWaiters(page);
-		}
-
 		set_current_state(state);
 
-		spin_unlock_irq(&q->lock);
-
-		bit_is_set = test_bit(bit_nr, &page->flags);
-		if (behavior == DROP)
-			put_page(page);
-
-		if (likely(bit_is_set))
-			io_schedule();
-
-		if (behavior == EXCLUSIVE) {
-			if (!test_and_set_bit_lock(bit_nr, &page->flags))
-				break;
-		} else if (behavior == SHARED) {
-			if (!test_bit(bit_nr, &page->flags))
-				break;
-		}
-
-		if (signal_pending_state(state, current)) {
-			ret = -EINTR;
+		if (signal_pending_state(state, current))
 			break;
-		}
 
-		if (behavior == DROP) {
-			/*
-			 * We can no longer safely access page->flags:
-			 * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
-			 * there is a risk of waiting forever on a page reused
-			 * for something that keeps it locked indefinitely.
-			 * But best check for -EINTR above before breaking.
-			 */
+		if (wait->flags & WQ_FLAG_WOKEN)
 			break;
-		}
+
+		io_schedule();
 	}
 
 	finish_wait(q, wait);
@@ -1190,7 +1213,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	 * bother with signals either.
 	 */
 
-	return ret;
+	return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR;
 }
 
 void wait_on_page_bit(struct page *page, int bit_nr)
-- 
2.28.0.rc0.3.g1e25d3a62f