Hi, sorry for being late in discussion (was offline or very busy with other stuff). I hope I got it right and this is the latest version of your patches. Btw. do you still think that increasing PAGE_WAIT_TABLE_BITS is reasonable. In the meantime I have learned that the customer suffering from the issue is very unlikely to reboot the machine anytime soon or even willing to test a non-trivial patch. We do not own any machine which exhibit this problem unfortunately. So it is quite unlikely I can help with testing. Also does it make sense to put this into mmotm tree for a while to get a larger testing coverage? I didn't get to review these patch and it will take some time to do that because this is really subtle area. Thanks for helping out and sorry that I cannot really help much more now. On Fri 24-07-20 16:25:56, Linus Torvalds wrote: > From 0bccb60841cc52a9aa6e9cc6b7eff59d1983e8fa Mon Sep 17 00:00:00 2001 > From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> > Date: Thu, 23 Jul 2020 10:16:49 -0700 > Subject: [PATCH 1/2] mm: rewrite wait_on_page_bit_common() logic > > It turns out that wait_on_page_bit_common() had several problems, > ranging from just unfair behavioe due to re-queueing at the end of the > wait queue when re-trying, and an outright bug that could result in > missed wakeups (but probably never happened in practice). > > This rewrites the whole logic to avoid both issues, by simply moving the > logic to check (and possibly take) the bit lock into the wakeup path > instead. > > That makes everything much more straightforward, and means that we never > need to re-queue the wait entry: if we get woken up, we'll be notified > through WQ_FLAG_WOKEN, and the wait queue entry will have been removed, > and everything will have been done for us. > > Link: https://lore.kernel.org/lkml/CAHk-=wjJA2Z3kUFb-5s=6+n0qbTs8ELqKFt9B3pH85a8fGD73w@xxxxxxxxxxxxxx/ > Link: https://lore.kernel.org/lkml/alpine.LSU.2.11.2007221359450.1017@eggly.anvils/ > Reported-by: Oleg Nesterov <oleg@xxxxxxxxxx> > Reported-by: Hugh Dickins <hughd@xxxxxxxxxx> > Cc: Michal Hocko <mhocko@xxxxxxxx> > Reviewed-by: Oleg Nesterov <oleg@xxxxxxxxxx> > Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> > --- > mm/filemap.c | 132 +++++++++++++++++++++++++++++++++------------------ > 1 file changed, 85 insertions(+), 47 deletions(-) > > diff --git a/mm/filemap.c b/mm/filemap.c > index 385759c4ce4b..8c3d3e233d37 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1002,6 +1002,7 @@ struct wait_page_queue { > > static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg) > { > + int ret; > struct wait_page_key *key = arg; > struct wait_page_queue *wait_page > = container_of(wait, struct wait_page_queue, wait); > @@ -1014,17 +1015,40 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, > return 0; > > /* > - * Stop walking if it's locked. > - * Is this safe if put_and_wait_on_page_locked() is in use? > - * Yes: the waker must hold a reference to this page, and if PG_locked > - * has now already been set by another task, that task must also hold > - * a reference to the *same usage* of this page; so there is no need > - * to walk on to wake even the put_and_wait_on_page_locked() callers. > + * If it's an exclusive wait, we get the bit for it, and > + * stop walking if we can't. > + * > + * If it's a non-exclusive wait, then the fact that this > + * wake function was called means that the bit already > + * was cleared, and we don't care if somebody then > + * re-took it. > */ > - if (test_bit(key->bit_nr, &key->page->flags)) > - return -1; > + ret = 0; > + if (wait->flags & WQ_FLAG_EXCLUSIVE) { > + if (test_and_set_bit(key->bit_nr, &key->page->flags)) > + return -1; > + ret = 1; > + } > + wait->flags |= WQ_FLAG_WOKEN; > > - return autoremove_wake_function(wait, mode, sync, key); > + wake_up_state(wait->private, mode); > + > + /* > + * Ok, we have successfully done what we're waiting for, > + * and we can unconditionally remove the wait entry. > + * > + * Note that this has to be the absolute last thing we do, > + * since after list_del_init(&wait->entry) the wait entry > + * might be de-allocated and the process might even have > + * exited. > + * > + * We _really_ should have a "list_del_init_careful()" to > + * properly pair with the unlocked "list_empty_careful()" > + * in finish_wait(). > + */ > + smp_mb(); > + list_del_init(&wait->entry); > + return ret; > } > > static void wake_up_page_bit(struct page *page, int bit_nr) > @@ -1103,16 +1127,31 @@ enum behavior { > */ > }; > > +/* > + * Attempt to check (or get) the page bit, and mark the > + * waiter woken if successful. > + */ > +static inline bool trylock_page_bit_common(struct page *page, int bit_nr, > + struct wait_queue_entry *wait) > +{ > + if (wait->flags & WQ_FLAG_EXCLUSIVE) { > + if (test_and_set_bit(bit_nr, &page->flags)) > + return false; > + } else if (test_bit(bit_nr, &page->flags)) > + return false; > + > + wait->flags |= WQ_FLAG_WOKEN; > + return true; > +} > + > static inline int wait_on_page_bit_common(wait_queue_head_t *q, > struct page *page, int bit_nr, int state, enum behavior behavior) > { > struct wait_page_queue wait_page; > wait_queue_entry_t *wait = &wait_page.wait; > - bool bit_is_set; > bool thrashing = false; > bool delayacct = false; > unsigned long pflags; > - int ret = 0; > > if (bit_nr == PG_locked && > !PageUptodate(page) && PageWorkingset(page)) { > @@ -1130,48 +1169,47 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, > wait_page.page = page; > wait_page.bit_nr = bit_nr; > > - for (;;) { > - spin_lock_irq(&q->lock); > + /* > + * Do one last check whether we can get the > + * page bit synchronously. > + * > + * Do the SetPageWaiters() marking before that > + * to let any waker we _just_ missed know they > + * need to wake us up (otherwise they'll never > + * even go to the slow case that looks at the > + * page queue), and add ourselves to the wait > + * queue if we need to sleep. > + * > + * This part needs to be done under the queue > + * lock to avoid races. > + */ > + spin_lock_irq(&q->lock); > + SetPageWaiters(page); > + if (!trylock_page_bit_common(page, bit_nr, wait)) > + __add_wait_queue_entry_tail(q, wait); > + spin_unlock_irq(&q->lock); > > - if (likely(list_empty(&wait->entry))) { > - __add_wait_queue_entry_tail(q, wait); > - SetPageWaiters(page); > - } > + /* > + * From now on, all the logic will be based on > + * the WQ_FLAG_WOKEN flag, and the and the page > + * bit testing (and setting) will be - or has > + * already been - done by the wake function. > + * > + * We can drop our reference to the page. > + */ > + if (behavior == DROP) > + put_page(page); > > + for (;;) { > set_current_state(state); > > - spin_unlock_irq(&q->lock); > - > - bit_is_set = test_bit(bit_nr, &page->flags); > - if (behavior == DROP) > - put_page(page); > - > - if (likely(bit_is_set)) > - io_schedule(); > - > - if (behavior == EXCLUSIVE) { > - if (!test_and_set_bit_lock(bit_nr, &page->flags)) > - break; > - } else if (behavior == SHARED) { > - if (!test_bit(bit_nr, &page->flags)) > - break; > - } > - > - if (signal_pending_state(state, current)) { > - ret = -EINTR; > + if (signal_pending_state(state, current)) > break; > - } > > - if (behavior == DROP) { > - /* > - * We can no longer safely access page->flags: > - * even if CONFIG_MEMORY_HOTREMOVE is not enabled, > - * there is a risk of waiting forever on a page reused > - * for something that keeps it locked indefinitely. > - * But best check for -EINTR above before breaking. > - */ > + if (wait->flags & WQ_FLAG_WOKEN) > break; > - } > + > + io_schedule(); > } > > finish_wait(q, wait); > @@ -1190,7 +1228,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, > * bother with signals either. > */ > > - return ret; > + return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR; > } > > void wait_on_page_bit(struct page *page, int bit_nr) > -- > 2.28.0.rc0.3.g1e25d3a62f > > From 93f0263b9b952a1c449cec56a6aadf6320e821f9 Mon Sep 17 00:00:00 2001 > From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> > Date: Thu, 23 Jul 2020 12:33:41 -0700 > Subject: [PATCH 2/2] list: add "list_del_init_careful()" to go with > "list_empty_careful()" > > That gives us ordering guarantees around the pair. > > Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> > --- > include/linux/list.h | 20 +++++++++++++++++++- > kernel/sched/wait.c | 2 +- > mm/filemap.c | 7 +------ > 3 files changed, 21 insertions(+), 8 deletions(-) > > diff --git a/include/linux/list.h b/include/linux/list.h > index aff44d34f4e4..0d0d17a10d25 100644 > --- a/include/linux/list.h > +++ b/include/linux/list.h > @@ -282,6 +282,24 @@ static inline int list_empty(const struct list_head *head) > return READ_ONCE(head->next) == head; > } > > +/** > + * list_del_init_careful - deletes entry from list and reinitialize it. > + * @entry: the element to delete from the list. > + * > + * This is the same as list_del_init(), except designed to be used > + * together with list_empty_careful() in a way to guarantee ordering > + * of other memory operations. > + * > + * Any memory operations done before a list_del_init_careful() are > + * guaranteed to be visible after a list_empty_careful() test. > + */ > +static inline void list_del_init_careful(struct list_head *entry) > +{ > + __list_del_entry(entry); > + entry->prev = entry; > + smp_store_release(&entry->next, entry); > +} > + > /** > * list_empty_careful - tests whether a list is empty and not being modified > * @head: the list to test > @@ -297,7 +315,7 @@ static inline int list_empty(const struct list_head *head) > */ > static inline int list_empty_careful(const struct list_head *head) > { > - struct list_head *next = head->next; > + struct list_head *next = smp_load_acquire(&head->next); > return (next == head) && (next == head->prev); > } > > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c > index ba059fbfc53a..01f5d3020589 100644 > --- a/kernel/sched/wait.c > +++ b/kernel/sched/wait.c > @@ -389,7 +389,7 @@ int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, i > int ret = default_wake_function(wq_entry, mode, sync, key); > > if (ret) > - list_del_init(&wq_entry->entry); > + list_del_init_careful(&wq_entry->entry); > > return ret; > } > diff --git a/mm/filemap.c b/mm/filemap.c > index 8c3d3e233d37..991503bbf922 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1041,13 +1041,8 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, > * since after list_del_init(&wait->entry) the wait entry > * might be de-allocated and the process might even have > * exited. > - * > - * We _really_ should have a "list_del_init_careful()" to > - * properly pair with the unlocked "list_empty_careful()" > - * in finish_wait(). > */ > - smp_mb(); > - list_del_init(&wait->entry); > + list_del_init_careful(&wait->entry); > return ret; > } > > -- > 2.28.0.rc0.3.g1e25d3a62f > -- Michal Hocko SUSE Labs