On Wed, Mar 16, 2022 at 08:59:39PM +0000, Matthew Wilcox wrote: > On Tue, Mar 15, 2022 at 03:07:10PM -0400, Brian Foster wrote: > > What seems to happen is that the majority of the fsync calls end up > > waiting on writeback of a particular page, the wakeup of the writeback > > bit on that page wakes a task that immediately resets PG_writeback on > > the page such that N other folio_wait_writeback() waiters see the bit > > still set and immediately place themselves back onto the tail of the > > wait queue. Meanwhile the waker task spins in the WQ_FLAG_BOOKMARK loop > > in folio_wake_bit() (backing off the lock for a cycle or so in each > > iteration) only to find the same bunch of tasks in the queue. This > > process repeats for a long enough amount of time to trigger the soft > > lockup warning. I've confirmed this spinning behavior with a tracepoint > > in the bookmark loop that indicates we're stuck for many hundreds of > > thousands of iterations (at least) of this loop when the soft lockup > > warning triggers. > > [...] > > > I've run a few quick experiments to try and corroborate this analysis. > > The problem goes away completely if I either back out the loop change in > > folio_wait_writeback() or bump WAITQUEUE_WALK_BREAK_CNT to something > > like 128 (i.e. greater than the total possible number of waiter tasks in > > this test). I've also played a few games with bookmark behavior mostly > > out of curiosity, but usually end up introducing other problems like > > missed wakeups, etc. > > As I recall, the bookmark hack was introduced in order to handle > lock_page() problems. It wasn't really supposed to handle writeback, > but nobody thought it would cause any harm (and indeed, it didn't at the > time). So how about we only use bookmarks for lock_page(), since > lock_page() usually doesn't have the multiple-waker semantics that > writeback has? > Oh, interesting. I wasn't aware of the tenuous status of the bookmark code. This is indeed much nicer than anything I was playing with. I suspect it will address the problem, but I'll throw it at my test env for a while and follow up.. thanks! Brian > (this is more in the spirit of "minimal patch" -- I think initialising > the bookmark should be moved to folio_unlock()). > > diff --git a/mm/filemap.c b/mm/filemap.c > index b2728eb52407..9ee3c5f1f489 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1146,26 +1146,28 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, > return (flags & WQ_FLAG_EXCLUSIVE) != 0; > } > > -static void folio_wake_bit(struct folio *folio, int bit_nr) > +static void folio_wake_bit(struct folio *folio, int bit_nr, > + wait_queue_entry_t *bookmark) > { > wait_queue_head_t *q = folio_waitqueue(folio); > struct wait_page_key key; > unsigned long flags; > - wait_queue_entry_t bookmark; > > key.folio = folio; > key.bit_nr = bit_nr; > key.page_match = 0; > > - bookmark.flags = 0; > - bookmark.private = NULL; > - bookmark.func = NULL; > - INIT_LIST_HEAD(&bookmark.entry); > + if (bookmark) { > + bookmark->flags = 0; > + bookmark->private = NULL; > + bookmark->func = NULL; > + INIT_LIST_HEAD(&bookmark->entry); > + } > > spin_lock_irqsave(&q->lock, flags); > - __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark); > + __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark); > > - while (bookmark.flags & WQ_FLAG_BOOKMARK) { > + while (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) { > /* > * Take a breather from holding the lock, > * allow pages that finish wake up asynchronously > @@ -1175,7 +1177,7 @@ static void folio_wake_bit(struct folio *folio, int bit_nr) > spin_unlock_irqrestore(&q->lock, flags); > cpu_relax(); > spin_lock_irqsave(&q->lock, flags); > - __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark); > + __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark); > } > > /* > @@ -1204,7 +1206,7 @@ static void folio_wake(struct folio *folio, int bit) > { > if (!folio_test_waiters(folio)) > return; > - folio_wake_bit(folio, bit); > + folio_wake_bit(folio, bit, NULL); > } > > /* > @@ -1554,12 +1556,15 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem > */ > void folio_unlock(struct folio *folio) > { > + wait_queue_entry_t bookmark; > + > /* Bit 7 allows x86 to check the byte's sign bit */ > BUILD_BUG_ON(PG_waiters != 7); > BUILD_BUG_ON(PG_locked > 7); > VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); > + > if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0))) > - folio_wake_bit(folio, PG_locked); > + folio_wake_bit(folio, PG_locked, &bookmark); > } > EXPORT_SYMBOL(folio_unlock); > > @@ -1578,7 +1583,7 @@ void folio_end_private_2(struct folio *folio) > { > VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio); > clear_bit_unlock(PG_private_2, folio_flags(folio, 0)); > - folio_wake_bit(folio, PG_private_2); > + folio_wake_bit(folio, PG_private_2, NULL); > folio_put(folio); > } > EXPORT_SYMBOL(folio_end_private_2); >