On Thu, Mar 17, 2022 at 09:51:44AM -0400, Brian Foster wrote:
> On Wed, Mar 16, 2022 at 08:59:39PM +0000, Matthew Wilcox wrote:
> > On Tue, Mar 15, 2022 at 03:07:10PM -0400, Brian Foster wrote:
> > > What seems to happen is that the majority of the fsync calls end up
> > > waiting on writeback of a particular page, the wakeup of the writeback
> > > bit on that page wakes a task that immediately resets PG_writeback on
> > > the page such that N other folio_wait_writeback() waiters see the bit
> > > still set and immediately place themselves back onto the tail of the
> > > wait queue. Meanwhile the waker task spins in the WQ_FLAG_BOOKMARK loop
> > > in folio_wake_bit() (backing off the lock for a cycle or so in each
> > > iteration) only to find the same bunch of tasks in the queue. This
> > > process repeats for a long enough amount of time to trigger the soft
> > > lockup warning. I've confirmed this spinning behavior with a tracepoint
> > > in the bookmark loop that indicates we're stuck for many hundreds of
> > > thousands of iterations (at least) of this loop when the soft lockup
> > > warning triggers.
> > 
> > [...]
> > 
> > > I've run a few quick experiments to try and corroborate this analysis.
> > > The problem goes away completely if I either back out the loop change in
> > > folio_wait_writeback() or bump WAITQUEUE_WALK_BREAK_CNT to something
> > > like 128 (i.e. greater than the total possible number of waiter tasks in
> > > this test). I've also played a few games with bookmark behavior mostly
> > > out of curiosity, but usually end up introducing other problems like
> > > missed wakeups, etc.
> > 
> > As I recall, the bookmark hack was introduced in order to handle
> > lock_page() problems. It wasn't really supposed to handle writeback,
> > but nobody thought it would cause any harm (and indeed, it didn't at the
> > time). So how about we only use bookmarks for lock_page(), since
> > lock_page() usually doesn't have the multiple-waker semantics that
> > writeback has?
> > 
> Oh, interesting. I wasn't aware of the tenuous status of the bookmark
> code. This is indeed much nicer than anything I was playing with. I
> suspect it will address the problem, but I'll throw it at my test env
> for a while and follow up.. thanks!
> 

So I'm not clear on where we're at with this patch vs. something that
moves (or drops) the wb wait loop vs. the wait and set thing (which
seems more invasive and longer term), but FWIW.. this patch survived
over 10k iterations of the reproducer test yesterday (the problem
typically reproduces in ~1k or so iterations on average) and an
overnight fstests run without regression.

Brian

> Brian
> > (this is more in the spirit of "minimal patch" -- I think
> > initialising the bookmark should be moved to folio_unlock()).
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index b2728eb52407..9ee3c5f1f489 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1146,26 +1146,28 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
> >  	return (flags & WQ_FLAG_EXCLUSIVE) != 0;
> >  }
> >  
> > -static void folio_wake_bit(struct folio *folio, int bit_nr)
> > +static void folio_wake_bit(struct folio *folio, int bit_nr,
> > +		wait_queue_entry_t *bookmark)
> >  {
> >  	wait_queue_head_t *q = folio_waitqueue(folio);
> >  	struct wait_page_key key;
> >  	unsigned long flags;
> > -	wait_queue_entry_t bookmark;
> >  
> >  	key.folio = folio;
> >  	key.bit_nr = bit_nr;
> >  	key.page_match = 0;
> >  
> > -	bookmark.flags = 0;
> > -	bookmark.private = NULL;
> > -	bookmark.func = NULL;
> > -	INIT_LIST_HEAD(&bookmark.entry);
> > +	if (bookmark) {
> > +		bookmark->flags = 0;
> > +		bookmark->private = NULL;
> > +		bookmark->func = NULL;
> > +		INIT_LIST_HEAD(&bookmark->entry);
> > +	}
> >  
> >  	spin_lock_irqsave(&q->lock, flags);
> > -	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
> > +	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark);
> >  
> > -	while (bookmark.flags & WQ_FLAG_BOOKMARK) {
> > +	while (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) {
> >  		/*
> >  		 * Take a breather from holding the lock,
> >  		 * allow pages that finish wake up asynchronously
> > @@ -1175,7 +1177,7 @@ static void folio_wake_bit(struct folio *folio, int bit_nr)
> >  		spin_unlock_irqrestore(&q->lock, flags);
> >  		cpu_relax();
> >  		spin_lock_irqsave(&q->lock, flags);
> > -		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
> > +		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, bookmark);
> >  	}
> >  
> >  	/*
> > @@ -1204,7 +1206,7 @@ static void folio_wake(struct folio *folio, int bit)
> >  {
> >  	if (!folio_test_waiters(folio))
> >  		return;
> > -	folio_wake_bit(folio, bit);
> > +	folio_wake_bit(folio, bit, NULL);
> >  }
> >  
> >  /*
> > @@ -1554,12 +1556,15 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem
> >   */
> >  void folio_unlock(struct folio *folio)
> >  {
> > +	wait_queue_entry_t bookmark;
> > +
> >  	/* Bit 7 allows x86 to check the byte's sign bit */
> >  	BUILD_BUG_ON(PG_waiters != 7);
> >  	BUILD_BUG_ON(PG_locked > 7);
> >  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> > +
> >  	if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0)))
> > -		folio_wake_bit(folio, PG_locked);
> > +		folio_wake_bit(folio, PG_locked, &bookmark);
> >  }
> >  EXPORT_SYMBOL(folio_unlock);
> >  
> > @@ -1578,7 +1583,7 @@ void folio_end_private_2(struct folio *folio)
> >  {
> >  	VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
> >  	clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
> > -	folio_wake_bit(folio, PG_private_2);
> > +	folio_wake_bit(folio, PG_private_2, NULL);
> >  	folio_put(folio);
> >  }
> >  EXPORT_SYMBOL(folio_end_private_2);
> > 
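
For anyone following the thread without a tree handy, the
folio_wait_writeback() loop referenced above is roughly the following
(paraphrased from mm/page-writeback.c on a ~5.17-era tree, so treat it
as a sketch rather than a verbatim quote):

void folio_wait_writeback(struct folio *folio)
{
	/*
	 * Re-check the bit after every wakeup: if another task has
	 * already started the next writeback on this folio, go back to
	 * sleep.  That re-queueing is what keeps feeding the bookmark
	 * walk in folio_wake_bit() in the scenario described above.
	 */
	while (folio_test_writeback(folio)) {
		trace_folio_wait_writeback(folio, folio_mapping(folio));
		folio_wait_bit(folio, PG_writeback);
	}
}

That while loop is the "loop change" mentioned above; reverting it to a
single check, or bumping WAITQUEUE_WALK_BREAK_CNT past the number of
waiters in the test, both make the symptom go away, which is consistent
with the analysis.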
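
And for completeness, the reason a NULL bookmark gets folio_wake() back
to the old single-pass behaviour is the break-and-requeue logic in
__wake_up_common().  This is a simplified sketch of kernel/sched/wait.c
from memory (WAITQUEUE_WALK_BREAK_CNT is 64 upstream), so the details
may be slightly off:

static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
			int nr_exclusive, int wake_flags, void *key,
			wait_queue_entry_t *bookmark)
{
	wait_queue_entry_t *curr, *next;
	int cnt = 0;

	/* Resume from a previously parked bookmark, if there is one. */
	if (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) {
		curr = list_next_entry(bookmark, entry);
		list_del(&bookmark->entry);
		bookmark->flags = 0;
	} else
		curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);

	if (&curr->entry == &wq_head->head)
		return nr_exclusive;

	list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
		unsigned flags = curr->flags;
		int ret;

		/* Skip other walkers' bookmarks queued on this list. */
		if (flags & WQ_FLAG_BOOKMARK)
			continue;

		ret = curr->func(curr, mode, wake_flags, key);
		if (ret < 0)
			break;
		if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;

		/*
		 * Only when the caller supplied a bookmark does the walk
		 * park it and bail out every WAITQUEUE_WALK_BREAK_CNT
		 * entries so the lock can be dropped; with a NULL
		 * bookmark the whole list is walked in a single pass.
		 */
		if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
				(&next->entry != &wq_head->head)) {
			bookmark->flags = WQ_FLAG_BOOKMARK;
			list_add_tail(&bookmark->entry, &next->entry);
			break;
		}
	}

	return nr_exclusive;
}

So with the patch above, PG_writeback and PG_private_2 wakeups walk the
whole queue under q->lock in one go (as they did before the bookmark
code was added), while folio_unlock() keeps the bounded lock hold time
the bookmark was introduced for.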