On Mon, 11 Jan 2021, Greg Kroah-Hartman wrote: > From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> > > commit c2407cf7d22d0c0d94cf20342b3b8f06f1d904e7 upstream. > > Ever since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() > logic") we've had some very occasional reports of BUG_ON(PageWriteback) > in write_cache_pages(), which we thought we already fixed in commit > 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)"). > > But syzbot just reported another one, even with that commit in place. > > And it turns out that there's a simpler way to trigger the BUG_ON() than > the one Hugh found with page re-use. It all boils down to the fact that > the page writeback is ostensibly serialized by the page lock, but that > isn't actually really true. > > Yes, the people _setting_ writeback all do so under the page lock, but > the actual clearing of the bit - and waking up any waiters - happens > without any page lock. > > This gives us this fairly simple race condition: > > CPU1 = end previous writeback > CPU2 = start new writeback under page lock > CPU3 = write_cache_pages() > > CPU1 CPU2 CPU3 > ---- ---- ---- > > end_page_writeback() > test_clear_page_writeback(page) > ... delayed... > > lock_page(); > set_page_writeback() > unlock_page() > > lock_page() > wait_on_page_writeback(); > > wake_up_page(page, PG_writeback); > .. wakes up CPU3 .. > > BUG_ON(PageWriteback(page)); > > where the BUG_ON() happens because we woke up the PG_writeback bit > becasue of the _previous_ writeback, but a new one had already been > started because the clearing of the bit wasn't actually atomic wrt the > actual wakeup or serialized by the page lock. > > The reason this didn't use to happen was that the old logic in waiting > on a page bit would just loop if it ever saw the bit set again. > > The nice proper fix would probably be to get rid of the whole "wait for > writeback to clear, and then set it" logic in the writeback path, and > replace it with an atomic "wait-to-set" (ie the same as we have for page > locking: we set the page lock bit with a single "lock_page()", not with > "wait for lock bit to clear and then set it"). > > However, out current model for writeback is that the waiting for the > writeback bit is done by the generic VFS code (ie write_cache_pages()), > but the actual setting of the writeback bit is done much later by the > filesystem ".writepages()" function. > > IOW, to make the writeback bit have that same kind of "wait-to-set" > behavior as we have for page locking, we'd have to change our roughly > ~50 different writeback functions. Painful. > > Instead, just make "wait_on_page_writeback()" loop on the very unlikely > situation that the PG_writeback bit is still set, basically re-instating > the old behavior. This is very non-optimal in case of contention, but > since we only ever set the bit under the page lock, that situation is > controlled. > > Reported-by: syzbot+2fc0712f8f8b8b8fa0ef@xxxxxxxxxxxxxxxxxxxxxxxxx > Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") > Acked-by: Hugh Dickins <hughd@xxxxxxxxxx> > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx> > Cc: stable@xxxxxxxxxx > Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> > Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> I think it's too early to push this one through to stable: Linus mentioned on Friday that Michael Larabel of Phoronix has observed a performance regression from this commit. Correctness outweighs performance of course, but I think stable users might see the performance issue much sooner than they would ever see the BUG fixed. Wait a bit, while we think some more about what to try next? Hugh > > --- > mm/page-writeback.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -2826,7 +2826,7 @@ EXPORT_SYMBOL(__test_set_page_writeback) > */ > void wait_on_page_writeback(struct page *page) > { > - if (PageWriteback(page)) { > + while (PageWriteback(page)) { > trace_wait_on_page_writeback(page, page_mapping(page)); > wait_on_page_bit(page, PG_writeback); > }