Re: kernel BUG at mm/page-writeback.c:LINE!

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 4 Jan 2021 13:52:29 -0800

On Mon, Jan 4, 2021 at 12:41 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> >
> > kernel BUG at mm/page-writeback.c:2241!
> > Call Trace:
> >  mpage_writepages+0xd8/0x230 fs/mpage.c:714
> >  do_writepages+0xec/0x290 mm/page-writeback.c:2352
> >  __filemap_fdatawrite_range+0x2a1/0x380 mm/filemap.c:422
> >  fat_cont_expand+0x169/0x230 fs/fat/file.c:235
> >  fat_setattr+0xac2/0xf40 fs/fat/file.c:501
> >  notify_change+0xb60/0x10a0 fs/attr.c:336
> >  do_truncate+0x134/0x1f0 fs/open.c:64
> >  do_sys_ftruncate+0x703/0x860 fs/open.c:195
> >  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Well that's exciting. write_cache_pages() does:
>
>                         if (PageWriteback(page)) {
>                                 if (wbc->sync_mode != WB_SYNC_NONE)
>                                         wait_on_page_writeback(page);
>                                 else
>                                         goto continue_unlock;
>                         }
>
> bang -->>               BUG_ON(PageWriteback(page));

This is the same situation that was discussed earlier, and that we
thought Hugh commit 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and
BUG_ON(PageWriteback)") should fix.

Basically, the theory was that the sequence

                if (PageWriteback(page))
                        wait_on_page_writeback(page);
                BUG_ON(PageWriteback(page));

could have the wake-up of the _previous_ owner of the page happen, and
wake up the wait_on_page_writeback() waiter, but then a new owner of
the page would re-allocate it and mark it for writeback.

> So either wait_on_page_writeback() simply failed to work (returned
> without waiting) or someone came in and unexpectedly set PG_writeback a
> second time.

So Hugh found at least _one_ way that that "someone came in and
unexpectedly set PG_writeback a second time" could happen.

But that fix by Hugh is already in that  HEAD commit that syzbot is
now reporting, so there's something else going on.

> Linus, how confident are you in those wait_on_page_bit_common()
> changes?

Pretty confident. The atomicity of the bitops themselves is fairly simple.

But in the writeback bit? No. The old code would basically _loop_ if
it was woken up and the writeback bit was set again, and would hide
any problems with it.

The new code basically goes "ok, the writeback bit was clear at one
point, so I've waited enough".

We could easily turn the "if ()" in wait_on_page_writeback() into a "while()".

But honestly, it does smell to me like the bug is always in the caller
not having serialized with whatever actually starts writeback. High
figured out one such case.

This code holds the page lock, but I don't see where
set_page_writeback() would always be done with the page lock held. So
what really protects against PG_writeback simply being set again?

The whole BUG_ON() seems entirely buggy to me.

In fact, even if you hold the page lock while doing
set_page_writeback(), since the actual IO does *NOT* hold the page
lock, the unlock happens without it. So even if every single case of
setting the page writeback were to hold the page lock, what keeps this
from happening:

CPU1 = end previous writeback
CPU2 = start new writeback under page lock
CPU3 = write_cache_pages()

  CPU1          CPU2            CPU3
  ----          ----            ----

  end_page_writeback()
    test_clear_page_writeback(page)
    ... delayed...

                lock_page();
                set_page_writeback()
                unlock_page()

                                lock_page()
                                wait_on_page_writeback();

    wake_up_page(page, PG_writeback);
    .. wakes up CPU3 ..

                                BUG_ON(PageWriteback(page));

IOW, that BUG_ON() really feels entirely bogus to me. Notice how it
wasn't actually serialized with the waking up of the _previous_ bit.

Could we make the wait_on_page_writeback() just loop if it sees the
page under writeback again? Sure.

Could we make the wait_on_page_bit_common() code say "if this is
PG_writeback, I won't wake it up after all, because the bit is set
again?" Sure.

But I feel it's really that end_page_writeback() itself is
fundamentally buggy, because the "wakeup" is not atomic with the bit
clearing _and_ it doesn't actually hold the page lock that is
allegedly serializing this all.

That raciness was what caused the "stale wakeup from previous owner"
thing too. And I think that Hugh fixed the page re-use case, but the
fundamental problem of end_page_writeback() kind of remained.

And yes, I think this was all hidden by wait_on_page_writeback()
effectively looping over the "PageWriteback(page)" test because of how
wait_on_page_bit() worked.

So the one-liner of changing the "if" to "while" in
wait_on_page_writeback() should get us back to what we used to do.

Except I still get the feeling that the bug really is not in
wait_on_page_writeback(), but in the end_page_writeback() side.

Comments? I'm perfectly happy doing the one-liner. I would just be
_happier_ with end_page_writeback() having the serialization..

The real problem is that "wake_up_page(page, bit)" is not the thing
that actually clears the bit. So there's a fundamental race between
clearing the bit and waking something up.

Which makes me think that the best option would actually be to move
the bit clearing _into_ wake_up_page(). But that looks like a very big
change.

                   Linus