Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

On Mon, 30 Sept 2024 at 10:35, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
>
> Sep 27 00:51:20 <redactedhostname>13 kernel:  folio_wait_bit_common+0x13f/0x340
> Sep 27 00:51:20 <redactedhostname>13 kernel:  folio_wait_writeback+0x2b/0x80

Gaah. Every single case you point to is that folio_wait_writeback() case.

And this might be an old old annoyance.

folio_wait_writeback() is insane. It does

        while (folio_test_writeback(folio)) {
                trace_folio_wait_writeback(folio, folio_mapping(folio));
                folio_wait_bit(folio, PG_writeback);
        }

and the reason that is insane is that PG_writeback isn't some kind of
exclusive state. So folio_wait_bit() will return once somebody has
ended writeback, but *new* writeback can easily have been started
afterwards. So then we go back to wait...

And even after it eventually returns (possibly after having waited for
hundreds of other processes writing back that folio - imagine lots of
other threads doing writes to it and 'fdatasync()' or whatever) the
caller *still* can't actually assume that the writeback bit is clear,
because somebody else might have started writeback again.
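
To make that concrete (purely an illustration, not code from any
actual caller), even this is not a valid assumption without further
locking:

        /* Hypothetical caller, for illustration only: without the folio
         * lock, writeback can already have been started again by the
         * time the wait returns, so this warning can legitimately fire. */
        folio_wait_writeback(folio);
        WARN_ON_ONCE(folio_test_writeback(folio));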

Anyway, it's insane, but it's insane for a *reason*. We've tried to
fix this before, long before it was a folio op. See commit
c2407cf7d22d ("mm: make wait_on_page_writeback() wait for multiple
pending writebacks").
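
From memory (treat this as a rough sketch - the exact change is in
that commit), the fix was basically turning a one-shot "if" into the
"while" loop we still have today, in the pre-folio
wait_on_page_writeback():

        /* Rough sketch of the c2407cf7d22d change: re-check after every
         * wakeup instead of assuming one wakeup means writeback is done. */
        -       if (PageWriteback(page)) {
        +       while (PageWriteback(page)) {
                        trace_wait_on_page_writeback(page, page_mapping(page));
                        wait_on_page_bit(page, PG_writeback);
                }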

IOW, this code is known-broken and might have extreme unfairness
issues (although I had blissfully forgotten about it): while the
actual writeback *bit* itself is set and cleared atomically, the
wakeup for the bit is asynchronous and can be delayed almost
arbitrarily. So you can get basically spurious wakeups that come from
a previous clearing of the bit.

So the "wait many times" is crazy, but it's sadly a necessary crazy as
things are right now.

Now, many callers hold the page lock while doing this, and in that
case no new writeback should be started behind their back, so the
loop should only repeat a very limited number of times (a sketch of
that pattern is below).
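
Just to spell out the pattern (an illustrative sketch, not a quote of
any particular caller):

        /* Illustrative only: with the folio lock held, nobody should be
         * able to start new writeback on this folio, so the wait loop
         * finishes once the current writeback completes and the folio
         * stays out of writeback for as long as we hold the lock. */
        folio_lock(folio);
        folio_wait_writeback(folio);
        /* ... truncate / invalidate / otherwise mess with the folio ... */
        folio_unlock(folio);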

But "many" is not "all". For example, __filemap_fdatawait_range() very
much doesn't hold the lock on the pages it waits for, so afaik this
can cause that unfairness and starvation issue.
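
Roughly (simplified from memory, not the exact kernel code), that
path is just an unlocked wait over everything in the range that is
tagged for writeback, so new writeback can be started on the very
same folio while it sleeps:

        /* Simplified sketch in the style of __filemap_fdatawait_range():
         * no folio_lock() anywhere, just wait on each writeback-tagged
         * folio in the range. */
        struct folio_batch fbatch;
        pgoff_t index = start;

        folio_batch_init(&fbatch);
        while (index <= end &&
               filemap_get_folios_tag(mapping, &index, end,
                                      PAGECACHE_TAG_WRITEBACK, &fbatch)) {
                unsigned int i;

                for (i = 0; i < folio_batch_count(&fbatch); i++)
                        folio_wait_writeback(fbatch.folios[i]);
                folio_batch_release(&fbatch);
                cond_resched();
        }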

That said, while every one of your traces is for that
folio_wait_writeback(), the last one is for the truncate case, and
that one *does* hold the page lock and so shouldn't see this potential
unfairness issue.

So the code here is questionable, and might cause some issues, but the
starvation of folio_wait_writeback() can't explain _all_ the cases you
see.

                  Linus



