Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 30 Sep 2024 16:53:05 -0700

On Mon, 30 Sept 2024 at 13:57, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> Could we break out if folio->mapping has changed?  Clearly if it has,
> we're no longer waiting for the folio we thought we were waiting for,
> but for a folio which now belongs to a different file.

Sounds like a sane check to me, but it's also not clear that this
would make any difference.

The most likely reason for starvation I can see is a slow thread
(possibly due to cgroup throttling like Christian alluded to) would
simply be continually unlucky, because every time it gets woken up,
some other thread has already dirtied the data and caused writeback
again.

I would think that kind of behavior (perhaps some DB transaction
header kind of folio) would be more likely than the mapping changing
(and then remaining under writeback for some other mapping).

But I really don't know.

I would much prefer to limit the folio_wait_bit() loop based on something else.

For example, the basic reason for that loop (unless there is some
other hidden one) is that the folio writeback bit is not atomic wrt
the wakeup. Maybe we could *make* it atomic, by simply taking the
folio waitqueue lock before clearing the bit?

(Only if it has the "waiters" bit set, of course!)

Handwavy.

Anyway, this writeback handling is nasty. folio_end_writeback() has a
big comment about the subtle folio reference issue too, and ignoring
that we also have this:

        if (__folio_end_writeback(folio))
                folio_wake_bit(folio, PG_writeback);

(which is the cause of the non-atomicity: __folio_end_writeback() will
clear the bit, and return the "did we have waiters", and then
folio_wake_bit() will get the waitqueue lock and wake people up).

And notice how __folio_end_writeback() clears the bit with

                ret = folio_xor_flags_has_waiters(folio, 1 << PG_writeback);

which does that "clear bit and look it it had waiters" atomically. But
that function then has a comment that says

 * This must only be used for flags which are changed with the folio
 * lock held.  For example, it is unsafe to use for PG_dirty as that
 * can be set without the folio lock held.  [...]

but the code that uses it here does *NOT* hold the folio lock.

I think the comment is wrong, and the code is fine (the important
point is that the folio lock _serialized_ the writers, and while
clearing doesn't hold the folio lock, you can't clear it without
setting it, and setting the writeback flag *does* hold the folio
lock).

So my point is not that this code is wrong, but that this code is all
kinds of subtle and complex. I think it would be good to change the
rules so that we serialize with waiters, but being complex and subtle
means it sounds all kinds of nasty.

            Linus