Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

Christian Theune <ct@xxxxxxxxxxxxxxx> · Mon, 16 Sep 2024 09:14:45 +0200

> On 16. Sep 2024, at 02:00, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Thu, Sep 12, 2024 at 03:25:50PM -0700, Linus Torvalds wrote:
>> On Thu, 12 Sept 2024 at 15:12, Jens Axboe <axboe@xxxxxxxxx> wrote:
>> Honestly, the fact that it hasn't been reverted after apparently
>> people knowing about it for months is a bit shocking to me. Filesystem
>> people tend to take unknown corruption issues as a big deal. What
>> makes this so special? Is it because the XFS people don't consider it
>> an XFS issue, so...
> 
> I don't think this is a data corruption/loss problem - it certainly
> hasn't ever appeared that way to me.  The "data loss" appeared to be
> in incomplete postgres dump files after the system was rebooted and
> this is exactly what would happen when you randomly crash the
> system. i.e. dirty data in memory is lost, and application data
> being written at the time is in an inconsistent state after the
> system recovers. IOWs, there was no clear evidence of actual data
> corruption occurring, and data loss is definitely expected when the
> page cache iteration hangs and the system is forcibly rebooted
> without being able to sync or unmount the filesystems…
> All the hangs seem to be caused by folio lookup getting stuck
> on a rogue xarray entry in truncate or readahead. If we find an
> invalid entry or a folio from a different mapping or with an
> unexpected index, we skip it and try again.  Hence this does not
> appear to be a data corruption vector, either - it results in a
> livelock from endless retry because of the bad entry in the xarray.
> This endless retry livelock appears to be what is being reported.
> 
> IOWs, there is no evidence of real runtime data corruption or loss
> from this pagecache livelock bug.  We also haven't heard of any
> random file data corruption events since we've enabled large folios
> on XFS. Hence there really is no evidence to indicate that there is
> a large folio xarray lookup bug that results in data corruption in
> the existing code, and therefore there is no obvious reason for
> turning off the functionality we are already building significant
> new functionality on top of.

Right, understood. 

However, the timeline of one of the encounters involving PostgreSQL (the first comment in Bugzilla) still makes me feel uneasy:

T0                   : one postgresql process blocked with a different trace (not involving xas_load)
T+a few minutes      : another process stuck with the relevant xas_load/descend trace
T+a few more minutes : other processes blocked in xas_load (this time the systemd journal)
T+14m                : the journal gets coredumped, likely due to some watchdog 

Things go back to normal.

T+14h                : another postgres process gets fully stuck on the xas_load/descend trace

I agree with your analysis for the case where the process stays stuck in an infinite loop. However, I’ve seen at least one instance where it appears to have left the loop at some point, and IMHO that is a condition that could allow data corruption.
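
To make the distinction concrete, here is a toy model in Python. It is not kernel code and makes no claim about how the xarray or the page cache is actually implemented. The names `lookup_with_retry` and `read_slot`, as well as the retry cap, are made up for illustration (the behaviour described above has no cap, which is what makes it a livelock). It only shows that a skip-and-retry lookup never terminates while the bad entry persists, but does terminate if the entry changes at some point:

```python
def lookup_with_retry(read_slot, mapping, index, max_retries=1000):
    """Toy skip-and-retry lookup (hypothetical, not the kernel's logic).

    read_slot(attempt) returns whatever currently sits in the slot:
    a (mapping, index) tuple standing in for a folio, or None for an
    invalid entry. Returns the number of retries needed before a
    matching entry was seen, or None if the cap was reached.
    """
    for attempt in range(max_retries):
        entry = read_slot(attempt)
        if entry == (mapping, index):
            return attempt
        # Invalid entry, wrong mapping, or unexpected index: skip and retry.
    return None  # cap reached; stands in for the endless retry


# Case 1: the bad entry never goes away, so the lookup never succeeds.
stuck = lookup_with_retry(lambda attempt: ("other", 7), "mine", 7)

# Case 2: the slot changes after three attempts and the loop is left.
recovered = lookup_with_retry(
    lambda attempt: ("mine", 7) if attempt >= 3 else None, "mine", 7
)
```

The second case is the one that worries me: the model says nothing about whether what the loop finally accepts is the right data, and that is the open question.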

> It's been 10 months since I asked Christian to help isolate a
> reproducer so we can track this down. Nothing came from that, so
> we're still at exactly where we were at back in November 2023 -
> waiting for information on a way to reproduce this issue more
> reliably.

Sorry for dropping the ball on my side as well - I’ve learned my lesson about trying to go through Bugzilla here. ;)

You mentioned above that this might involve the read-ahead code, and that’s something I noticed before: the machines that carry databases run with a higher read-ahead setting (1 MiB vs. 128 KiB elsewhere).
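
For anyone who wants to compare read-ahead settings on their own machines, a minimal sketch follows. It assumes the per-device value is exposed in `/sys/block/<dev>/queue/read_ahead_kb`; the helper name `read_ahead_settings` is hypothetical, and the sysfs root is a parameter only so the function can be pointed at a test directory:

```python
from pathlib import Path


def read_ahead_settings(sys_block="/sys/block"):
    """Return {device name: read-ahead in KiB} for devices under sys_block."""
    settings = {}
    for entry in sorted(Path(sys_block).glob("*/queue/read_ahead_kb")):
        try:
            # Path layout is <sys_block>/<device>/queue/read_ahead_kb,
            # so the device name is the third component from the end.
            settings[entry.parts[-3]] = int(entry.read_text().strip())
        except (OSError, ValueError):
            # Skip devices whose value is unreadable or not a number.
            continue
    return settings


if __name__ == "__main__":
    for device, kib in read_ahead_settings().items():
        print(f"{device}: {kib} KiB")
```

On the database machines described above this would be expected to report 1024 for the affected devices, and 128 elsewhere.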

Also, I’m still puzzled about the one variation that seems to involve page faults and not XFS. I haven’t seen a response yet on whether this is in fact interesting or not.

Christian

-- 
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick