Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

Christian Theune <ct@xxxxxxxxxxxxxxx> · Wed, 18 Sep 2024 10:31:09 +0200

> On 16. Sep 2024, at 09:14, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
> 
>> 
>> On 16. Sep 2024, at 02:00, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> 
>> I don't think this is a data corruption/loss problem - it certainly
>> hasn't ever appeared that way to me.  The "data loss" appeared to be
>> in incomplete postgres dump files after the system was rebooted and
>> this is exactly what would happen when you randomly crash the
>> system. i.e. dirty data in memory is lost, and application data
>> being written at the time is in an inconsistent state after the
>> system recovers. IOWs, there was no clear evidence of actual data
>> corruption occuring, and data loss is definitely expected when the
>> page cache iteration hangs and the system is forcibly rebooted
>> without being able to sync or unmount the filesystems…
>> All the hangs seem to be caused by folio lookup getting stuck
>> on a rogue xarray entry in truncate or readahead. If we find an
>> invalid entry or a folio from a different mapping or with a
>> unexpected index, we skip it and try again.  Hence this does not
>> appear to be a data corruption vector, either - it results in a
>> livelock from endless retry because of the bad entry in the xarray.
>> This endless retry livelock appears to be what is being reported.
>> 
>> IOWs, there is no evidence of real runtime data corruption or loss
>> from this pagecache livelock bug.  We also haven't heard of any
>> random file data corruption events since we've enabled large folios
>> on XFS. Hence there really is no evidence to indicate that there is
>> a large folio xarray lookup bug that results in data corruption in
>> the existing code, and therefore there is no obvious reason for
>> turning off the functionality we are already building significant
>> new functionality on top of.

I’ve been chewing more on this and reviewed the tickets I have. We did see a PostgreSQL database ending up reporting "ERROR: invalid page in block 30896 of relation base/16389/103292”. 

My understanding of the argument that this bug does not corrupt data is that the error would only lead to a crash-consistent state. So applications that can properly recover from a crash-consistent state would only experience data loss to the point of the crash (which is fine and expected) but should not end up in a further corrupted state.

PostgreSQL reporting this error indicates - to my knowledge - that it did not see a crash consistent state of the file system.

Christian

-- 
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick