Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

Hello everyone,

I’d like to raise awareness of a bug causing data loss somewhere in the interaction between MM and XFS that seems to have been present since Dec 2021 (https://github.com/torvalds/linux/commit/6795801366da0cd3d99e27c37f020a8f16714886).

We started encountering this bug after upgrading to 6.1 around June 2023, and we have seen at least 16 instances of data loss across a fleet of 1.5k VMs.

This bug is very hard to reproduce, but it has been known to exist as a “fluke” for a while already. I have invested a number of days trying to come up with workloads that trigger it faster than the stochastic “once every few weeks in a fleet of 1.5k machines”, but so far it eludes me. I know that this also affects Facebook/Meta as well as Cloudflare, who are both running newer kernels (at least 6.1, 6.6, and 6.9) with the above-mentioned patch reverted. Seeing companies of that size run with the patch reverted, which makes their kernels essentially untested/unsupported deviations from mainline, smells like desperation. I’m with a much smaller team and company, and I’m wondering why this isn’t being tackled more urgently, with more hands to (hopefully) make the bug shallow.

The issue appears to happen mostly on nodes running some kind of database or other storage-heavy load. In our case we see it with PostgreSQL and MySQL. Cloudflare, IIRC, saw it with a RocksDB workload, and Meta is talking about nfsd load.

I suspect that low memory (low, but not OOM-level low), memory pressure, and maybe swap activity increase the chance of triggering it, but I might be completely wrong about that suspicion.
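
To give an idea of the direction I have been probing: below is a rough sketch of the kind of stress workload I have been experimenting with. To be clear, this is NOT a known reproducer, and the mount point, file counts, sizes and thread counts are arbitrary placeholders. Each file is filled with its own byte pattern and verified on buffered reads, so data from another file showing up would be caught immediately, while a separate thread keeps memory tight to force reclaim and refaults.

/*
 * Sketch only, not a known reproducer. All constants below are
 * arbitrary placeholders; DIR must be a directory on an XFS mount.
 * Build with: gcc -O2 -pthread stress.c -o stress
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define DIR      "/mnt/xfs/stress"   /* placeholder: directory on XFS */
#define NFILES   16                  /* placeholder: files per I/O thread */
#define FILESIZE (64 << 20)          /* placeholder: 64 MiB per file */
#define CHUNK    (1 << 20)           /* 1 MiB per write/read */
#define PRESSURE (1UL << 30)         /* placeholder: 1 GiB of anon memory */
#define NTHREADS 8                   /* placeholder: number of I/O threads */

static void *io_worker(void *arg)
{
	long id = (long)arg;
	unsigned char *buf = malloc(CHUNK);
	char path[256];

	if (!buf) { perror("malloc"); exit(1); }

	for (;;) {
		for (int n = 0; n < NFILES; n++) {
			/* every file gets its own byte pattern */
			unsigned char pattern = (unsigned char)(id * NFILES + n + 1);

			snprintf(path, sizeof(path), "%s/f%ld-%d", DIR, id, n);
			int fd = open(path, O_CREAT | O_RDWR, 0644);
			if (fd < 0) { perror("open"); exit(1); }

			/* buffered writes: fill the file with its pattern */
			memset(buf, pattern, CHUNK);
			for (off_t off = 0; off < FILESIZE; off += CHUNK)
				if (pwrite(fd, buf, CHUNK, off) != CHUNK) { perror("pwrite"); exit(1); }

			/* buffered reads: any other byte means foreign or stale
			 * data ended up in this file's page cache */
			for (off_t off = 0; off < FILESIZE; off += CHUNK) {
				if (pread(fd, buf, CHUNK, off) != CHUNK) { perror("pread"); exit(1); }
				for (size_t i = 0; i < CHUNK; i++)
					if (buf[i] != pattern) {
						fprintf(stderr, "MISMATCH %s @%lld+%zu: %#x != %#x\n",
							path, (long long)off, i, buf[i], pattern);
						abort();
					}
			}
			close(fd);
		}
	}
	return NULL;
}

static void *pressure_worker(void *arg)
{
	(void)arg;
	for (;;) {
		/* keep reclaim busy so page cache folios get evicted and refaulted */
		volatile char *p = malloc(PRESSURE);
		if (!p)
			continue;
		for (size_t i = 0; i < PRESSURE; i += 4096)
			p[i] = 1;
		free((void *)p);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&t, NULL, io_worker, (void *)i);
	pthread_create(&t, NULL, pressure_worker, NULL);
	pause();
	return 0;
}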

There is a bug report I opened back then: https://bugzilla.kernel.org/show_bug.cgi?id=217572 and there have been discussions on the XFS list: https://lore.kernel.org/lkml/CA+wXwBS7YTHUmxGP3JrhcKMnYQJcd6=7HE+E1v-guk01L2K3Zw@xxxxxxxxxxxxxx/T/ but ultimately this didn’t receive enough interest to keep it moving forward, and I ran out of steam. Unfortunately we can’t stay stuck on 5.15 forever; other kernel developers correctly keep pointing out that we should be updating, but that isn’t an option as long as this time bomb exists.

Jens pointed out that Meta's findings and their notes on the revert included "When testing nfsd on top of v5.19, we hit lockups in filemap_read(). These ended up being because the xarray for the files being read had pages from other files mixed in."
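
In other words, every folio the page cache hands back for a read of a given file should point at that file’s own address_space, and Meta’s observation is that this invariant was violated. Purely as an illustration, here is an untested, hypothetical fragment of the kind of sanity check one could imagine in the filemap_read() batch loop (not a proposed patch, just to spell out the invariant):

	/* Illustrative, untested fragment -- not a patch. Inside the
	 * filemap_read() batch loop, every folio returned for this file
	 * should point back at the file's own address_space. */
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	unsigned int i;

	for (i = 0; i < folio_batch_count(&fbatch); i++) {
		struct folio *folio = fbatch.folios[i];

		/* With the reported corruption this would fire: the folio
		 * belongs to a different file (or to no file at all). */
		WARN_ONCE(folio->mapping != mapping,
			  "folio index %lu belongs to another mapping\n",
			  folio->index);
	}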

I know and admire XFS for the very high standards its developers hold regarding testing and avoiding data loss, but ultimately that doesn’t matter if we’re going to be stuck with this bug forever.

I’m able to help fund efforts, help create a reproducer, generally donate my time (I’m not a kernel developer myself), and even provide access to machines that have seen the crash (but don’t carry customer data), yet I’m not making any progress or getting any traction here.

Jens encouraged me to raise visibility this way, so that’s what I’m trying here.

Please help.

In appreciation of all the hard work everyone is putting in and with hugs and love,
Christian

-- 
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick





