Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

Chris Mason <clm@xxxxxxxx> · Fri, 13 Sep 2024 11:30:41 -0400

On 9/12/24 6:25 PM, Linus Torvalds wrote:
> On Thu, 12 Sept 2024 at 15:12, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>
>> When I saw Christian's report, I seemed to recall that we ran into this
>> at Meta too. And we did, and hence have been reverting it since our 5.19
>> release (and hence 6.4, 6.9, and 6.11 next). We should not be shipping
>> things that are known broken.
> 
> I do think that if we have big sites just reverting it as known broken
> and can't figure out why, we should do so upstream too.

I've mentioned this in the past to both Willy and Dave Chinner, but so
far all of my attempts to reproduce it on purpose have failed.  It's
awkward because I don't like to send bug reports that I haven't
reproduced on a non-facebook kernel, but I'm pretty confident this bug
isn't specific to us.

I'll double down on repros again during plumbers and hopefully come up
with a recipe for explosions.  One other important datapoint is that we
also enable huge folios via tmpfs mount -o huge=within_size.

That hasn't hit problems, and we've been doing it for years, but of
course the tmpfs usage is pretty different from iomap/xfs.

We have two workloads that have reliably seen large folio bugs in prod.
This is all on bare metal systems, some are two socket, some single,
nothing really exotic.

1) On 5.19 kernels, knfsd reading and writing to XFS.  We needed
O(hundreds) of knfsd servers running for about 8 hours to see one hit.

The issue looked similar to Christian Theune's rcu stalls, but since it
was just one CPU spinning away, I was able to perf probe and drgn my way
to some details.  The xarray for the file had a series of large folios:

[ index 0: large folio from the correct file ]
[ index 1: large folio from the correct file ]
...
[ index N: large folio from a completely different file ]
[ index N+1: large folio from the correct file ]

I'm being sloppy with index numbers, but the important part is that
we've got a large folio from the wrong file in the middle of the bunch.

filemap_read() iterates over batches of folios from the xarray, but if
one of the folios in the batch has folio->index out of order with the
rest, the whole thing turns into an infinite loop.  It's not really a
filemap_read() bug; the batch coming back from the xarray is just
incorrect.
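
To make the broken invariant concrete, here is a toy Python model of the
check I'd want to hold for a batch: every folio belongs to the file being
read, and indexes only move forward.  This is only a sketch, not kernel
code and not the drgn script I used; Folio, find_bad_folios() and the
"fileA"/"fileB" names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Folio:
    # Hypothetical stand-ins for the kernel fields of interest.
    mapping: str   # which file's address_space owns the folio
    index: int     # first page index covered by the folio
    nr_pages: int  # number of pages in the folio


def find_bad_folios(batch, mapping):
    """Return (position, reason) for every folio that breaks the invariant.

    A sane batch only holds folios from `mapping`, and each folio starts
    at or after the end of the previous good one.
    """
    bad = []
    next_index = None  # end of the last good folio seen so far
    for pos, folio in enumerate(batch):
        if folio.mapping != mapping:
            bad.append((pos, "foreign mapping"))
            continue
        if next_index is not None and folio.index < next_index:
            bad.append((pos, "index out of order"))
            continue
        next_index = folio.index + folio.nr_pages
    return bad


# Same shape as the xarray dump above: one folio from another file
# sitting in the middle of otherwise correct entries.
batch = [
    Folio("fileA", 0, 4),
    Folio("fileA", 4, 4),
    Folio("fileB", 96, 4),
    Folio("fileA", 12, 4),
]
print(find_bad_folios(batch, "fileA"))  # [(2, 'foreign mapping')]
```

A reader that trusts the batch without a check like this has no sane way
to advance past the bad entry, which matches the one CPU spinning.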

2) On 6.9 kernels, we saw a BUG_ON() during inode eviction because
mapping->nrpages was non-zero.  I'm assuming it's really just a
different window into the same bug.  Crash dump analysis was less
conclusive because the xarray itself was always empty, but turning off
large folios made the problem go away.

This happened ~5-10 times a day, and the service had a few thousand
machines running 6.9.  If I can't make an artificial repro, I'll try and
talk the service owners into setting up a production shadow to hammer on
it with additional debugging.

We also disabled large folios for our 6.4 kernel, but Stefan actually
tracked that bug down:

commit a48d5bdc877b85201e42cef9c2fdf5378164c23a
Author: Stefan Roesch <shr@xxxxxxxxxxxx>
Date:   Mon Nov 6 10:19:18 2023 -0800

    mm: fix for negative counter: nr_file_hugepages

We didn't have time to revalidate with large folios back on afterwards.

-chris