Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

Chris Mason <clm@xxxxxxxx> · Tue, 17 Sep 2024 11:36:51 +0200

On 9/17/24 5:32 AM, Matthew Wilcox wrote:
> On Mon, Sep 16, 2024 at 10:47:10AM +0200, Chris Mason wrote:
>> I've got a bunch of assertions around incorrect folio->mapping and I'm
>> trying to bash on the ENOMEM for readahead case.  There's a __GFP_NOWARN
>> on those, and our systems do run pretty short on RAM, so it feels right
>> at least.  We'll see.
> 
> I've been running with some variant of this patch the whole way across
> the Atlantic, and not hit any problems.  But maybe with the right
> workload ...?
> 
> There are two things being tested here.  One is whether we have a
> cross-linked node (i.e. a node that's in two trees at the same time).
> The other is whether the slab allocator is giving us a node that already
> contains non-NULL entries.
> 
> If you could throw this on top of your kernel, we might stand a chance
> of catching the problem sooner.  If it is one of these problems and not
> something weirder.
> 
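
For anyone following along, the two conditions described above can be
illustrated with a small userspace sketch.  This is not the patch under
discussion and not the kernel's xarray code; every name here (toy_node,
toy_tree, TOY_SLOTS, the owner field) is hypothetical and exists only to
show the shape of the two checks: a node must not be linked into two
trees at once, and a freshly allocated node must contain only NULL
slots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 16	/* hypothetical slot count, for illustration only */

struct toy_tree;

/* Toy stand-in for a tree node; not the kernel's struct xa_node. */
struct toy_node {
	struct toy_tree *owner;		/* tree this node is linked into, NULL if free */
	void *slots[TOY_SLOTS];
};

struct toy_tree {
	struct toy_node *root;
};

/* Check 2: a node handed out by the allocator must have only NULL slots. */
static bool toy_node_is_clean(const struct toy_node *node)
{
	for (size_t i = 0; i < TOY_SLOTS; i++)
		if (node->slots[i] != NULL)
			return false;
	return true;
}

/* Check 1: a node reached while walking 'tree' must belong to 'tree'. */
static bool toy_node_belongs_to(const struct toy_node *node,
				const struct toy_tree *tree)
{
	return node->owner == tree;
}

/* Link a node as the root, refusing cross-linked or dirty nodes. */
static bool toy_tree_link(struct toy_tree *tree, struct toy_node *node)
{
	if (node->owner != NULL)	/* already in a tree: would be cross-linked */
		return false;
	if (!toy_node_is_clean(node))	/* allocator returned stale entries */
		return false;
	node->owner = tree;
	tree->root = node;
	return true;
}

/* Small demonstration of both checks; returns 0 if they behave as described. */
static int toy_demo(void)
{
	struct toy_tree a = { 0 }, b = { 0 };
	struct toy_node n = { 0 }, dirty = { 0 };
	int payload = 0;

	assert(toy_tree_link(&a, &n));		/* clean, unlinked node is accepted */
	assert(toy_node_belongs_to(&n, &a));
	assert(!toy_tree_link(&b, &n));		/* second tree is rejected */

	dirty.slots[3] = &payload;
	assert(!toy_tree_link(&b, &dirty));	/* non-NULL slot is rejected */
	return 0;
}
```

In this sketch the checks reject the bad node at link time; a debugging
patch would more likely warn or BUG when either condition is seen, so
that the first bad state is reported rather than a later crash.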

I was able to corrupt the xarray one time, hitting a crash during
unmount.  It wasn't the XFS filesystem I was actually hammering, so I
guess that tells us something, but it took ~3 hours of stress runs,
so it's not really a useful reproducer.

I'll try with your patch as well.

-chris