Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

Chris Mason <clm@xxxxxxxx> · Fri, 11 Oct 2024 09:06:00 -0400

On 10/11/24 5:08 AM, Christian Theune wrote:
> 
>> On 11. Oct 2024, at 09:27, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
>>
>> I’m going to gather a few more instances during the day and will post them as a batch later.
> 
> I’ve received 8 alerts in the last hours and managed to get detailed, repeated walker output from two of them:
> 
> - FC-41287.log
> - FC-41289.log

These are really helpful.

If io throttling were the cause, the traces should also have a process
that's waiting to submit the IO, but that's not present here.

Another common pattern is hung tasks with a process stuck in the kernel
burning CPU, but holding a lock or being somehow responsible for waking
the hung task.  Your process listings don't have that either.

One part I wanted to mention:

[820710.974122] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings

By default you only get 10 hung task notifications per boot (that's the
default value of kernel.hung_task_warnings), and after that they are
suppressed. So for example, if you're watching a count of hung task
messages across a lot of machines and thinking that things are pretty
stable because you're not seeing hung task messages anymore... the
kernel might have just stopped complaining.

This isn't exactly new kernel behavior, but it can be a surprise.
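
A rough sketch of how you could check this on each machine, in case it
helps.  It assumes the sysctl is exposed at the usual procfs path, and
the helper name is made up for illustration.  The value is the number of
reports the kernel will still print: 0 means suppressed, -1 means no
limit.

```python
from pathlib import Path

# Usual procfs location of the kernel.hung_task_warnings sysctl
# (only present when the hung task detector is built in).
SYSCTL_PATH = Path("/proc/sys/kernel/hung_task_warnings")


def describe_hung_task_budget(raw: str) -> str:
    """Interpret the sysctl value (hypothetical helper, for illustration)."""
    remaining = int(raw.strip())
    if remaining == 0:
        return "suppressed: the kernel will not print further hung task reports"
    if remaining < 0:
        return "unlimited: every hung task is reported"
    return f"{remaining} report(s) left before suppression"


if __name__ == "__main__":
    if SYSCTL_PATH.exists():
        print(describe_hung_task_budget(SYSCTL_PATH.read_text()))
    else:
        print("hung task detector not available on this kernel")
```

If a machine reports "suppressed", writing a new value to the sysctl
(for example -1 for no limit) should get the reports flowing again.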

Anyway, this leaves me with ~3 theories:

- Linus's starvation observation.  It doesn't feel like there's enough
load to cause this, especially since we're sitting in truncate, where it
should be pretty unlikely to have multiple procs banging on the page in
question.

- Willy's folio->mapping check idea.  I _think_ this is also wrong: the
reference counts we have in the truncate path check folio->mapping
before returning, and we shouldn't be able to reuse the folio in a
different mapping while we have the reference held.

If this is the problem it would mean our original bug is slightly
unfixed.  But the fact that you're not seeing other problems, and that
these hung tasks do resolve, should mean we're ok.  We can add a printk
or just run a drgn script to check.

- It's actually taking the IO a long time to finish.  We can poke at the
pending requests.  How does the device look in the VM (virtio, scsi,
etc.)?
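
For the last theory, one low-effort way to look at pending requests from
inside the VM is the per-device inflight file in sysfs.  This is only a
sketch: it assumes /sys/block/<dev>/inflight is present and holds two
counters (reads, then writes), and the function names are made up for
illustration.  Counts that stay non-zero while a task is hung would
point at the IO really being slow.

```python
from pathlib import Path


def parse_inflight(text: str) -> tuple[int, int]:
    """Parse the contents of an inflight file into (reads, writes)."""
    reads, writes = (int(field) for field in text.split())
    return reads, writes


def snapshot(sys_block: Path = Path("/sys/block")) -> dict[str, tuple[int, int]]:
    """Collect in-flight counts for every block device that exposes them."""
    result = {}
    for inflight in sorted(sys_block.glob("*/inflight")):
        result[inflight.parent.name] = parse_inflight(inflight.read_text())
    return result


if __name__ == "__main__":
    # Device names give a first hint at the transport, e.g. vd* is
    # typically virtio-blk and sd* is typically SCSI.
    for dev, (reads, writes) in snapshot().items():
        print(f"{dev}: {reads} reads, {writes} writes in flight")
```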

-chris