Re: Data corruption with XFS on Debian 11 and 12 under heavy load.

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 30 Aug 2023 07:54:39 +1000

On Tue, Aug 29, 2023 at 06:15:36PM +0100, Jose M Calhariz wrote:
> 
> Hi,
> 
> I have been chasing a data corruption problem under heavy load on 4
> servers in my care.  At first I suspected a hardware problem
> because it only happens with RAID 6 disks.  So I reported to Debian:
> 
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1032391

Summary: corruption on HW RAID6, not on a separate HW RAID1 volume
on the same controller.

A firmware update of the HW RAID controller made on-disk corruption
on RAID6 volumes go away, but weird compiler failures still occurred,
indicating data corruption was likely still occurring.

Updating kernel to "bookworm" which runs a 6.3 kernel didn't fix the
problem.

This smells of corruption occurring on read IO, not on write IO, and
is likely a hardware-related problem given the change of behaviour
with a firmware update.
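
One rough way to test the read-side theory is to hash the same file
several times and see whether the digest changes while nothing is
writing to it. The sketch below is only an illustration, not something
from either bug report: the function names are made up, and the
posix_fadvise() call is merely a hint asking the kernel to drop cached
pages so that later passes are more likely to hit the storage. It is
not guaranteed to bypass the page cache.

```python
import hashlib
import os


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
        # Advisory only: ask the kernel to drop cached pages for this
        # file so the next pass is more likely to read from storage.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
    return digest.hexdigest()


def repeated_read_digests(path, passes=5):
    """Hash the same file several times and return every digest."""
    return [hash_file(path) for _ in range(passes)]


def reads_are_stable(path, passes=5):
    """True if every pass over the file produced the same digest."""
    return len(set(repeated_read_digests(path, passes))) == 1
```

If a file that nobody is modifying produces differing digests between
passes, the data is changing somewhere on the read path rather than
being written incorrectly.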

> Further research pointed to XFS being the common pattern, not a
> hardware issue.  So I made an informal query to a friend in a software
> house that relies heavily on XFS about his thought on this issue.  He
> made reference to several problems fixed on kernel 6.2 and a
> discussion on this mailing list about back porting the fixes to 6.1
> kernel.

I can't think of any bug fix we've been talking about backporting to
6.1 that might fix a data corruption? Anything that is a known data
corruption fix normally gets backported pretty quickly (e.g. the
corruption that could be triggered in 6.3.0-6.3.4 kernels had the
fix backported into 6.3.5 as soon as we identified the cause).

> With this information I have tried the latest kernel at that time on
> Debian testing over Debian v12 and I could not reproduce the
> problem.  So I made another bug report:
> 
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1040416

Your test case of make -j 4096 fails on 6.1.27 but does not fail on
6.3.7, which is different behaviour to the above bug. This time you
have a kernel log from the hung task timer that indicates XFS appears
to be hung up waiting for an AGI lock during inode allocation.

This does not indicate any sort of corruption is occurring - it
means either the storage is really slow (i.e. waiting for IO
completion on either the AGI, or IO completion on whatever is
holding the AGI lock) or there has been a deadlock of some kind.
Either way, this sort of thing is not an indication of data corruption.

You also don't mention what storage hardware this is on - is this
still on the HW RAID6 volumes that were causing issues that you
reported in the first bug above?

----

There's really nothing in either of these bug reports that indicates
that XFS is the root cause, whilst there's plenty of anecdotal
evidence from the first bug to point at storage hardware
problems being the cause.

So, which of these problems is easiest to reproduce on your
machines? Pick one of them and:

- describe the storage hardware stack (BBWC, RAID, caching strategy)
- describe the storage software stack (drbd, lvm, xfs_info for the
  filesystem, etc)
- cpus, memory, etc
- example of a corrupt data file vs a good file (i.e. what is the
  corrupt data that is appearing in the corrupt .o files?)
- find the minimum storage stack that reproduces the problem, and
  determine if the problem reproduces across different storage
  hardware in the same machine.
- if you have known bad and known good kernels, run a bisect and see
  where the problem goes away (e.g. which -rcX kernel between good
  and bad results in the problem going away).
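
For the corrupt vs good file comparison, something along these lines
would show where the two files differ and what the bad data looks
like. This is a minimal sketch under stated assumptions: the function
names are hypothetical, the 4096 byte block size is an assumed default
that should be replaced with the real filesystem block size or RAID
stripe unit, and both files are read fully into memory, which is fine
for .o files but not for very large ones.

```python
import sys


def diff_ranges(good, bad):
    """Return (start, end) byte ranges, end exclusive, where the two
    byte strings differ. A length mismatch counts as a difference."""
    ranges = []
    start = None
    n = min(len(good), len(bad))
    for i in range(n):
        if good[i] != bad[i]:
            if start is None:
                start = i
        elif start is not None:
            ranges.append((start, i))
            start = None
    if len(good) != len(bad):
        # Extend any open range (or open a new one) to the longer file.
        if start is None:
            start = n
        ranges.append((start, max(len(good), len(bad))))
    elif start is not None:
        ranges.append((start, n))
    return ranges


def describe(good, bad, block_size=4096):
    """Summarise each differing range: offset, length, whether it is
    aligned to block_size, and whether the bad data is all zeroes."""
    lines = []
    for start, end in diff_ranges(good, bad):
        aligned = start % block_size == 0 and end % block_size == 0
        chunk = bad[start:end]
        all_zero = len(chunk) > 0 and not any(chunk)
        lines.append(
            f"offset={start} length={end - start} "
            f"block_aligned={aligned} all_zero={all_zero} "
            f"bad={chunk[:16].hex()}"
        )
    return lines


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], "rb") as g, open(sys.argv[2], "rb") as b:
        for line in describe(g.read(), b.read()):
            print(line)
```

Run it as "python3 script.py good.o bad.o" (the script name is a
placeholder). Block-aligned runs of zeroes or of data belonging to
some other file tend to point at the storage stack, whereas scattered
single-bit or single-byte differences look more like memory or bus
problems.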

> My questions to this mailing list:
> 
>   - Has anyone experienced corruption under heavy load on XFS,
>   under Debian or with vanilla kernels?

No.

I do long term kernel soak testing on my main workstation with
Debian kernels (i.e. months of uptime, daily use with hundreds of
browser tabs, tens of terminals, multiple VMs, lots of source tree
work), all on XFS filesystems. I've been running this kernel:

Linux devoid 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux

on this machine for some months.

>   - Should I stop waiting for the fixes to be backported to vanilla
>   6.1 and run the latest kernel from Debian testing anyway?  Note
>   that kernels from testing get fewer timely security updates
>   than stable kernels, especially for security issues with limited
>   disclosure.

There's nothing to "fix" or backport until we've done root cause
analysis on the failures and identified what is actually causing
your systems to fail.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx