On Tue, Aug 29, 2023 at 06:15:36PM +0100, Jose M Calhariz wrote:
>
> Hi,
>
> I have been chasing a data corruption problem under heavy load on 4
> servers that I have at my care.  First I thought of an hardware
> problem because it only happen with RAID 6 disks.  So I reported to Debian:
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1032391

Summary: corruption on HW RAID6, not on a separate HW RAID1 volume
on the same controller. A firmware update of the HW RAID controller
made the on-disk corruption on RAID6 volumes go away, but weird
compiler failures still occurred, indicating data corruption was
likely still occurring. Updating the kernel to "bookworm", which runs
a 6.3 kernel, didn't fix the problem.

This smells of corruption occurring on read IO, not on write IO, and
likely a hardware related problem given the change of behaviour with
a firmware update.

> Further research pointed to be the XFS the common pattern, not an
> hardware issue.  So I made an informal query to a friend in a software
> house that relies heavily on XFS about his thought on this issue.  He
> made reference to several problems fixed on kernel 6.2 and a
> discussion on this mailing list about back porting the fixes to 6.1
> kernel.

I can't think of any bug fix we've been talking about backporting to
6.1 that might fix a data corruption. Anything that is a known data
corruption fix normally gets backported pretty quickly (e.g. the
corruption that could be triggered in 6.3.0-6.3.4 kernels had the
fix backported into 6.3.5 as soon as we identified the cause).

> With this information I have tried the latest kernel at that time on
> Debian testing over Debian v12 and I could not reproduce the
> problem.  So I made another bug report:
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1040416

Your test case of make -j 4096 fails on 6.1.27 but does not fail on
6.3.7, which is different behaviour to the above bug.
This time you have a kernel log from the hung task timer that
indicates XFS appears to be hung up waiting for an AGI lock during
inode allocation. This does not indicate any sort of corruption is
occurring - it means either the storage is really slow (i.e. waiting
for IO completion on either the AGI, or IO completion on whatever is
holding the AGI lock) or there has been a deadlock of some kind.
Either way, this sort of thing is not an indication of data
corruption.

You also don't mention what storage hardware this is on - is this
still on the HW RAID6 volumes that were causing issues that you
reported in the first bug above?

----

There's really nothing in either of these bug reports that indicates
that XFS is the root cause, whilst there's plenty of anecdotal
evidence from the first bug to point at storage hardware problems
being the cause.

So, which of these problems is easiest to reproduce on your
machines? Pick one of them and:

- describe the storage hardware stack (BBWC, RAID, caching strategy)
- describe the storage software stack (drbd, lvm, xfs_info for the
  filesystem, etc)
- cpus, memory, etc
- example of a corrupt data file vs a good file (i.e. what is the
  corrupt data that is appearing in the corrupt .o files?)
- find the minimum storage stack that reproduces the problem, and
  determine if the problem reproduces across different storage
  hardware in the same machine.
- if you have known bad and known good kernels, run a bisect and see
  where the problem goes away (e.g. which -rcX kernel between good
  and bad results in the problem going away).

> My questions to this mailing list:
>
>   - Have anyone experienced under Debian or with vanilla kernels
>   corruption under heavy load on XFS?

No. I do long term kernel soak testing with my main workstation with
Debian kernels (i.e. months of uptime, daily use with hundreds of
browser tabs, tens of terminals, multiple VMs, lots of source tree
work, all on XFS filesystems).
I've been running this kernel:

Linux devoid 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux

on this machine for some months.

>   - Should I stop waiting for the fixes being back ported to vanilla
>   6.1 and run the latest kernel from Debian testing anyway?  Taking
>   notice that kernels from testing have less security updates on time
>   than stable kernels, specially security issues with limited
>   disclosure.

There's nothing to "fix" or backport until we've done root cause
analysis on the failures and identified what is actually causing
your systems to fail.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
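
For the "corrupt data file vs a good file" item in the list above, a
rough sketch of one way to gather that information is below. It is
only an illustration, not an XFS tool: the names diff_runs, describe
and BLOCK are made up for this example, and the 4096 byte block size
is an assumption that should be checked against xfs_info output. It
reports which byte ranges differ between a known good and a corrupt
copy of the same file and whether each range is block aligned.

```python
BLOCK = 4096  # assumed filesystem block size; confirm with xfs_info


def diff_runs(good_path, bad_path, chunk_size=1 << 20):
    """Return [(start, end), ...] byte ranges (end exclusive) where the
    two files differ. Bytes past the end of the shorter file count as
    differing."""
    runs = []
    start = None  # start offset of the differing run being extended
    offset = 0    # file offset of the chunk currently being compared
    with open(good_path, "rb") as good, open(bad_path, "rb") as bad:
        while True:
            a = good.read(chunk_size)
            b = bad.read(chunk_size)
            if not a and not b:
                break
            n = max(len(a), len(b))
            if a == b:
                # Identical chunk: close any run left open by the previous one.
                if start is not None:
                    runs.append((start, offset))
                    start = None
            else:
                for i in range(n):
                    same = i < len(a) and i < len(b) and a[i] == b[i]
                    if not same and start is None:
                        start = offset + i
                    elif same and start is not None:
                        runs.append((start, offset + i))
                        start = None
            offset += n
    if start is not None:
        runs.append((start, offset))
    return runs


def describe(runs, block=BLOCK):
    """Format each differing range with its length and block alignment."""
    lines = []
    for start, end in runs:
        aligned = start % block == 0 and end % block == 0
        lines.append(
            f"{start:#x}-{end:#x} len={end - start} block_aligned={aligned}"
        )
    return lines
```

Calling describe(diff_runs("good.o", "bad.o")) on a matching pair of
object files (the file names here are placeholders) gives one line
per damaged range. Ranges that start and end on block or stripe unit
boundaries would tend to point at the storage stack rather than at
the compiler or memory, but that is a heuristic, not proof.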