Re: [general question] rare silent data corruption when writing data

Chris Dunlop <chris@xxxxxxxxxxxx> · Thu, 14 May 2020 10:39:54 +1000

On Wed, May 13, 2020 at 01:49:10PM -0400, John Stoffel wrote:
> I wonder if this problem can be replicated on loop devices?  Once
> there's a way to cause it reliably, we can then start doing a
> bisection of the kernel to try and find out where this is happening.

I spent a week or so trying to replicate the problem in a VM on loop
devices mirroring the LVM/RAID config, without success. The workload was
basically a random number (1-25) of concurrent writers banging out
middling to largish files.
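
A minimal sketch of that kind of test, for anyone wanting to try something
similar. This is not the harness actually used above; the function names
(write_file, hash_file, run_round), the 1 MiB write size and the use of
SHA-256 are all assumptions made for illustration. Each writer records a
checksum of the data as it is written, and the files are then read back
and compared:

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # 1 MiB per write call (arbitrary, assumed value)


def write_file(path, size):
    """Write `size` random bytes to `path`, fsync, return SHA-256 of the data written."""
    h = hashlib.sha256()
    remaining = size
    with open(path, "wb") as f:
        while remaining > 0:
            block = os.urandom(min(CHUNK, remaining))
            h.update(block)
            f.write(block)
            remaining -= len(block)
        f.flush()
        os.fsync(f.fileno())  # push the data out of the page cache to the device
    return h.hexdigest()


def hash_file(path):
    """Return the SHA-256 of the file as currently read back."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()


def run_round(directory, writers, size):
    """Run `writers` concurrent writers, return paths whose read-back hash differs."""
    paths = [os.path.join(directory, f"w{i}.bin") for i in range(writers)]
    with ThreadPoolExecutor(max_workers=writers) as pool:
        expected = list(pool.map(lambda p: write_file(p, size), paths))
    return [p for p, want in zip(paths, expected) if hash_file(p) != want]
```

Something like run_round("/mnt/test", 25, 8 << 30) in a loop would approximate
the heavier end of the workload (the path and size are placeholders). Note
that a read-back immediately after writing is likely to be served from the
page cache, so to check what actually reached the disk the caches would need
to be dropped (or the filesystem remounted) before calling hash_file.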

The fact that it wasn't replicable in that environment could point
towards the LSI driver or hardware, or I simply wasn't able to match
the conditions well enough.

> So far, it looks like it happens sometimes on bare RAID6 systems
> without lv-thin in place, which is both good and bad.  And without
> using VMs on top of the storage either.  So this helps narrow down the
> cause.

Note: we don't have any bare RAID6, so I haven't seen it there. Our main fs
is xfs on sequential LVM on raid6 (6 x 11-disk sets), and we saw it once
on xfs directly on an HDD partition.

> Is there any info on the work load on these systems?  Lots of small
> files which are added/removed?  Large files which are just written to
> and not touched again?

Large files written and not touched again. Most of the time 2-5 concurrent 
writers but regularly (daily) up to 20-25 concurrent.

> I assume finding a bad file with corruption and then doing a cp of the
> file keeps the same corruption?

Yep.