Re: [general question] rare silent data corruption when writing data

Michal Soltys <msoltyspl@xxxxxxxxx> · Wed, 20 May 2020 22:29:29 +0200

On 20/05/13 08:31, Chris Dunlop wrote:
Hi,

"Me too!"

We are seeing 256-byte corruptions which are always the last 256b of a 
4K block. The 256b is very often a copy of a "last 256b of 4k block" 
from earlier in the file. We sometimes see multiple corruptions in the 
same file, with each of the corruptions being a copy of a different 256b 
from earlier in the file. The original 256b and the copied 256b aren't 
identifiably at a regular offset from each other. Where the 256b isn't a 
copy from earlier in the file

I'd be really interested to hear if your problem is just in the last 
256b of the 4k block also!

From what I have checked, in my case it has always been a full 4k page.
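
For anyone who wants to check their own files for the two patterns 
discussed here, below is a rough sketch in Python. It assumes you have a 
known-good copy of the file to compare against. The function names and 
the "tail-256" / "wider" labels are made up for illustration; this is 
not the tooling either of us actually used.

```python
BLOCK = 4096   # block/page size discussed in this thread
TAIL = 256     # size of the corrupted tail in Chris's case


def classify_corruption(good: bytes, bad: bytes):
    """Compare two copies block by block.

    Returns a list of (block_index, kind) for every 4K block that
    differs, where kind is "tail-256" if all differing bytes lie in
    the last 256 bytes of the block, and "wider" otherwise.
    """
    results = []
    n = min(len(good), len(bad))
    for start in range(0, n, BLOCK):
        g = good[start:start + BLOCK]
        b = bad[start:start + BLOCK]
        if g == b:
            continue
        # Offset of the first differing byte within this block.
        first_diff = next(i for i in range(len(g)) if g[i] != b[i])
        kind = "tail-256" if first_diff >= BLOCK - TAIL else "wider"
        results.append((start // BLOCK, kind))
    return results


def find_tail_source(good: bytes, bad: bytes, block_index: int):
    """Look for an earlier block whose last 256 bytes (in the good
    copy) match the corrupted tail of the given block.

    Returns the index of the first such earlier block, or None.
    """
    tail_start = block_index * BLOCK + BLOCK - TAIL
    tail = bad[tail_start:tail_start + TAIL]
    for earlier in range(block_index):
        s = earlier * BLOCK + BLOCK - TAIL
        if good[s:s + TAIL] == tail:
            return earlier
    return None
```

A block reported as "wider" is only a hint of a full-page corruption, 
since some bytes can match by chance; the sketch does not try to prove 
the whole 4k differs.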

I'll follow the suggestion by Sarah in the other part of this thread and 
enable the pagealloc debug options, then put the machine/disks under 
load, and I'll keep an eye out for anything like what you described.

This will have to wait a bit though, as I have another bug to hunt as 
well: the journaled raid refuses to assemble, so with Song's help I'm 
chasing that issue first.

If not for btrfs, we probably would have been using the machine happily 
until now (blaming occasional detected issues on userspace stuff, 
usually some fat java mess).

Thanks for the detailed explanation of what happened in your case (and 
the span of kernel versions in which it happens is scary). The hardware 
indeed looks strikingly similar.