Hi,
On Thu, May 07, 2020 at 07:30:19PM +0200, Michal Soltys wrote:
> Note: this is just a general question - if anyone has experienced something
> similar or could suggest how to pinpoint / verify the actual cause.
>
> Thanks to btrfs's checksumming we discovered somewhat (even if quite
> rare) nasty silent corruption going on on one of our hosts. Or perhaps
> "corruption" is not the correct word - the files simply have a precise 4kb
> (1 page) of incorrect data. The incorrect pieces of data look fine on
> their own - as something that was previously in that place, or written
> from the wrong source.

"Me too!"
We are seeing 256-byte corruptions which are always the last 256b of a 4K
block. The 256b is very often a copy of a "last 256b of a 4K block" from
earlier in the file. We sometimes see multiple corruptions in the same
file, with each of the corruptions being a copy of a different 256b from
earlier in the file. The original 256b and the copied 256b aren't at any
identifiable regular offset from each other. Where the 256b isn't a copy
from earlier in the file, we haven't been able to work out where the data
came from.

I'd be really interested to hear if your problem is just in the last 256b
of the 4k block also!
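
If it helps with checking that, here's a rough sketch (Python; the 4 KiB
block size and 256-byte chunk size are assumptions based on what we're
seeing, and the script name is made up) that takes "cmp -l" output for a
corrupt file vs a known-good copy and reports where each run of differing
bytes sits within its 4 KiB block:

#!/usr/bin/env python3
# check_runs.py (hypothetical name): group the byte offsets reported by
# "cmp -l" into runs and show where each run falls within its 4 KiB block.
# Usage: cmp -l corrupt-file good-file | python3 check_runs.py
import sys

BLOCK = 4096   # assumed filesystem block size
CHUNK = 256    # size of the corruptions we keep seeing

# "cmp -l" prints 1-based decimal offsets; convert to 0-based.
offsets = sorted(int(line.split()[0]) - 1 for line in sys.stdin if line.strip())

# Group consecutive offsets into [start, end] runs.
runs = []
for off in offsets:
    if runs and off == runs[-1][1] + 1:
        runs[-1][1] = off
    else:
        runs.append([off, off])

for start, end in runs:
    pos = start % BLOCK
    in_tail = "yes" if pos >= BLOCK - CHUNK else "no"
    print(f"run at {start}, length {end - start + 1}, "
          f"offset {pos} within its 4 KiB block, in last 256b: {in_tail}")
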
We haven't been able to track down the origin of any of the copies where
it's not a 256b block from earlier in the file. I tried some extensive
analysis of some of these occurrences, including looking at files being
written around the same time, but wasn't able to identify where the data
came from. It could be the "last 256b of 4k block" from some other file
being written at the same time, or a non-256b aligned chunk, or indeed not
a copy of other file data at all.
See Also: https://lore.kernel.org/linux-xfs/20180322150226.GA31029@xxxxxxxxxxxx/
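
For anyone wanting to do a similar hunt, the per-chunk hashing boils down
to something like this (a minimal sketch only - the 256-byte alignment and
the command-line interface are my assumptions, not a polished tool):

#!/usr/bin/env python3
# Hash every 256-byte-aligned chunk of one or more candidate files and
# report chunks whose md5 matches the corrupt chunk's content.
# Usage: find_copy_source.py CORRUPT_FILE CORRUPT_OFFSET [CANDIDATE_FILE ...]
import hashlib
import sys

CHUNK = 256  # assumed granularity of the corruption

def chunk_hashes(path):
    """Yield (offset, md5 hexdigest) for each 256-byte-aligned chunk."""
    with open(path, "rb") as f:
        off = 0
        while True:
            data = f.read(CHUNK)
            if not data:
                break
            yield off, hashlib.md5(data).hexdigest()
            off += len(data)

corrupt_file, corrupt_offset = sys.argv[1], int(sys.argv[2])
candidates = sys.argv[3:] or [corrupt_file]

# md5 of the known-bad chunk in the corrupt file.
with open(corrupt_file, "rb") as f:
    f.seek(corrupt_offset)
    bad = hashlib.md5(f.read(CHUNK)).hexdigest()

# Report every other chunk with the same content.
for path in candidates:
    for off, digest in chunk_hashes(path):
        if digest == bad and not (path == corrupt_file and off == corrupt_offset):
            print(f"possible source: {path} offset {off}")
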
We've been able to detect these corruptions via an md5sum calculated as
the files are generated, where a later md5sum doesn't match the original.
We regularly see the md5sum match soon after the file is written (seconds
to minutes), and then go "bad" after doing a "vmtouch -e" to evict the
file from memory. I.e. it looks like the problem is occurring somewhere on
the write path to disk. We can move the corrupt file out of the way and
regenerate the file, then use 'cmp -l' to see where the corruption[s] are,
and calculate md5 sums for each 256b block in the file to identify where
the 256b was copied from.
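
The verify-after-evict step amounts to something like the following (again
just a sketch - the path handling is simplified, and posix_fadvise(DONTNEED)
stands in for the "vmtouch -e" we actually run):

#!/usr/bin/env python3
# Checksum a file while it's (probably) still in the page cache, ask the
# kernel to drop its cached pages, then checksum it again from disk.
# Usage: verify_after_evict.py FILE
import hashlib
import os
import sys

def md5_of(path, bufsize=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()

def evict(path):
    """Drop this file's pages from the page cache (like "vmtouch -e")."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # flush any dirty pages first (works on a read-only fd on Linux)
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

path = sys.argv[1]
before = md5_of(path)   # likely served from the page cache
evict(path)
after = md5_of(path)    # forces a re-read from disk
print("OK" if before == after else f"MISMATCH: {before} != {after}")
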
The corruptions are far more likely to occur during a scrub, although we
have seen a few of them when not scrubbing. We're currently working around
the issue by scrubbing infrequently, and trying to schedule scrubs during
periods of low write load.

> The hardware is (can provide more detailed info of course):
>
> - Supermicro X9DR7-LN4F
> - onboard LSI SAS2308 controller (2 SFF-8087 connectors, 1 connected to
>   the backplane)
> - 96 GB RAM (ECC)
> - 24-disk backplane
> - 1 array connected directly to the LSI controller (4 disks, mdraid5,
>   internal bitmap, 512kb chunk)
> - 1 array on the backplane (4 disks, mdraid5, journaled)
> - journal for the above array: mdraid1 on 2 SSDs (Micron 5300 Pro)
> - 1 btrfs raid1 boot array on the motherboard's SATA ports (older but
>   still fine Intel SSDs from the DC 3500 series)

Ours is on similar hardware:
- Supermicro X8DTH-IF
- LSI SAS 9211-8i (LSI SAS2008, PCI-e 2.0, multiple firmware versions)
- 192GB ECC RAM
- A mix of 12- and 24-bay expanders (some daisy-chained: LSI-expander-expander)

We swapped the LSI HBA for another of the same model, but the problem
persisted. We have a SAS9300 card on the way for testing.

> The RAID5 arrays are in an LVM volume group, and the logical volumes are
> used by VMs. Some of the volumes are linear, some use thin pools (with
> metadata on the aforementioned Intel SSDs, in a mirrored config). LVM uses
> a large extent size (120m) and the chunk size of the thin pools is set to
> 1.5m to match the underlying raid stripe. Everything is cleanly aligned as
> well.

We're not using VMs nor lvm thin on this storage.
Our main filesystem is xfs + lvm + raid6 and this is where we've seen all
but one of these corruptions (70-100 since Mar 2018).
The problem has occurred on all md arrays under the lvm, on disks from
multiple vendors and models, and on disks attached to all expanders.
We've seen one of these corruptions with xfs directly on a hdd partition,
i.e. no mdraid or lvm involved. That fs is an order of magnitude (or more)
less utilised than the main fs in terms of data being written.

> We did not manage to rule out (though somewhat _highly_ unlikely):
>
> - lvm thin (so far the issue has always occurred on lvm thin pools)
> - mdraid (so far the issue has always been on mdraid-managed arrays)
> - the kernel (tested with - in this case - Debian's 5.2 and 5.4 kernels,
>   and it happened with both - so it would imply a rather longstanding bug
>   somewhere)

- we're not using lvm thin
- problem has occurred once on non-mdraid (xfs directly on a hdd partition)
- problem NOT seen on kernel 3.18.25
- problem seen, so far, on kernels 4.4.153 - 5.4.2

> And finally - so far - the issue has never occurred:
>
> - directly on a disk
> - directly on mdraid
> - on a linear lvm volume on top of mdraid

- seen once directly on disk (partition)
- we don't use mdraid directly
- our problem arises on linear lvm on top of mdraid (raid6)

> As far as the issue goes, it's:
>
> - always a 4kb chunk that is incorrect - in a ~1 tb file there can be from
>   a few to a few dozen such chunks
> - we also found (or rather btrfs scrub did) a few small damaged files as
>   well
> - the chunks look like a correct piece of different or previous data
>
> The 4kb is, well, weird? It doesn't match any chunk/stripe sizes anywhere
> across the stack (lvm - 120m extents, 1.5m chunks on thin pools; mdraid -
> default 512kb chunks). It does nicely fit a page though ...
>
> Anyway, if anyone has any ideas or suggestions about what could be
> happening (perhaps with this particular motherboard or vendor) or how to
> pinpoint the cause - I'd be grateful for any.

Likewise!
Cheers,
Chris