Hi list,
today I had an unexpected filesystem corruption on a RAID1 machine used
for backup purposes. I would like to reconstruct what possibly happened,
and why, so I am asking for your help.
System specs:
- OS CentOS 7.2 x86_64 with kernel 3.10.0-514.6.1.el7.x86_64
- 2x SEAGATE ST4000VN000-1H4168 (4 TB 5900rpm disks)
- 4 GB DDR3 RAM
- Intel(R) Pentium(R) CPU G3260 @ 3.30GHz
Today, I found the machine crashed with an XFS warning about corrupted
metadata. The warning stated that in-core (i.e. in-memory) data corruption
was detected, so, suspecting a DRAM-related problem (no ECC memory on
this small box...), I simply rebooted the machine. To no avail - the same
problem immediately reappeared, preventing the machine from booting (the
root filesystem did not mount).
After the filesystem was repaired (with significant signs of corruption,
partly due to the clearing of the XFS journal), I looked at dmesg and
found something interesting: a RAID resync action had been *automatically*
performed, as happens when re-attaching a (detached) disk.
I started investigating in /var/log/messages and found plenty of these
errors, spanning many days:
...
Jul 10 03:24:01 nas kernel: ata1.00: failed command: READ FPDMA QUEUED
Jul 10 14:50:54 nas kernel: ata1.00: failed command: FLUSH CACHE EXT
Jul 12 03:14:41 nas kernel: ata1.00: failed command: WRITE FPDMA QUEUED
...
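For reference, a minimal sketch of how such lines can be extracted and
counted per ATA port (the helper name is mine, for illustration only):

```shell
#!/bin/sh
# Count "failed command" kernel log lines per ATA port in a
# /var/log/messages-style file passed as $1.
count_ata_failures() {
    awk '/failed command/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^ata[0-9.]+:$/) { sub(":", "", $i); n[$i]++ }
         }
         END { for (p in n) print p, n[p] }' "$1"
}
```

In my case all the failures came from ata1, i.e. the first disk.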
To me, it seems that one of the disks (the first one, sda) had problems
executing some SATA commands, falling out of sync with the second one
(sdb). However, it was not kicked out of the array: both
/var/log/messages *and* my custom monitoring script (which keeps an eye
on /proc/mdstat) reported nothing. Moreover, inspecting both the SMART
values and the SMART error log shows *no* errors at all.
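For context, the check my script performs is roughly the following (a
simplified sketch, not the actual script; the function name is mine). It
flags an array as degraded when its /proc/mdstat status string (e.g.
[UU] vs [U_]) contains an underscore:

```shell
#!/bin/sh
# Report md arrays whose member-status string shows a missing
# member ("_"). $1 is an mdstat-formatted file (normally /proc/mdstat).
check_mdstat() {
    awk '/^md/ { dev = $1 }
         /\[[U_]*_[U_]*\]/ { print dev " degraded"; bad = 1 }
         END { exit bad }' "$1"
}
```

A check of this kind only sees members that md itself has marked as
failed or removed, which is exactly why it stayed silent here.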
Question 1: is it possible to have such a situation, where a failed
command *silently* puts the array in an out-of-sync state?
At a certain point, the machine crashed. I noticed and rebooted it.
Question 2: is it possible that the failing disk went offline just
before the crash and that, on reboot, mdadm re-added it to the array?
Question 3: if so, is it possible that the corruption was due to the
first disk being the one read by the md array and, by extension, by the
filesystem?
Any thoughts will be greatly appreciated.
Thanks.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@xxxxxxxxxx - info@xxxxxxxxxx
GPG public key ID: FF5F32A8
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html