Re: Filesystem corruption on RAID1

Reindl Harald <h.reindl@xxxxxxxxxxxxx> · Fri, 14 Jul 2017 12:58:39 +0200

On 14.07.2017 at 12:46, Gionatan Danti wrote:
On 14-07-2017 02:32, Reindl Harald wrote:
because you won't be that happy when the kernel spits out a disk each
time a random SATA command times out - the 4 RAID10 disks on my
workstation are from 2011 and have shown such timeouts several times in
the past, yet they are just fine

here you go:
http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-timeouts/ 

Hi, so a premature/preventive drive detachment is not a silver bullet, 
and I buy it. However, I would at least expect this behavior to be 
configurable. Maybe it is, and I am missing something?

dunno, maybe it is, but it wouldn't be wise: in the case of RAID5,
rebuilding after a disk failure on a large array would become even more
dangerous than it already is

Anyway, what really surprises me is *not* that the drive was not detached,
but rather that corruption was allowed to make its way into real data. I
naively expect that when a WRITE_QUEUED or CACHE_FLUSH command aborts/fails
(which *will* cause data corruption if not properly handled) the I/O
layer has the following possibilities:

given that i have seen at least SD-cards confirming successful writes for
hours with not a single error in the syslog, maybe it was one of the rare
cases where the hardware lied, and if that is the case you have nearly no
chance at the software layer except to verify each write with an uncached
read of the block, which would have an unacceptable impact on performance
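
to make the cost concrete, here is a rough user-space sketch of such a
verify-after-write, assuming Linux and Python 3. the name write_and_verify
is made up for illustration, this is not what md or the block layer does,
and posix_fadvise is only a hint to drop the page cache: it cannot bypass
the drive's own cache, so hardware that lies could still pass this check

```python
import os


def write_and_verify(path, offset, data):
    """Write data at offset, flush it, then re-read the range and compare.

    Hypothetical helper for illustration only. Returns True when the bytes
    read back match what was written, False otherwise.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        written = os.pwrite(fd, data, offset)
        if written != len(data):
            return False  # short write, treat it as a failure
        # Flush dirty pages (and ask the device to flush its write cache).
        os.fsync(fd)
        # Advisory only: ask the kernel to drop cached pages for this range
        # so the read below is more likely to be served by the device.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, offset, len(data), os.POSIX_FADV_DONTNEED)
        return os.pread(fd, len(data), offset) == data
    finally:
        os.close(fd)
```

every single write pays for an fsync plus a read of the same range, which
is where the performance impact comes from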
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html