Re: Filesystem corruption on RAID1

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Thu, 13 Jul 2017 19:48:29 -0600

On Thu, Jul 13, 2017 at 4:34 PM, Gionatan Danti <g.danti@xxxxxxxxxx> wrote:
> On 13-07-2017 23:34 Reindl Harald wrote:
>>
>> maybe because the disk is, well, not in good shape and doesn't know
>> that by itself
>>
>
> But the kernel *does* know that, as the dmesg entries clearly show.
> Basically, some SATA commands timed out and/or were aborted. As the kernel
> reported these errors in dmesg, why not use this information to stop a
> failing disk?
>
>>
>> (and no, filesystems with checksums won't magically recover
>> your data, they just tell you earlier that it is gone)
>>
>
> Checksummed filesystems that integrate their own block-level management
> (read: ZFS or BTRFS) can recover the missing/corrupted data from the
> healthy disks, discarding corrupted data based on the checksum mismatch.
>
> Anyway, this has nothing to do with linux software RAID. I was only
> "thinking out loud" :)
> Thanks.
>
>

Dealing with device betrayal at a hardware level is a difficult
problem. I'm under the impression the md driver is very intolerant of
write failures and will eject a drive after even a single failed write?
A failed write would seem to be disqualifying for a RAID member.

Btrfs still tolerates many errors, read and write, so it can still be
a problem there too. But yes, it does have an independent way
(checksums) to unambiguously determine whether file system metadata,
or extent data, is corrupt. It also often keeps two copies of metadata
(the file system itself). Another option (read-only) is dm-verity, but
that is not RAID; it uses forward error correction and cryptographic
hash verification.
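
To make the checksum point concrete, here is a rough Python sketch of
how a checksummed mirror can pick the good copy and rewrite the bad one.
It is only an illustration under stated assumptions, not how Btrfs or
ZFS are implemented: the "devices" are plain in-memory lists, the names
checksum() and read_with_repair() are made up for this example, and
zlib.crc32 stands in for whatever checksum the real file system uses.

```python
import zlib


def checksum(data: bytes) -> int:
    # Stand-in checksum (plain CRC-32); an assumption for this sketch.
    return zlib.crc32(data)


def read_with_repair(mirrors, index, expected):
    """Return (good_block, repaired_device_ids) for block `index`.

    `mirrors` is a list of hypothetical devices, each a list of bytes
    blocks. `expected` is the checksum stored independently of the data.
    """
    good = None
    bad = []
    for dev_id, dev in enumerate(mirrors):
        block = dev[index]
        if checksum(block) == expected:
            if good is None:
                good = block  # first copy that verifies wins
        else:
            bad.append(dev_id)  # mismatch: this copy is corrupt
    if good is None:
        # No healthy copy left: the checksum can only report the loss.
        raise IOError("all copies of block %d fail checksum" % index)
    for dev_id in bad:
        mirrors[dev_id][index] = good  # rewrite the corrupt copy
    return good, bad


# Usage: disk_b returns silently corrupted data for block 0.
payload = b"extent data"
csum = checksum(payload)
disk_a = [payload]
disk_b = [b"extent dat\x00"]
data, repaired = read_with_repair([disk_a, disk_b], 0, csum)
```

Without the independently stored checksum, a plain mirror that reads
two differing copies has no way to tell which one is correct. With it,
the read succeeds as long as at least one copy verifies, and when none
does the error is at least detected rather than returned as good data.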

-- 
Chris Murphy