Re: Filesystem corruption on RAID1

On 13-07-2017 18:48, Roman Mamedov wrote:

Failed reads are not as bad, as they are just retried.


I agree, I reported them only to give a broad picture of the system state :)

Jul 12 03:14:41 nas kernel: ata1.00: failed command: WRITE FPDMA QUEUED

But these WILL cause incorrect data written to disk, in my experience. After that, one of your disks will contain some corruption, whether in files, or (as
you discovered) in the filesystem itself.

This is the "scary" part: if the write was not acknowledged as committed to disk, why did the block layer not report that to the MD driver? Or, if the block layer did report it, why did MD not kick the disk out of the array?

mdadm may or may not read from that disk, as it chooses the mirror for reads pretty much randomly, using the least loaded one. And even though the other disk still contains good data, there is no mechanism for the user-space to say "hey, this doesn't look right, what's on the other mirror?"

I understand and agree with that. I'm fully aware that MD cannot (by design) detect or correct corrupted data. However, I wonder whether, and why, a disk with obvious errors was not kicked out of the array.
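As an aside, while MD cannot tell which mirror holds the bad copy, it can at least count disagreements between mirrors, so a periodic scrub makes silent divergence visible. A minimal sketch using the standard md sysfs interface, assuming the array is /dev/md0 (device name is hypothetical):

```shell
# Trigger a "check" scrub: MD reads every member and counts
# sectors whose mirrored copies disagree, without rewriting anything.
echo check > /sys/block/md0/md/sync_action

# Watch progress via /proc/mdstat; once the scrub completes,
# read the mismatch counter. A non-zero value means the mirrors
# have silently diverged somewhere.
cat /sys/block/md0/md/mismatch_cnt
```

Many distributions ship a cron job (often called checkarray or raid-check) that runs exactly this scrub monthly.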


Check your cables and/or disks themselves.


I tried reseating and swapping the cables ;)
Let's see if the problem disappears, or if it "follows" the cable/drive/interface...
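For readers hitting the same symptoms: cable and power problems usually leave distinctive traces that are quick to check before (or while) swapping hardware. A sketch, assuming the suspect disk is /dev/sda (hypothetical):

```shell
# SMART attribute 199 (UDMA_CRC_Error_Count) increments on
# transfer errors between host and drive; a rising value points
# at the cable or connector rather than the platters.
smartctl -a /dev/sda | grep -i crc

# Kernel log entries about link resets or downshifted link speed
# while the array is under load also point at the physical link.
dmesg | grep -i 'ata1'
```

If the CRC counter keeps climbing after a cable swap, the drive's own interface or the controller port becomes the next suspect.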

If you know that only one disk had these write errors all the time, you could
try disconnecting it from mirror, and checking if you can get a more
consistent view of the filesystem on the remaining one.
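That experiment needs no physical unplugging; MD can detach a member administratively. A sketch with the standard mdadm commands, assuming the suspect member is /dev/sdb1 in /dev/md0 (both hypothetical):

```shell
# Mark the suspect member as failed, then remove it from the array.
mdadm /dev/md0 --fail /dev/sdb1
mdadm /dev/md0 --remove /dev/sdb1

# Mount the now-degraded array read-only, so the filesystem is
# inspected exclusively from the remaining mirror, without risking
# further writes.
mount -o ro /dev/md0 /mnt

# If the surviving copy proves good, re-add the other disk later
# and let MD resync it from the clean mirror:
#   mdadm /dev/md0 --re-add /dev/sdb1
```

Keeping the degraded array read-only during the comparison matters: any write would make the two mirrors diverge further and muddy the test.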

P.S: about my case (which I witnessed on a RAID6):

* copy a file to the array; one disk hits tons of WRITE FPDMA QUEUED
  errors (due to insufficient power and/or a bad data cable);
* the file that was just copied turns out to be corrupted when read back;
* the problem disk WILL NOT get kicked from the array during this.

Wow, a die-hard data corruption. It seems VERY similar to what happened to me, and the key problem seems to be the same: a failing drive was not detached from the array in a timely fashion.

Thanks very much for reporting, Roman.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@xxxxxxxxxx - info@xxxxxxxxxx
GPG public key ID: FF5F32A8
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


