On 09/01/17 08:06, Wols Lists wrote:
> On 08/01/17 20:46, Piergiorgio Sartor wrote:
[trim]
> If one of the parity sectors is corrupted, it's easy. Calculate parity from the data, and either P or Q will be wrong, so fix it. But if it's a *data* sector that's corrupted, both P and Q will be wrong. How easy is it to work back from that, and work out *which* data sector is wrong? My fu makes me think you can't, though I could quite easily be wrong :-)
My understanding of RAID6 is that you CAN say which of the data/P/Q is wrong, if one assumes only one of them is wrong. Is this not what raid6check claims to do? "In case of parity mismatches, 'raid6check' reports, if possible, which component drive could be responsible."
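For what it's worth, the location math is straightforward: recompute P and Q from the data as read, XOR against the stored P and Q to get two syndromes, and (assuming a single bad component) the ratio of the syndromes in GF(2^8) identifies the drive. A minimal Python sketch of the idea — my own toy reimplementation, not the actual md or raid6check code:

```python
# Toy RAID6 single-error location in GF(2^8), using the RAID6
# polynomial 0x11d and generator g = 2 (same field as Linux md).

# Build exp/log tables for GF(2^8).
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 512):
    EXP[i] = EXP[i - 255]   # wrap so EXP[a + b] works without mod

def locate_error(data, p_stored, q_stored):
    """data: one byte per data drive at the same stripe offset.
    Returns 'ok', 'P', 'Q', or the index of the suspect data drive,
    assuming at most one component is wrong."""
    p_calc = 0
    q_calc = 0
    for i, d in enumerate(data):
        p_calc ^= d
        q_calc ^= EXP[LOG[d] + i] if d else 0   # g^i * d in GF(2^8)
    sp = p_calc ^ p_stored
    sq = q_calc ^ q_stored
    if sp == 0 and sq == 0:
        return 'ok'
    if sp == 0:
        return 'Q'          # only Q disagrees
    if sq == 0:
        return 'P'          # only P disagrees
    # If data drive z holds error E, then sp = E and sq = g^z * E,
    # so z = log(sq) - log(sp) mod 255.
    z = (LOG[sq] - LOG[sp]) % 255
    return z if z < len(data) else 'unlocatable'
```

Of course this only locates the error under the single-corruption assumption; with two or more bad components the syndromes can point at an innocent drive, which is presumably why raid6check hedges with "could be responsible".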
> But should that even happen, unless a disk is on its way out, anyway?
Not so. I get, from time to time, non-zero mismatch counts where I saw no disk errors of any sort in the kernel messages or in SMART status.
> I remember years ago, back in the 80s, our minicomputers had error-correction in the drive. I don't remember the algorithm, but it wrote a 16-bit word to disk for each 8-bit data byte. The first half was the original data, and the second half was some parity pattern such that for any single-bit corruption you knew which half was corrupt, so you could either throw away the corrupt parity or recreate the correct data from it. Even with a 2-bit error I think it was >90% detection and recreation. I can't imagine something like that not being in drive hardware today.
The disk thinks it has good data but md thinks not. Maybe bad data was written due to some other bug? A corner case when the system rebooted unexpectedly? Maybe the controller corrupted the data?
> Cheers,
> Wol
--
Eyal Lebedinsky (eyal@xxxxxxxxxxxxxx)
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html