Re: md devices: Suggestion for in place time and checksum within the RAID

Joachim Otahal <Jou@xxxxxxx> · Sun, 14 Mar 2010 12:58:50 +0100

Keld Simonsen wrote:
On Sun, Mar 14, 2010 at 02:25:38AM +0100, Joachim Otahal wrote:
Question:
Will RAID4/5/6 in the future use the parity upon read too? Currently
it would not detect wrong data reads from the parity chunk, resulting
in a disaster when it is actually needed.

Do those plans already exist and my post was completely useless?

Sorry that I cannot give patches, my last kernel patch + compile was
2.2.26, since then I never compiled a kernel.

Joachim Otahal

Hmm, would that not be detected by a check - initiated by cron?

Debian schedules a monthly check (first Sunday of the month, 00:57),
which IMHO is the best possible time and frequency: less often is
dangerous, more often is useless. I added a cron job that checks every
15 minutes for changes in /proc/mdstat and in the SMART info
(reallocated sector count and the drive's internal error list only) and
emails me if anything changed since the previous check.
I use the script because /etc/mdadm/mdadm.conf only takes ONE email
address and requires a local MTA to be installed; I always uninstall the
local MTA if the machine is not going to be a mail server.
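
A minimal sketch of the "did anything change since the last run" part of
such a cron job. This is not my actual script; the function name
has_changed and the state file location are made up for illustration,
and the mail sending is left out on purpose.

```python
import hashlib
from pathlib import Path


def has_changed(current_text: str, state_file: Path) -> bool:
    """Compare current_text with the digest stored by the previous run.

    The stored digest is always updated. The very first run (no state
    file yet) reports "no change" so that it does not trigger a mail.
    """
    new_digest = hashlib.sha256(current_text.encode()).hexdigest()
    old_digest = state_file.read_text().strip() if state_file.exists() else None
    state_file.write_text(new_digest)
    return old_digest is not None and old_digest != new_digest
```

A cron entry would call it with the content of /proc/mdstat, e.g.
has_changed(Path("/proc/mdstat").read_text(), Path("/var/tmp/mdstat.sha256")),
and send the mail when it returns True. Note that /proc/mdstat also
changes continuously while a resync or check is running, so a real
script would have to filter out the progress lines first.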
But why not check parity during normal read operations? Was that a
performance decision? It is not _that_ bad not to do it during normal
operation, since the good distributions schedule a regular check, but
could it be made controllable by something like
echo "1" > /proc/sys/dev/raid/always_read_parity ?
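
To make clear what "checking parity on read" would mean for a single
RAID5 stripe, here is a small sketch. It is plain user-space Python and
has nothing to do with the md code; the always_read_parity knob above is
only my suggestion and does not exist, and the function names are
invented for this example.

```python
from functools import reduce


def xor_chunks(chunks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def stripe_is_consistent(data_chunks, parity_chunk):
    """True if the parity chunk matches the XOR of all data chunks."""
    return xor_chunks(data_chunks) == parity_chunk
```

The cost is obvious: every read has to touch all drives of the stripe
instead of one. And a mismatch only says that _something_ in the stripe
is wrong, not which chunk.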

Which data to believe could then be determined according to a number
of techniques, like for a 3 copy array the best 2 out of 3,
investigating the error log of the drives, and relaying the error
information to the file system layer for manual inspection and repair.
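
The "best 2 out of 3" idea can be sketched as a simple majority vote
over the copies. This is only an illustration with an invented function
name, not something md does today.

```python
from collections import Counter


def majority_copy(copies):
    """Return the content held by a strict majority of copies, else None."""
    if not copies:
        return None
    value, count = Counter(copies).most_common(1)[0]
    return value if count * 2 > len(copies) else None
```

With only two copies that disagree there is no majority, so the function
returns None and some other criterion has to decide.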

That is a matter of "believing" and "best guess", not of "knowing" which
copy contains the correct data in redundant array levels. Hence the
suggestion from before to include a timestamp + ECC (or better) at the
RAID level, so we actually _know_ which is the newest and we _know_
which stripe has consistent data. No guessing needed; we can apply
crystal clear rules.
My ruleset would be:
first use: newest time and correct ECC
second use: newest time and correctable ECC
third use: any time and correct ECC (hint possible filesystem error to
the next layer)
fourth use: any time and correctable ECC (hint possible filesystem error
to the next layer)
fifth use: current implementation, use the data from the active drive
ordering according to the list in the superblock + hint possible
filesystem error to the next layer.
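
The ruleset above could be sketched like this. The per-chunk timestamp
and ECC state are the proposed metadata and do not exist in md; all
names (Ecc, Copy, choose_copy) are invented for the example. For the
"any time" rules I assume that the newest of the remaining candidates is
preferred, which the ruleset itself does not say.

```python
from dataclasses import dataclass
from enum import Enum


class Ecc(Enum):
    CORRECT = 0       # checksum matches
    CORRECTABLE = 1   # checksum mismatch, but repairable via ECC
    BAD = 2           # not repairable


@dataclass
class Copy:
    drive_index: int  # position in the superblock's drive list
    timestamp: int    # hypothetical per-chunk write time
    ecc: Ecc


def choose_copy(copies):
    """Return (chosen copy, hint_next_layer) for a non-empty list of copies."""
    newest = max(c.timestamp for c in copies)
    # Rules 1 and 2: newest time, ECC correct, then correctable.
    for status in (Ecc.CORRECT, Ecc.CORRECTABLE):
        for c in copies:
            if c.timestamp == newest and c.ecc is status:
                return c, False
    # Rules 3 and 4: any time, ECC correct, then correctable, plus hint.
    for status in (Ecc.CORRECT, Ecc.CORRECTABLE):
        candidates = [c for c in copies if c.ecc is status]
        if candidates:
            return max(candidates, key=lambda c: c.timestamp), True
    # Rule 5: fall back to the superblock drive order, plus hint.
    return min(copies, key=lambda c: c.drive_index), True
```

The second return value is the "hint possible filesystem error to the
next layer" flag.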
A RAID-aware filesystem would be perfect (compare with ZFS on Solaris),
since it eliminates the write hole problem, but doing the checksum at
the RAID level makes it more flexible.

I would expect this is not something that occurs frequently, so maybe
once a year for the unlucky or systems with many disks.

If you are paranoid about it, corrupting really important data once in 5
years is too much. Implementing the checksum + timestamp would lift
Linux software RAID to the next level, closer to enterprise storage,
where such techniques are actually in use. At its current level it is
very good and solid, so it is time to get to the next level for
long-term archiving.

regards,

Joachim Otahal
