Oh! Thank you! I really wanted to see a reliable "what's supposed to
happen" sequence!
As for my case, those were indeed, um, "cheap desktop drives" - to be
precise, some 80 GB IDE drives in a Pentium 4 machine; "it works well
for a small file server", I thought, oblivious to the finer details
of how failures are handled... But I also have "big" file servers, so
that timeout mismatch issue is something worth paying attention to!
And also, now I understand why I probably "should have been scrubbing".
=/ Do I understand correctly that "scrubbing" means those "monthly
redundancy checks" that mdadm suggests? And I suppose what it does is
just the same - read every sector and attempt to write it back upon
failure, otherwise kicking the device?
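For reference, a check of this kind can also be started by hand through md's sysfs interface instead of waiting for the monthly job. Below is a minimal sketch, assuming an array named md0 (a hypothetical name) and root privileges; the `sysfs_root` parameter is not part of any md tooling and exists only so the functions can be tried against a scratch directory.

```python
import os


def start_scrub(md_name, sysfs_root="/sys"):
    """Request a redundancy check by writing 'check' to md's sync_action file."""
    path = os.path.join(sysfs_root, "block", md_name, "md", "sync_action")
    with open(path, "w") as f:
        f.write("check\n")
    return path


def mismatch_count(md_name, sysfs_root="/sys"):
    """Read the mismatch counter that md reports for the array."""
    path = os.path.join(sysfs_root, "block", md_name, "md", "mismatch_cnt")
    with open(path) as f:
        return int(f.read().strip())


# Usage (as root, array name is an assumption):
#   start_scrub("md0")
#   ... wait for the check to finish, then:
#   print(mismatch_count("md0"))
```

Progress of a running check shows up in /proc/mdstat.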
So, I understand a common problem now: the read timeout on the "desktop"
drives is too long, which makes sense for the desktops, but not for
RAIDs, because the "write back attempt" fails and leads to "BOOM" and
kick. Enterprise-grade drives, however, offer an option to change their
timeout, which is called "TL;DR technology" (yes, that's how I'm going
to call it! Because I can't remember the acronym no matter how many times
I read it, and the meaning kinda fits!). And what about drives that do
not support it?.. Do they even have some kind of huge timeout or
something?.. Yesterday I was checking one drive for bad blocks
(badblocks read-only test), and it took no more than two seconds per
block to confirm its... badness!
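The acronym in question is presumably SCT ERC (Error Recovery Control), which smartctl can query and set on drives that support it. A rough sketch of setting it from a script, assuming smartmontools is installed and that 7.0 seconds (70 deciseconds) is the wanted limit; the helper names and the 70/70 default are illustrative choices, not something from this thread.

```python
import subprocess


def scterc_command(device, read_ds=70, write_ds=70):
    """Build the smartctl call that sets the read/write error recovery
    limits; smartctl takes the values in tenths of a second."""
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


def set_scterc(device, read_ds=70, write_ds=70):
    """Run smartctl (needs root) and return the completed process so the
    caller can inspect its output; drives without SCT ERC report that there."""
    return subprocess.run(
        scterc_command(device, read_ds, write_ds),
        capture_output=True,
        text=True,
    )


# Usage (as root, device name is an assumption):
#   result = set_scterc("/dev/sda")
#   print(result.stdout)
```

On many drives the setting does not survive a power cycle, so it would have to be applied at every boot.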
As I understand, one way around this problem is to change the kernel
timeout to exceed the drive timeout by changing
/sys/block/sd?/device/timeout to something larger than the default 30,
but I'd have to do that after every reboot. Is all that correct?
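Since values written under /sys do not persist across reboots, one way to handle this would be a small script run from a boot-time hook. A minimal sketch, assuming 180 seconds is long enough for the drives involved (that figure is an assumption, not a measured value); `sysfs_root` is a hypothetical parameter added only so the function can be tried against a scratch directory.

```python
import glob
import os


def set_kernel_timeouts(seconds=180, sysfs_root="/sys"):
    """Write the given command timeout (in seconds) for every sd* device
    and return the list of sysfs files that were updated."""
    pattern = os.path.join(sysfs_root, "block", "sd*", "device", "timeout")
    changed = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as f:
            f.write(f"{seconds}\n")
        changed.append(path)
    return changed


# Usage (as root, e.g. from a boot script):
#   set_kernel_timeouts(180)
```

This touches every sd* device; a real setup would likely restrict it to the drives that cannot limit their own error recovery time.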
Still, I don't think it has anything to do with what happened to my
"small file server"... It was the opposite; for some reason, the failing
drive was not kicked from the array. But it happened a while ago, and I
destroyed
the array afterwards, so I can't get any more data about that incident.
But I've got what I wanted: I now know what is supposed to happen
when a drive in a RAID fails, and it's not what happened that time. And
I know I should set up proper TL;DR timeouts and scrubbing...
--
darkpenguin