Re: md failing mechanism

Oh! Thank you! That's exactly what I wanted - a reliable "what's supposed to happen" sequence!

As for my case, those were indeed, um, "cheap desktop drives" - to be precise, some 80 GB IDE drives in a Pentium 4 machine; "it works well enough for a small file server", I thought, oblivious to the finer details of failure handling... But I also have "big" file servers, so that timeout mismatch issue is definitely something worth paying attention to!

And now I also understand why I probably "should have been scrubbing". =/ Do I understand correctly that "scrubbing" means those "monthly redundancy checks" that mdadm suggests? And I suppose what it does is much the same thing - read every sector, rewrite any sector that fails (from the redundancy), and kick the device if the rewrite fails?
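
(In case it helps anyone reading this in the archives: as far as I can tell, a scrub can also be started by hand through sysfs - "md0" below is just an example name for the array:

    # start a read-only consistency check of /dev/md0
    echo check > /sys/block/md0/md/sync_action
    # watch the progress
    cat /proc/mdstat
    # 'repair' does the same pass, but also rewrites any mismatches it finds
    echo repair > /sys/block/md0/md/sync_action

On Debian and its derivatives, I believe the packaged monthly cron job runs essentially the "check" variant via /usr/share/mdadm/checkarray.)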


So, I understand a common problem now: the internal error recovery on "desktop" drives can take too long - longer than the kernel's 30-second command timeout. That makes sense for a desktop, but not for a RAID, because the kernel gives up and resets the drive before it answers, the "write back attempt" fails, and that leads to "BOOM" and a kick. Enterprise-grade drives, however, offer an option to limit that recovery time, which is called "TL;DR technology" (yes, that's how I'm going to call it! Because I can't remember the acronym no matter how many times I read it, and the meaning kinda fits!). And what about drives that do not support it?.. Do they even have some kind of huge timeout or something?.. Yesterday I was checking one drive for bad blocks (a badblocks read-only test), and it took no more than two seconds per block to confirm its... badness!
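
(For the archives: the acronym I can't hold onto appears to be TLER, or SCT ERC in smartctl's terms. From what I gather, on drives that do support it, smartctl can query and set the limit - the values are in tenths of a second, and sdX is just a placeholder:

    # query the current SCT Error Recovery Control setting (if supported)
    smartctl -l scterc /dev/sdX
    # limit read and write error recovery to 7 seconds each
    smartctl -l scterc,70,70 /dev/sdX

From what I've read, on many drives this setting does not survive a power cycle, so it would need re-applying at every boot.)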

As I understand it, one way around this problem is to make the kernel timeout exceed the drive's timeout by changing /sys/block/sd?/device/timeout to something larger than the default 30 seconds - but I'd have to do that after every reboot. Is all that correct?
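
(Something along these lines at boot, I suppose - a sketch of what I'd drop into rc.local or a boot script; the 180 is just a deliberately generous value, not anything official:

    # raise the kernel's SCSI command timeout to 180s for every sd* device
    for x in /sys/block/sd*/device/timeout; do
        echo 180 > "$x"
    done
)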


Still, I don't think any of this explains what happened to my "small file server"... There it was the opposite: for some reason, the drive was not kicked from the array. But that happened a while ago, and I've destroyed the array since, so I can't get any more data about the incident. In any case, I've got what I wanted: now I know what is supposed to happen when a drive in a RAID fails, and it's not what happened that time. And I know I should set up proper "TL;DR" timeouts and scrubbing...


--
darkpenguin