>>>>> "Phil" == Phil Turmel <philip@xxxxxxxxxx> writes: Phil> On 05/03/2013 09:52 AM, John Stoffel wrote: >> >> After watching endless threads about RAID5 arrays losing a disk, and >> then losing a second during the rebuild, I wonder if it would make >> sense to: >> >> - have MD automatically increase all disk timeouts when doing a >> rebuild. The idea being that we are more tolerant of a bad sector >> when rebuilding? The idea would be to NOT just evict disks when in >> potentially bad situations without trying really hard. Phil> This would be conterproductive for those users who actually Phil> follow manufacturer guidelines when selecting drives for their Phil> arrays. Well for them, which is drives supporting STEC, etc, you'd skip that step. But for those using consumer drives, it might make sense. And I didn't say to make this change for all arrays, just for those in a rebuilding state where losing another disk would be potentially fatal. Phil> Anyways, it's a policy issue that belongs in userspace. Distros Phil> can do this today if they want. There's no lack of scripts in Phil> this list's archives. Sure, but I'm saying that MD should push the policy to default to doing this. You can turn if off if you like and if you know enough. >> - Automatically setup an automatic scrub of the array that happens >> weekly unless you explicitly turn it off. This would possibly >> require changes from the distros, but if it could be made a core >> part of MD so that all the blocks in the array get read each week, >> that would help with silent failures. Phil> I understand some distros already do this. >> We've got all these compute cycles kicking around that could be used >> to make things even more reliable, we should be using them in some >> smart way. Phil> But the "smart way" varies with the hardware at hand. There's Phil> no "one size fits all" solution here. What's the common thread? A RAID5 loses a disk. While rebuilding, another disk goes south. Poof! The entire array is toast until you go through alot of manual steps to re-create. All I'm suggesting is that when in a degraded state, MD automatically becomes more tolerant of timeouts, errors and tries harder to keep going. John -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html