On 3/31/19 2:14 PM, Jorge R. Frank wrote:
On 3/31/19 00:03, Phil Turmel wrote:
Consider not buying cheap drives when the time comes to replace. The
boot script will suit until then.
In my defense, I was young, stupid, and unsupervised when I built the
array. Hard to argue with the results. The system has been running
practically 24/7 since December 2008 and this is the first glitch I
couldn't fix by simply re-seating SATA cables and rebooting.
You've been extraordinarily lucky.
One thing I would like to confirm is where to call the SCT ERC script in
the boot process. The wiki wasn't clear on that point.
It's not clear because it varies so much from distro to distro and even
within distro versions. Basically, it should be in your distro's
version of rc.local, or even better, triggered by udev rules.
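As a rough sketch of the rc.local-style approach: loop over the array members, try to set a 7-second SCT ERC, and fall back to a long kernel timeout on drives that refuse the command. The device names are illustrative, not your actual members, and the fallback assumes smartctl exits nonzero when SCT ERC is unsupported.

```shell
#!/bin/sh
# Set 7.0-second read/write error recovery on each array member drive.
# Substitute your real device names for sda..sde.
for dev in sda sdb sdd sde; do
    if ! smartctl -l scterc,70,70 /dev/$dev >/dev/null 2>&1; then
        # Drive doesn't support SCT ERC: raise the kernel's command
        # timeout instead, so md outwaits the drive's internal retries.
        echo 180 > /sys/block/$dev/device/timeout
    fi
done
```

The udev variant runs the same smartctl call per-device at hotplug time, e.g. a rule like `ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"` (path to smartctl may differ on your distro).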
All of this is consistent with a controller issue knocking out those
two drives simultaneously. The correct solution is to use --assemble
--force with explicit device names (not using --scan).
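A minimal sketch of that invocation, assuming the array is md0 and the members are whole-disk or first-partition devices (substitute your real names; don't trust these):

```shell
# Stop any half-assembled remnant first, then force-assemble with
# explicit member devices rather than --scan.
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

# Verify every member came back active before touching the filesystem.
cat /proc/mdstat
mdadm --detail /dev/md0
```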
You should use fsck to clean up any unavoidable fs corruption from
in-flight I/O before mounting.
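For an ext-family filesystem, that would look something like the following; the read-only pass first lets you gauge the damage before committing to repairs:

```shell
# Report-only pass: makes no changes, just lists what fsck would fix.
fsck -n /dev/md0
# Actual repair, answering yes to prompts (ext2/3/4-style semantics).
fsck -y /dev/md0
```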
Would you recommend explicitly including all four devices, since sdd and
sde have the same event count? Or just three, arbitrarily picking one of
sdd/sde to include, then adding a new fourth drive?
Use all four. That way, if there are any lurking UREs, the array will
fix itself (slowly, because of the long timeouts).
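You can force that self-healing pass rather than waiting for UREs to surface during normal I/O, by kicking off a scrub once the array is assembled (md0 is illustrative):

```shell
# Walk every sector; md rewrites any unreadable blocks from parity.
echo repair > /sys/block/md0/md/sync_action

# Watch progress -- slow on drives with long internal retry timeouts.
watch cat /proc/mdstat
```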
Due to the age of
the system and the fact that the motherboard SATA controller now has a
strike against it, my plan upon recovery is to immediately back up the
array and replace the entire system. So if the former would work on a
short-term basis, I'd be willing to try it.
Replacing the system is less important than replacing the drives. If at
all possible, move to raid6.
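If the replacement system keeps md, the raid5-to-raid6 move can be done online with a fifth drive; a sketch, with the device names and backup-file path as placeholders:

```shell
# Add the new drive as a spare, then reshape 4-drive raid5 -> 5-drive raid6.
mdadm --add /dev/md0 /dev/sdf1
mdadm --grow /dev/md0 --level=6 --raid-devices=5 \
      --backup-file=/root/md0-grow.backup
```

The backup file must live on a device outside the array, since it protects the critical section of the reshape if power fails mid-way.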
Thanks again,
JRF
You're welcome.
Phil