Re: Raid 6 Fail Event

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Sun, 16 Nov 2014 12:52:02 -0700

On Nov 16, 2014, at 8:39 AM, Justin Stephenson <justin@xxxxxxxxxxxxxxxxx> wrote:

> Hello,
> 
> I am new to MDADM and have just experienced my first device fail on my raid 6.
> 
> I am wondering if someone might be able to help by outlining a proper protocol for troubleshooting and rebuilding this array (proc/mdstat below).
> 
> Here is how I might approach it:
> 
> - remove the device
> - test the device
> - if the device tests OK then re add the device
> - if the device fails, then replace the device
> - resync
> 
> Thank-you for your consideration.
> 
> Best,
> 
> - Justin
> 
> Here is the mdstat email
> 
> -----------------
> 
> This is an automatically generated mail message from mdadm
> running on BigBlue
> 
> A Fail event had been detected on md device /dev/md0.
> 
> It could be related to component device /dev/sdh1.

The first step is to get your backup current.

Second, you can test the device without removing it from the array:

# smartctl -x /dev/sdh

Then look in dmesg for errors related to the drive’s ata designation. You should be able to get the serial number from the smartctl output and search for it with dmesg | grep <serial#> to find its ata designation (port and device number). After that, dmesg | grep ataX.YY will show any read/write error events that explain what’s going on.
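
As a rough sketch of that lookup, the small shell functions below filter text on stdin. The names smart_serial, ata_ids and ata_events are made up for illustration, and the search term in the usage comment is whatever smart_serial returns on your system. Whether the serial number shows up in the kernel log at all depends on the kernel and driver; if it does not, the model string from the smartctl output may be a workable search term instead (an assumption, not something checked on this system).

```shell
# Hypothetical helpers; they only read text on stdin and change nothing.

# Print the serial number from `smartctl -x` (or -i) output.
smart_serial() {
    sed -n 's/^Serial Number:[[:space:]]*//p'
}

# Print the unique ataX.YY designations found on kernel-log lines
# that mention the search term given as $1 (serial or model string).
ata_ids() {
    grep -F -- "$1" | grep -oE 'ata[0-9]+\.[0-9]+' | sort -u
}

# Print the kernel-log lines that mention one designation, e.g. ata8.00.
ata_events() {
    grep -F -- "$1"
}

# Usage on a live system (smartctl typically needs root):
#   serial=$(smartctl -x /dev/sdh | smart_serial)
#   dmesg | ata_ids "$serial"
#   dmesg | ata_events ata8.00
```

Keeping the helpers as stdin filters means they can also be run against a saved copy of the smartctl or dmesg output.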

While you’re at it, the output of the following would be helpful as well:

# smartctl -l scterc /dev/sdh
# cat /sys/block/sdh/device/state
# cat /sys/block/sdh/device/timeout

These commands are read-only: they report state without changing it, so they are safe to run.
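
One reason these values are commonly looked at together, which this message does not spell out, is to compare the drive’s SCT ERC timeout with the kernel’s command timer. smartctl reports ERC in tenths of a second, while /sys/block/sdh/device/timeout is in seconds. The sketch below assumes the usual smartctl layout with a line of the form "Read: 70 (7.0 seconds)"; the function names erc_read_ds and check_timeouts are made up for illustration.

```shell
# Hypothetical helpers; they only interpret values and change nothing.

# Print the ERC read timeout (in tenths of a second) from
# `smartctl -l scterc` output on stdin. Prints nothing when ERC is
# disabled or the command is not supported.
erc_read_ds() {
    awk '$1 == "Read:" && $2 ~ /^[0-9]+$/ { print $2; exit }'
}

# Compare the drive ERC value with the kernel command timeout.
# $1 = ERC read value in tenths of a second (may be empty)
# $2 = kernel timeout in seconds
check_timeouts() {
    if [ -z "$1" ]; then
        echo "ERC disabled or unsupported; kernel timeout is ${2}s"
    elif [ "$1" -lt $(( $2 * 10 )) ]; then
        echo "ok: drive gives up before the kernel does"
    else
        echo "warning: kernel timeout is not longer than drive ERC"
    fi
}

# Usage on a live system (smartctl typically needs root):
#   erc=$(smartctl -l scterc /dev/sdh | erc_read_ds)
#   check_timeouts "$erc" "$(cat /sys/block/sdh/device/timeout)"
```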

Chris Murphy