Re[2]: Raid 6 Fail Event

"Justin Stephenson" <justin@xxxxxxxxxxxxxxxxx> · Mon, 17 Nov 2014 01:34:38 +0000

Thank-you, Chris. I appreciate your help with this.

Backups are good. I'm a regular disk to disk to LTO guy. Here is what I 
have turned up:

================================
# smartctl -x /dev/sdh

big long list of stuff. I found the serial.

I also tried smartctl -H /dev/sdh and received

Overall-health self-assessment test result: PASSED

184 End-to-End_Error {flag value worst thresh} Old_age FAILING_NOW_6

I did not find anything for the serial in results from dmesg

# smartctl -l scterc /dev/sdh

Warning: device does not support SCT Commands

# cat /sys/block/sdh/device/state

Running

# cat /sys/block/sdh/device/timeout

30

================================

Should I replace the drive, or re-add it and resync?

I also went through and reseated all the SATA and power connections as I 
understand these can cause issues as well.

Best,

- J

------ Original Message ------
From: "Chris Murphy" <lists@xxxxxxxxxxxxxxxxx>
To: "Justin Stephenson" <justin@xxxxxxxxxxxxxxxxx>
Cc: linux-raid@xxxxxxxxxxxxxxx
Sent: 16/11/2014 2:52:02 PM
Subject: Re: Raid 6 Fail Event

On Nov 16, 2014, at 8:39 AM, Justin Stephenson 
<justin@xxxxxxxxxxxxxxxxx> wrote:

 Hello,

 I am new to MDADM and have just experienced my first device fail on 
my raid 6.

 I am wondering if someone might be able to help by outlining a proper 
protocol for troubleshooting and rebuilding this array (proc/mdstat 
below).

 Here is how I might approach it:

 - remove the device
 - test the device
 - if the device tests OK then re-add the device
 - if the device fails, then replace the device
 - resync
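
For reference, the steps listed above map roughly onto the mdadm and smartctl invocations below. This is only a sketch: plan_rebuild is a made-up helper name, it prints the commands rather than running them, and the device names are the ones from this thread. How much work --re-add saves depends on the array (for instance, whether it has a write-intent bitmap); --add on a fresh disk means a full rebuild.

```shell
#!/bin/sh
# Dry-run sketch: prints the command for each step and runs nothing.
# plan_rebuild is a hypothetical helper, not an mdadm feature.
# Usage: plan_rebuild ARRAY MEMBER VERDICT [REPLACEMENT]
#   VERDICT is "ok" (disk tested fine) or anything else (replace the disk).
plan_rebuild() {
    array=$1; member=$2; verdict=$3; replacement=${4:-}
    disk=${member%[0-9]}   # /dev/sdh1 -> /dev/sdh (single-digit partitions only)

    # 1. remove the member that md has already marked as failed
    echo "mdadm $array --remove $member"
    # 2. test the disk: start a long self-test, read the log once it is done
    echo "smartctl -t long $disk"
    echo "smartctl -l selftest $disk"
    # 3./4. put the same member back, or add a replacement partition
    if [ "$verdict" = "ok" ]; then
        echo "mdadm $array --re-add $member"
    else
        echo "mdadm $array --add ${replacement:-REPLACEMENT_PARTITION}"
    fi
    # 5. watch the resync progress
    echo "cat /proc/mdstat"
}

plan_rebuild /dev/md0 /dev/sdh1 ok
```

Printing instead of executing is deliberate here: each command can be reviewed against the real device names before anything touches the array.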

 Thank-you for your consideration.

 Best,

 - Justin

 Here is the mdstat email

 -----------------

 This is an automatically generated mail message from mdadm
 running on BigBlue

 A Fail event had been detected on md device /dev/md0.

 It could be related to component device /dev/sdh1.

First step is getting the backup current.

Second you can do this without removing the device:

# smartctl -x /dev/sdh

And then look in dmesg for errors related to its ata designation. You 
should be able to get a serial number from the smartctl output and can 
search that with dmesg | grep <serial#> to find out what its ata 
designation (port and device number) is, then you can dmesg | grep 
ataX.YY to get any read/write error events that explain what’s going 
on.
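
If the serial number does not turn up in dmesg, sysfs may be another place to find the port. The sketch below assumes the kernel includes the libata port (ataN) as a component of the resolved /sys/block/sdh path, which is not guaranteed for every kernel or controller; ata_port_from_path is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: pull the ataN component out of a resolved sysfs path.
# ata_port_from_path is a made-up helper, not a standard tool.
ata_port_from_path() {
    # split the path on "/" and keep the first component that looks like ataN
    printf '%s\n' "$1" | tr '/' '\n' | grep -E '^ata[0-9]+$' | head -n 1
}

# On a live system (assumption: the path really contains an ataN component):
#   port=$(ata_port_from_path "$(readlink -f /sys/block/sdh)")
#   dmesg | grep -E "${port}(\.[0-9]+)?:"
```

The final grep pattern matches both port-level lines (ataN:) and device-level lines (ataN.YY:), so link resets and read/write errors both show up.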

While you’re at it the following would be helpful as well:

# smartctl -l scterc /dev/sdh
# cat /sys/block/sdh/device/state
# cat /sys/block/sdh/device/timeout

These are read-only commands that only report state. They don’t change 
anything, so they are safe to run.
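
As a small convenience, the two sysfs reads can be wrapped so that a missing attribute is reported instead of producing an error. This is a sketch only: show_disk_sysfs is a hypothetical helper, and its optional second argument (a sysfs root, defaulting to /sys) exists purely so the function can be tried against a copy of the tree.

```shell
#!/bin/sh
# Sketch: print state and timeout for one disk without changing anything.
# show_disk_sysfs is a made-up helper name, not a standard tool.
show_disk_sysfs() {
    dev=$1; root=${2:-/sys}
    for attr in state timeout; do
        f="$root/block/$dev/device/$attr"
        if [ -r "$f" ]; then
            printf '%s %s: %s\n' "$dev" "$attr" "$(cat "$f")"
        else
            printf '%s %s: unavailable\n' "$dev" "$attr"
        fi
    done
}

# Usage for the device from this thread (the smartctl call stays separate):
#   show_disk_sysfs sdh
#   smartctl -l scterc /dev/sdh
```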

Chris Murphy
