Thank you, Chris. I appreciate your help with this.
Backups are good. I'm a regular disk-to-disk-to-LTO guy. Here is what I
have turned up:
================================
# smartctl -x /dev/sdh
This printed a long list of output, in which I found the serial number.
I also tried smartctl -H /dev/sdh and received:
Overall-health self-assessment test result: PASSED
184 End-to-End_Error {flag value worst thresh} Old_age FAILING_NOW_6
I did not find anything for the serial number in the output of dmesg.
# smartctl -l scterc /dev/sdh
Warning: device does not support SCT Commands
# cat /sys/block/sdh/device/state
Running
# cat /sys/block/sdh/device/timeout
30
================================
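The FAILING_NOW entry above comes from the WHEN_FAILED column of the SMART attribute table. As a minimal sketch, assuming the ten-column table layout that smartctl -A prints, the filter below lists only the attributes whose WHEN_FAILED column is not "-". The name failing_attrs is a hypothetical helper, not part of smartmontools.

```shell
# Hypothetical helper: list SMART attributes that are flagged as failed.
# Reads `smartctl -A` style text on stdin and assumes the 10-column layout:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
failing_attrs() {
    # Keep rows that start with a numeric ID and whose 9th column is not "-".
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $1, $2, $9, $10 }'
}

# Possible use on real hardware (as root):
#   smartctl -A /dev/sdh | failing_attrs
```

The filter only reads text, so it can be tried on saved smartctl output first.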
Should I replace the drive, or re-add it and resync?
I also went through and reseated all the SATA and power connections, as I
understand these can cause issues as well.
Best,
- J
------ Original Message ------
From: "Chris Murphy" <lists@xxxxxxxxxxxxxxxxx>
To: "Justin Stephenson" <justin@xxxxxxxxxxxxxxxxx>
Cc: linux-raid@xxxxxxxxxxxxxxx
Sent: 16/11/2014 2:52:02 PM
Subject: Re: Raid 6 Fail Event
On Nov 16, 2014, at 8:39 AM, Justin Stephenson
<justin@xxxxxxxxxxxxxxxxx> wrote:
Hello,
I am new to MDADM and have just experienced my first device fail on
my raid 6.
I am wondering if someone might be able to help by outlining a proper
protocol for troubleshooting and rebuilding this array (proc/mdstat
below).
Here is how I might approach it:
- remove the device
- test the device
- if the device tests OK then re-add the device
- if the device fails, then replace the device
- resync
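The steps above can be sketched as a dry run that only prints the commands rather than executing them. This is an illustration under stated assumptions, not a recommended procedure: print_plan and its four arguments (array, member partition, whole disk, replacement partition) are hypothetical, and only one of the --re-add and --add lines would apply, depending on how the self-test turns out.

```shell
# Hypothetical dry-run helper: prints the commands for the plan, runs nothing.
# $1 = md array, $2 = failed member, $3 = its whole disk, $4 = replacement member
print_plan() {
    md="$1"; member="$2"; disk="$3"; spare="$4"
    # Order: remove, start long self-test, read self-test log,
    # re-add if the test passed, add a replacement if it failed,
    # then watch the rebuild progress.
    cat <<EOF
mdadm $md --fail $member --remove $member
smartctl -t long $disk
smartctl -l selftest $disk
mdadm $md --re-add $member
mdadm $md --add $spare
cat /proc/mdstat
EOF
}

# Example: print_plan /dev/md0 /dev/sdh1 /dev/sdh /dev/sdX1
```

Printing first makes it easy to review each line before running anything against the array.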
Thank-you for your consideration.
Best,
- Justin
Here is the mdstat email
-----------------
This is an automatically generated mail message from mdadm
running on BigBlue
A Fail event had been detected on md device /dev/md0.
It could be related to component device /dev/sdh1.
First step is getting the backup current.
Second you can do this without removing the device:
# smartctl -x /dev/sdh
And then look in dmesg for errors related to its ata designation. You
should be able to get a serial number from the smartctl output and can
search that with dmesg | grep <serial#> to find out what its ata
designation (port and device number) is, then you can dmesg | grep
ataX.YY to get any read/write error events that explain what’s going
on.
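The two-step lookup described above can be sketched as a small filter over dmesg text. It assumes the kernel log contains a line that pairs the search string with an ataX.YY prefix. If the serial number does not show up, the drive's model string may work as the pattern instead, since the libata identify line usually includes the model. The name ata_ids_for is a hypothetical helper.

```shell
# Hypothetical helper: print the ataX.YY designations found on dmesg lines
# that contain the given fixed string (serial number or model string).
# Reads kernel log text on stdin.
ata_ids_for() {
    grep -F -- "$1" | grep -o 'ata[0-9][0-9]*\.[0-9][0-9]*' | sort -u
}

# Possible use, with <serial#> and ataX.YY filled in by hand:
#   dmesg | ata_ids_for '<serial#>'
#   dmesg | grep 'ataX.YY'
```

The second grep is then the search for read/write error events on that port.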
While you’re at it the following would be helpful as well:
# smartctl -l scterc /dev/sdh
# cat /sys/block/sdh/device/state
# cat /sys/block/sdh/device/timeout
These are read-only commands that only report state; they don’t change
anything, so they are safe to run.
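The two sysfs reads above can be repeated for several drives at once. The sketch below assumes the same /sys/block/<dev>/device/state and timeout files. dev_report is a hypothetical helper, and SYSFS_ROOT is a hypothetical override that exists only so the function can be tried against a fake directory tree.

```shell
# Hypothetical helper: print the state and command timeout of each named
# block device. It only reads files; nothing is written.
dev_report() {
    root="${SYSFS_ROOT:-/sys}"   # override is an assumption, for dry testing
    for dev in "$@"; do
        dir="$root/block/$dev/device"
        # Fall back to "unknown" if a file is missing or unreadable.
        state=$(cat "$dir/state" 2>/dev/null || echo unknown)
        timeout=$(cat "$dir/timeout" 2>/dev/null || echo unknown)
        printf '%s state=%s timeout=%s\n' "$dev" "$state" "$timeout"
    done
}

# Example: dev_report sdh
```

Listing every member of the array side by side makes it easy to spot a drive whose values differ from the rest.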
Chris Murphy