Re[2]: Raid 6 Fail Event

Thank-you, Chris. I appreciate your help with this.

Backups are good. I'm a regular disk-to-disk-to-LTO guy. Here is what I have turned up:

================================
# smartctl -x /dev/sdh

big long list of stuff. I found the serial.

I also tried smartctl -H /dev/sdh and received

Overall-health self-assessment test result: PASSED

184 End-to-End_Error {flag value worst thresh} Old_age FAILING_NOW_6

I did not find anything for the serial in the dmesg output.

# smartctl -l scterc /dev/sdh

Warning: device does not support SCT Commands

# cat /sys/block/sdh/device/state

Running

# cat /sys/block/sdh/device/timeout

30

================================
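The attribute 184 line above is the kind of entry that is easy to miss in a long smartctl listing. As a minimal sketch, assuming the classic ten-column attribute table that `smartctl -A` prints (ID#, name, flag, value, worst, thresh, type, updated, WHEN_FAILED, raw value), a small filter can show only the attributes that are failing now or have failed in the past. The function name `failing_attrs` is made up for this illustration.

```shell
# Print SMART attributes whose WHEN_FAILED column is not "-".
# Reads `smartctl -A` style output on stdin.
# Assumption: the classic 10-column attribute table (WHEN_FAILED is column 9).
failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $1, $2, $9, $10 }'
}

# Usage on a real system (needs root and smartmontools):
#   smartctl -A /dev/sdh | failing_attrs
```

An empty result only means that no attribute is flagged in that table; it says nothing about cabling or controller problems.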

Should I replace the drive, or re-add it and resync?

I also went through and reseated all the SATA and power connections, as I understand these can cause issues as well.

Best,

- J


------ Original Message ------
From: "Chris Murphy" <lists@xxxxxxxxxxxxxxxxx>
To: "Justin Stephenson" <justin@xxxxxxxxxxxxxxxxx>
Cc: linux-raid@xxxxxxxxxxxxxxx
Sent: 16/11/2014 2:52:02 PM
Subject: Re: Raid 6 Fail Event


On Nov 16, 2014, at 8:39 AM, Justin Stephenson <justin@xxxxxxxxxxxxxxxxx> wrote:

 Hello,

I am new to MDADM and have just experienced my first device fail on my raid 6.

I am wondering if someone might be able to help by outlining a proper protocol for troubleshooting and rebuilding this array (proc/mdstat below).

 Here is how I might approach it:

 - remove the device
 - test the device
 - if the device tests OK then re-add the device
 - if the device fails, then replace the device
 - resync
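The steps above can be sketched as a dry run that only prints the commands, one group per step, so nothing touches the array. This is an illustration under assumptions, not a recommended procedure: `plan_rebuild` is a hypothetical helper, /dev/md0, /dev/sdh1 and /dev/sdh are taken from this thread, and /dev/sdX1 is a placeholder for a replacement partition that does not exist yet.

```shell
# Dry run: print the mdadm/smartctl commands for each step; nothing is executed.
# Arguments: array, member partition, whole disk, replacement partition.
plan_rebuild() {
    array=$1; part=$2; disk=$3; new_part=$4
    echo "# remove the device"
    echo "mdadm $array --fail $part --remove $part"
    echo "# test the device (extended self-test, then read the result)"
    echo "smartctl -t long $disk"
    echo "smartctl -l selftest $disk"
    echo "# if the device tests OK: re-add it"
    echo "mdadm $array --re-add $part"
    echo "# if the device fails: add the replacement instead"
    echo "mdadm $array --add $new_part"
    echo "# watch the resync"
    echo "cat /proc/mdstat"
}

# /dev/sdX1 is a placeholder, not a real device name.
plan_rebuild /dev/md0 /dev/sdh1 /dev/sdh /dev/sdX1
```

Printing the plan first makes it easy to review the device names before running anything by hand.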

 Thank-you for your consideration.

 Best,

 - Justin

 Here is the mdstat email

 -----------------

 This is an automatically generated mail message from mdadm
 running on BigBlue

 A Fail event had been detected on md device /dev/md0.

 It could be related to component device /dev/sdh1.

First step is getting the backup current.

Second you can do this without removing the device:

# smartctl -x /dev/sdh

Then look in dmesg for errors related to its ATA designation. You should be able to get a serial number from the smartctl output and search for it with dmesg | grep <serial#> to find its ATA designation (port and device number); then run dmesg | grep ataX.YY to get any read/write error events that explain what’s going on.
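If the serial number does not show up in the kernel log, a different route to the same ataX name may help. The sketch below swaps the serial search for a sysfs lookup: it assumes the resolved path of /sys/block/sdh contains an `ataN` component, which is common on libata systems but not guaranteed. The helper names `ata_port_from_path` and `ata_log_lines` are invented for this example.

```shell
# Extract the "ataN" component from a resolved sysfs path.
# Assumption: the path contains a directory named like "ata8".
ata_port_from_path() {
    printf '%s\n' "$1" | sed -n 's#.*/\(ata[0-9][0-9]*\)/.*#\1#p'
}

# Keep only log lines for that port, e.g. "ata8:" or "ata8.00:".
# The trailing [.:] stops "ata8" from also matching "ata80".
ata_log_lines() {
    grep -E "(^|[^[:alnum:]])$1[.:]"
}

# Usage on a real system:
#   port=$(ata_port_from_path "$(readlink -f /sys/block/sdh)")
#   dmesg | ata_log_lines "$port"
```

Both helpers only read; the first prints nothing if no `ataN` component is found, in which case the serial search is still the fallback.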

While you’re at it the following would be helpful as well:

# smartctl -l scterc /dev/sdh
# cat /sys/block/sdh/device/state
# cat /sys/block/sdh/device/timeout

These are read-only commands that report state; they don’t change anything, so they are safe to run.
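To collect the same two sysfs values for every disk rather than just sdh, a loop along these lines could be used. It is a sketch that only reads files; `disk_report` is a hypothetical name, and the optional argument for an alternate sysfs root exists only so the function can be tried against a scratch directory.

```shell
# Print "name state timeout" for every sd* disk under a sysfs root.
# Only reads files; nothing is written. Default root is /sys.
disk_report() {
    root=${1:-/sys}
    for dev in "$root"/block/sd*; do
        # Skip entries without a readable timeout (also covers "no match").
        [ -r "$dev/device/timeout" ] || continue
        printf '%s %s %s\n' "${dev##*/}" \
            "$(cat "$dev/device/state")" "$(cat "$dev/device/timeout")"
    done
}

disk_report
```

Running smartctl -l scterc on each listed disk (as root) would complete the picture, since that value comes from the drive and not from sysfs.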

Chris Murphy