Re: How do I tell which disk failed?

Ross Boylan <ross@xxxxxxxxxxxxxxxx> · Mon, 07 Jan 2013 23:49:46 -0800

On Tue, 2013-01-08 at 00:17 -0700, Chris Murphy wrote:
> On Jan 7, 2013, at 11:59 PM, Ross Boylan <ross@xxxxxxxxxxxxxxxx> wrote:
> >> 
> > Isn't it possible there's a hardware problem, e.g., leading to a
> > failure/retry cycle?
> 
> smartctl -a /dev/sda
> smartctl -a /dev/sdb
> smartctl -a /dev/sdc
> 
> Compare them. If there was a write failure reported by the drive, md would have marked the device faulty.
SMART seems to think they are all OK, though my understanding of it is
limited (e.g., the logs showed SMART reporting Temperature_Celsius of
110, but I think that's a normalized value for a raw of 42, meaning the
temp is 42 degrees Celsius).  Do I need to manually run a test before
the report reflects current conditions?  At any rate, I did (just a
short one), and the drives passed.

The raw value (last column) for one of the parameters seems to be
changing extremely rapidly, and perhaps is overflowing:
# date; smartctl -a /dev/sda | grep 195
Mon Jan  7 23:11:03 PST 2013
195 Hardware_ECC_Recovered  0x001a   059   024   000    Old_age   Always       -       241377818
# date; smartctl -a /dev/sda | grep 195
Mon Jan  7 23:12:26 PST 2013
195 Hardware_ECC_Recovered  0x001a   056   024   000    Old_age   Always       -       3600778
Perhaps someone on this list can interpret that better than I.
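
For anyone wanting to watch that attribute more carefully, here is a
minimal sketch. It assumes the usual ten-column attribute table that
smartctl printed above; smart_attr is a hypothetical helper name, not
part of smartmontools. It matches on the ID column exactly, whereas
"grep 195" would also match any other line that merely contains 195.
Raw values are vendor-defined, so a large or non-monotonic number is
not by itself evidence of a fault.

```shell
# Hypothetical helper (not part of smartmontools): reads `smartctl -A`
# output on stdin and prints the attribute name plus the normalized
# VALUE, WORST, THRESH and RAW_VALUE columns for one attribute ID.
# Assumes the standard 10-column table layout shown in this message.
smart_attr() {
    awk -v id="$1" '$1 == id {
        print $2, "value=" $4, "worst=" $5, "thresh=" $6, "raw=" $10
    }'
}

# Usage (needs root and smartmontools), sampling once a minute:
#   while sleep 60; do date; smartctl -A /dev/sda | smart_attr 195; done
```

Comparing the same attribute across sda, sdb and sdc is probably more
informative than the absolute number on any one drive.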

My thought was disk failure (not necessarily complete failure) -> system
lockup.  Continued disk flakiness leads to continued slowness after
restart as, e.g., the disk keeps retrying operations that fail.

I infer you have a different scenario in mind: the system freaks out for
a reason unrelated to the disk.  The resulting shutdown (which was a
manual power off) leaves the arrays and their components in a funky
state.  When the system comes back, it fixes things up.

Even if this did happen, in RAID 1 wouldn't some of the components
(partitions in my case) be deemed good and others bad, with the latter
resynced to match the former?  And if that is happening, why can't I
tell which partition(s) are master (considered good) and which are not
(being overwritten with contents of the master)?
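
One way this might be checked, sketched under assumptions: md keeps an
event counter in each member's superblock, and a member whose counter
is lower than the others is presumably the stale one.  The helper name
md_events and the device names below are hypothetical, and the sketch
assumes "mdadm --examine" prints a line of the form "Events : <number>"
(older metadata versions may format it differently).

```shell
# Hypothetical helper: reads `mdadm --examine` output on stdin and
# prints the value from the first "Events : <number>" line.
# Assumption: the line has that form; adjust for other metadata formats.
md_events() {
    awk -F' : ' '/^[[:space:]]*Events[[:space:]]*:/ { print $2; exit }'
}

# Usage (needs root; /dev/sda1, /dev/sdb1 and /dev/md0 are placeholders):
#   for part in /dev/sda1 /dev/sdb1; do
#       printf '%s %s\n' "$part" "$(mdadm --examine "$part" | md_events)"
#   done
#   cat /proc/mdstat          # member status and resync/recovery progress
#   mdadm --detail /dev/md0   # per-member state as md currently sees it
```

If all members report the same event count, md may be treating them
all as current, which would be consistent with no partition being
visibly singled out as the master.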

The sync just completed, so I can no longer poke around while the
rebuild is in progress.  Bad for learning and diagnosis, but good for
almost every other purpose.

Ross
