RE: Device naming and raid1

"David Lethe" <david@xxxxxxxxxxxx> · Wed, 27 Aug 2008 06:11:07 -0500



> -----Original Message-----
> From: linux-raid-owner@xxxxxxxxxxxxxxx
> [mailto:linux-raid-owner@xxxxxxxxxxxxxxx] On Behalf Of Tony Coffman
> Sent: Tuesday, August 26, 2008 10:33 AM
> To: linux-raid@xxxxxxxxxxxxxxx
> Subject: Device naming and raid1
> 
> I have a CentOS 5 box running a software RAID-1 set on a pair of SATA
> drives.
> 
> The SATA controller or driver has a flaw.
> Every 150 days or so, one of the two drives will experience errors and
> fail.
> 
> Subsequent tests always show the drive and cable to be ok.  We bought a
> couple of replacement drives before we figured that out :-(
> 
> On the last event this weekend, I went searching for a way to get the
> raid back online with no host downtime.  I found the technique that
> deletes the drive and then brings it back online with a bus scan using
> the /sys filesystem delete and rescan entities.
> 
> I didn't realize that you could also perform a rescan on a single LUN.
> I'll have to use that next time.
> 
> My question - since I've done a delete/rescan bus operation, my device
> name and major,minor numbers have changed.
> 
> Original
> [0:0:0:0]    disk    ATA      ST3250410AS      3.AA  /dev/sdc
> 
> Current
> [0:0:0:0]    disk    ATA      ST3250410AS      3.AA  /dev/sdc
> 
> If I re-add the device to the raid set using the new device name, will
> it cause any problems on the next boot?
> 
> The drive appears to be fine.  I can read all blocks with no errors.
> Partition table looks ok, etc..
> 
> In the future if I rescan just the single LUN, I'm pretty sure I won't
> run into this again, but I'd like to avoid an outage on this event if
> possible.
> 
> Thanks and regards,
> --Tony
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.htm
Don't be too quick to declare the drive(s) good or, for that matter, to
make any assumptions about what is bad or good. (Well, OK, let's assume
the monitor is good.)  If the drives are reporting errors and then
failing, why not trap the error messages and run some diagnostics while
the drives are still in that failed state?  The error messages tell you
what the errors are.  Make yourself a bootable CD-ROM or USB stick, and
the next time the drives lock up and/or start spitting out errors,
capture everything.  Then boot from the external device (do NOT cycle
power) and run one of the many available diagnostics to confirm or
eliminate the disks.
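
As a rough illustration of "capture everything", something along these
lines could be kept on the box and run the moment a drive drops out.
It is only a sketch: the command list, the /dev/md0 array name and the
output directory are assumptions (only /dev/sdc comes from your
message), and smartctl and mdadm have to be installed for those entries
to produce anything useful.

```python
import subprocess
from pathlib import Path

# Hypothetical command set: /dev/md0 is an assumed array name, and the
# smartctl/mdadm entries only work if those tools are installed.
DEFAULT_COMMANDS = {
    "dmesg": ["dmesg"],
    "mdstat": ["cat", "/proc/mdstat"],
    "smart_sdc": ["smartctl", "-a", "/dev/sdc"],
    "mdadm_detail": ["mdadm", "--detail", "/dev/md0"],
}


def capture(commands, out_dir):
    """Run each command and save its output to out_dir/<name>.txt.

    Returns a dict mapping each name to the command's exit code, or to
    None if the command could not be started or timed out.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for name, argv in commands.items():
        try:
            proc = subprocess.run(
                argv,
                capture_output=True,
                text=True,
                errors="replace",  # kernel logs may contain odd bytes
                timeout=60,        # a hung drive should not hang the capture
            )
            body = proc.stdout + proc.stderr
            results[name] = proc.returncode
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure instead of aborting the whole capture.
            body = f"could not run {argv!r}: {exc}\n"
            results[name] = None
        (out / f"{name}.txt").write_text(body)
    return results
```

Calling capture(DEFAULT_COMMANDS, "/mnt/usb/raid-diag") as root would
leave one text file per command in that directory.  The path is made up;
pointing it at a USB stick rather than the degraded array is probably
the safer choice.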

