Re: 2 drive dropout (and raid 5), simultaneous, after 3 years

On Thu, 2004-12-09 at 11:22 -0600, Michael Stumpf wrote:
> Ahhhhhhh.. You're on to something here.  In all my years of ghetto raid 
> one of the weakest things I've seen is the Y-molex-power-splitters.  Do 
> you know where more solid ones can be found?  I'm to the point where I'd 
> pay $10 or more for the bloody things if they didn't blink the power 
> connection when moved a little bit.
> 
> I'll bet good money this is what happened.  Maybe I need to break out 
> the soldering iron, but that's kind of an ugly, proprietary, and slow 
> solution.

Well, that is usually overkill anyway ;-)  I've solved this problem in
the past by simply getting out a pair of thin needle-nose pliers and
crimping down on the connector's actual grip points.  Once I tightened
up the grip points on the Y connector, the problem went away.

> 
> 
> Guy wrote:
> 
> >Since they both went off line at the same time, check the power cables.  Do
> >they share a common power cable, or does each have a unique cable directly
> >from the power supply?
> >
> >Switch power connections with another drive to see if the problem stays with
> >the power connection.
> >
> >Guy
> >
> >-----Original Message-----
> >From: linux-raid-owner@xxxxxxxxxxxxxxx
> >[mailto:linux-raid-owner@xxxxxxxxxxxxxxx] On Behalf Of Michael Stumpf
> >Sent: Thursday, December 09, 2004 9:45 AM
> >To: Guy; linux-raid@xxxxxxxxxxxxxxx
> >Subject: Re: 2 drive dropout (and raid 5), simultaneous, after 3 years
> >
> >All I see is this:
> >
> >Apr 14 22:03:56 drown kernel: scsi: device set offline - not ready or 
> >command retry failed after host reset: host 1 channel 0 id 2 lun 0
> >Apr 14 22:03:56 drown kernel: scsi: device set offline - not ready or 
> >command retry failed after host reset: host 1 channel 0 id 3 lun 0
> >Apr 14 22:03:56 drown kernel: md: updating md1 RAID superblock on device
> >Apr 14 22:03:56 drown kernel: md: (skipping faulty sdj1 )
> >Apr 14 22:03:56 drown kernel: md: (skipping faulty sdi1 )
> >Apr 14 22:03:56 drown kernel: md: sdh1 [events: 000000b5]<6>(write) 
> >sdh1's sb offset: 117186944
> >Apr 14 22:03:56 drown kernel: md: sdg1 [events: 000000b5]<6>(write) 
> >sdg1's sb offset: 117186944
> >Apr 14 22:03:56 drown kernel: md: recovery thread got woken up ...
> >Apr 14 22:03:56 drown kernel: md: recovery thread finished ...
> >
> >What the heck could that be?  Can that possibly be related to the fact 
> >that there weren't proper block device nodes sitting in the filesystem?!
> >
> >I already ran WD's wonky tool to fix their "DMA timeout" problem, and 
> >one of the drives is a Maxtor.  They're on separate ATA cables, and I've 
> >got about 5 drives per power supply.  I checked heat, and it wasn't very 
> >high.
> >
> >Any other sources of information I could tap?  Maybe an "MD debug" 
> >setting in the kernel with a recompile?
> >
> >Guy wrote:
> >
> >>You should have some sort of md error in your logs.  Try this command:
> >>grep "md:" /var/log/messages*|more
> >>
> >>Yes, they don't play well together, so separate them!  :)
> >>
> >>Guy
> >>
> >>-----Original Message-----
> >>From: linux-raid-owner@xxxxxxxxxxxxxxx
> >>[mailto:linux-raid-owner@xxxxxxxxxxxxxxx] On Behalf Of Michael Stumpf
> >>Sent: Wednesday, December 08, 2004 11:46 PM
> >>To: linux-raid@xxxxxxxxxxxxxxx
> >>Subject: Re: 2 drive dropout (and raid 5), simultaneous, after 3 years
> >>
> >>No idea what failure is occurring.  Your dd test, run from beginning to
> >>end of each drive, completed fine.  Smartd had no info to report.
> >>
> >>The fdisk weirdness was operator error; the /dev/sd* block nodes were 
> >>missing (forgotten detail on age old upgrade).  Fixed with mknod.
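> >>
> >>(For anyone curious, recreating them is just the usual mknod calls with
> >>the standard SCSI-disk major/minor numbers -- sdi/sdj are shown here
> >>purely as an example sketch:
> >>
> >>mknod /dev/sdi1 b 8 129    # major 8 = sd driver, sdi = minors 128-143
> >>mknod /dev/sdj1 b 8 145    # sdj = minors 144-159, partition 1 = base+1
> >>
> >>i.e. each sd disk gets 16 minors under major 8, and partition N is the
> >>whole-disk minor plus N.)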
> >>
> >>So, I forced mdadm to assemble and it is reconstructing now.  
> >>Troublesome, though, that 2 drives fail at once like this.  I think I 
> >>should separate them to different raid-5s, just in case.
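> >>
> >>(Roughly, the forced re-assembly was something along these lines -- the
> >>member list below is only an illustrative sketch:
> >>
> >>mdadm --stop /dev/md1
> >>mdadm --assemble --force /dev/md1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1
> >>mdadm --detail /dev/md1        # watch the reconstruction state
> >>
> >>As I understand it, --force lets mdadm pull in members whose event
> >>counters no longer match the rest instead of refusing to start the
> >>array.)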
> >>
> >>
> >>
> >>Guy wrote:
> >>
> >>>What failure are you getting?  I assume a read error.  md will fail a
> >>>drive when it gets a read error from the drive.  It is "normal" to have
> >>>a read error once in a while, but more than 1 a year may indicate a
> >>>drive going bad.
> >>>
> >>>I test my drives with this command:
> >>>dd if=/dev/hdi of=/dev/null bs=64k
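> >>>
> >>>(The same test over a whole set of drives is just a shell loop; the
> >>>device range here is only an example:
> >>>
> >>>for d in /dev/hd[e-l]; do
> >>>    echo "=== $d ==="
> >>>    dd if=$d of=/dev/null bs=64k || echo "$d: read test FAILED"
> >>>done
> >>>
> >>>dd exits non-zero on a hard read error, so a bad sector shows up right
> >>>away instead of waiting for md to kick the drive.)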
> >>>
> >>>You may look into using "smartd".  It monitors and tests disks for
> >>>problems.
> >>>However, my dd test finds them first.  smartd has never told me anything
> >>>useful, but my drives are old, and are not smart enough for smartd.
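> >>>
> >>>(For drives new enough to speak SMART, a quick manual check is along
> >>>the lines of -- device name again only an example:
> >>>
> >>>smartctl -a /dev/hdi          # dump SMART attributes and error log
> >>>smartctl -t short /dev/hdi    # start the drive's own short self-test
> >>>
> >>>smartd mostly automates that same polling in the background.)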
> >>>
> >>>Guy
> >>>
> >>>-----Original Message-----
> >>>From: linux-raid-owner@xxxxxxxxxxxxxxx
> >>>[mailto:linux-raid-owner@xxxxxxxxxxxxxxx] On Behalf Of Michael Stumpf
> >>>Sent: Wednesday, December 08, 2004 4:03 PM
> >>>To: linux-raid@xxxxxxxxxxxxxxx
> >>>Subject: 2 drive dropout (and raid 5), simultaneous, after 3 years
> >>>
> >>>
> >>>I've got an LVM cobbled together of 2 RAID-5 md's.  For the longest 
> >>>time I was running with 3 Promise cards and surviving everything 
> >>>including the occasional drive failure, then suddenly I had double drive 
> >>>dropouts and the array would go into a degraded state.
> >>>
> >>>10 drives in the system, Linux 2.4.22, Slackware 9, mdadm v1.2.0 (13 mar 
> >>>2003)
> >>>
> >>>I started to diagnose; fdisk -l /dev/hdi  returned nothing for the two 
> >>>failed drives, but "dmesg" reports that the drives are happy, and that 
> >>>the md would have been automounted if not for a mismatch on the event 
> >>>counters (of the 2 failed drives).
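> >>>
> >>>(The counters it is complaining about live in each member's md
> >>>superblock; something like the following shows them -- device name is
> >>>just an example:
> >>>
> >>>mdadm --examine /dev/hdi1 | grep -i events
> >>>
> >>>Members whose event count lags behind the others are the ones the
> >>>kernel refuses to auto-start into the array.)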
> >>>
> >>>I assumed that this had something to do with my semi-nonstandard 
> >>>application of a zillion (3) Promise cards in 1 system, but I never had 
> >>>this problem before.  I ripped out the Promise cards and stuck in 3ware 
> >>>5700s, cleaning it up a bit and also putting a single drive per ATA 
> >>>channel.  Two weeks later, the same problem crops up again.
> >>>
> >>>The "problematic" drives are even mixed; 1 is WD, 1 is Maxtor (both
> >>>120gig).
> >>>Is this a known bug in 2.4.22 or mdadm 1.2.0?  Suggestions?
> >>>
> >>>
-- 
  Doug Ledford <dledford@xxxxxxxxxx>
         Red Hat, Inc.
         1801 Varsity Dr.
         Raleigh, NC 27606

