Re: RE: RE: RAID5 Not coming back up after crash

I believe the two flaky ones are Maxtors, so chalk up another Maxtor
hater, I guess.  Not sure how old they are; I will have to double-check
their serials -- probably within warranty, I hope.

Thanks, I will let you know how this all goes.

bjz

----- Original Message -----
From: Guy <bugzilla@xxxxxxxxxxxxxxxx>
Date: Monday, November 29, 2004 4:29 pm
Subject: RE: RE: RAID5 Not coming back up after crash

> If you are sure you can overwrite the correct bad sectors, then do it.
> 
> mdadm is much better than raidtools.  From what I have read, yes it is
> compatible.
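> (One quick way to check, for what it's worth: mdadm can read the 0.90
> superblocks that raidtools writes, so something like
>     mdadm -E /dev/hda1
> should print that member's UUID, RAID level and event counter.  Just a
> sanity check I would run first, not something you strictly need.)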
> 
> The info below is not required.
> Who makes your 6 disk drives?  And how old are they?  Any bets, anyone?
> 
> Guy
> 
> -----Original Message-----
> From: BERNARD JOHN ZOLP [mailto:bjzolp@xxxxxxxxxxxxxxxxx] 
> Sent: Monday, November 29, 2004 3:57 PM
> To: Guy
> Cc: linux-raid@xxxxxxxxxxxxxxx
> Subject: Re: RE: RAID5 Not coming back up after crash
> 
> Just a few follow-up questions before I dive into this.  Will mdadm
> work with a RAID setup created with the older raidtools package that
> came with my SuSE installation?
> Assuming the drive with bad blocks is not getting worse -- I don't
> think it is, but you never know -- could I map them out by writing to
> those sectors with dd and then running the command to bring the array
> back online?  Or should I wait for the RMA of the flaky drive,
> dd_rescue to the new one, and bring that up?
> 
> Thanks again,
> bjz
> 
> ----- Original Message -----
> From: Guy <bugzilla@xxxxxxxxxxxxxxxx>
> Date: Monday, November 29, 2004 11:40 am
> Subject: RE: RAID5 Not coming back up after crash
> 
> > You can recover, but not with bad blocks.
> > 
> > This command should get your array back on-line:
> > mdadm -A /dev/md0 --force /dev/hda1 /dev/hdc1 /dev/hdd1 /dev/hdi1 /dev/hdj1
> > But, as soon as md reads a bad block it will fail the disk and your
> > array will be off-line.
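> > (If the forced assemble succeeds, something like
> >     cat /proc/mdstat
> >     mdadm -D /dev/md0
> > should show md0 running degraded with 5 of the 6 members -- just how I
> > would verify it before going any further.)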
> > 
> > If you have an extra disk, you could attempt to copy the disk first,
> > then replace the disk that has the read error with its copy.
> > 
> > dd_rescue can copy a disk with read errors.
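> > (For example, assuming the replacement turned up as /dev/hdk with a
> > partition hdk1 at least as large -- that name is only a guess:
> >     dd_rescue /dev/hdj1 /dev/hdk1
> > It copies what it can read and keeps going past the unreadable sectors.)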
> > 
> > Also, it is common for a disk to grow bad spots over time.  These bad
> > spots (sectors) can be re-mapped by the drive to a spare sector.  This
> > re-mapping will occur when an attempt is made to write to the bad
> > sector.  So, you can repair your disk by writing to the bad sectors.
> > But, be careful not to overwrite good data.  I have done this using
> > dd.  First I found the bad sector with dd, then I wrote to the 1 bad
> > sector with dd.  I would need to refer to the man page to do it again,
> > so I can't explain it here at this time.  It is not really hard, but 1
> > small mistake, and "that's it man, game over man, game over".
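> > (Roughly, and only as a sketch: if N is the 512-byte sector number you
> > have identified (badblocks reports 1024-byte blocks by default, so
> > convert), and you are sure nothing you need lives there:
> >     dd if=/dev/hdb1 of=/dev/null bs=512 skip=N count=1   # confirm it fails to read
> >     dd if=/dev/zero of=/dev/hdb1 bs=512 seek=N count=1   # overwrite that one sector
> > The write is what makes the drive remap the sector; whatever was in it
> > is gone.)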
> > 
> > Guy
> > 
> > 
> > -----Original Message-----
> > From: linux-raid-owner@xxxxxxxxxxxxxxx
> > [mailto:linux-raid-owner@xxxxxxxxxxxxxxx] On Behalf Of B. J. Zolp
> > Sent: Monday, November 29, 2004 11:33 AM
> > To: linux-raid@xxxxxxxxxxxxxxx
> > Subject: RAID5 Not coming back up after crash
> > 
> > I have a RAID5 setup on my fileserver using disks hda1 hdb1 hdc1 hdd1
> > hdi1 and hdj1.  Yesterday I started moving a large chunk of files
> > (~80GB) from this array to a stand-alone drive in the system, and about
> > halfway through the mv I got a ton of PERMISSION DENIED errors on some
> > of the remaining files to be moved, and the move process quit.  I did
> > an ls of the raid directory and got PERMISSION DENIED on the same files
> > that errored out on the mv, while some of the other files looked fine.
> > I figured it might be a good idea to take the raid down and back up
> > again (probably a mistake), and I could not reboot the machine without
> > physically turning it off as some processes were hung.  Upon booting
> > back up, the raid did not come online, stating that hdj1 was kicked due
> > to inconsistency.  Additionally, hdb1 is listed as offline too.  So I
> > have 2 drives that are not cooperating.  I have a hunch hdb1 might not
> > have been working for some time.
> > 
> > I found some info stating that if you mark the drive that failed first
> > as "failed-disk" and try a "mkraid --force --dangerous-no-resync
> > /dev/md0", then I might have some luck getting my files back.  From my
> > logs I can see that all the working drives have event counter 00000022,
> > hdj1 has event counter 00000021, and hdb1 has event counter 00000001.
> > Does this mean that hdb1 failed a long time ago, or are the two likely
> > within a few minutes of each other?  I just ran badblocks on both hdb1
> > and hdj1 and found 1 bad block on hdb1 and about 15 on hdj1; would that
> > be enough to cause my raid to get this out of whack?  In any case I
> > plan to replace those drives, but would the method above be the best
> > route, once I have copied the raw data to the new drives, in order to
> > bring my raid back up?
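> > (For what it's worth, my understanding of that route -- untested by
> > me -- is that you edit /etc/raidtab so the entry for the member being
> > left out reads something like
> >     device       /dev/hdb1
> >     failed-disk  1
> > and then run "mkraid --force --dangerous-no-resync /dev/md0", which
> > rewrites the superblocks without resyncing the data.)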
> > 
> > 
> > Thanks,
> > 
> > bjz
> > 
> > here is my log from when I run raidstart /dev/md0:
> > 
> > Nov 29 10:10:19 orion kernel:  [events: 00000022]
> > Nov 29 10:10:19 orion last message repeated 3 times
> > Nov 29 10:10:19 orion kernel:  [events: 00000021]
> > Nov 29 10:10:19 orion kernel: md: autorun ...
> > Nov 29 10:10:19 orion kernel: md: considering hdj1 ...
> > Nov 29 10:10:19 orion kernel: md:  adding hdj1 ...
> > Nov 29 10:10:19 orion kernel: md:  adding hdi1 ...
> > Nov 29 10:10:19 orion kernel: md:  adding hdd1 ...
> > Nov 29 10:10:19 orion kernel: md:  adding hdc1 ...
> > Nov 29 10:10:19 orion kernel: md:  adding hda1 ...
> > Nov 29 10:10:19 orion kernel: md: created md0
> > Nov 29 10:10:19 orion kernel: md: bind<hda1,1>
> > Nov 29 10:10:19 orion kernel: md: bind<hdc1,2>
> > Nov 29 10:10:19 orion kernel: md: bind<hdd1,3>
> > Nov 29 10:10:19 orion kernel: md: bind<hdi1,4>
> > Nov 29 10:10:19 orion kernel: md: bind<hdj1,5>
> > Nov 29 10:10:19 orion kernel: md: running: <hdj1><hdi1><hdd1><hdc1><hda1>
> > Nov 29 10:10:19 orion kernel: md: hdj1's event counter: 00000021
> > Nov 29 10:10:19 orion kernel: md: hdi1's event counter: 00000022
> > Nov 29 10:10:19 orion kernel: md: hdd1's event counter: 00000022
> > Nov 29 10:10:19 orion kernel: md: hdc1's event counter: 00000022
> > Nov 29 10:10:19 orion kernel: md: hda1's event counter: 00000022
> > Nov 29 10:10:19 orion kernel: md: superblock update time inconsistency -- using the most recent one
> > Nov 29 10:10:19 orion kernel: md: freshest: hdi1
> > Nov 29 10:10:19 orion kernel: md0: kicking faulty hdj1!
> > Nov 29 10:10:19 orion kernel: md: unbind<hdj1,4>
> > Nov 29 10:10:19 orion kernel: md: export_rdev(hdj1)
> > Nov 29 10:10:19 orion kernel: md: md0: raid array is not clean -- starting background reconstruction
> > Nov 29 10:10:19 orion kernel: md0: max total readahead window set to 2560k
> > Nov 29 10:10:19 orion kernel: md0: 5 data-disks, max readahead per data-disk: 512k
> > Nov 29 10:10:19 orion kernel: raid5: device hdi1 operational as raid disk 4
> > Nov 29 10:10:19 orion kernel: raid5: device hdd1 operational as raid disk 3
> > Nov 29 10:10:19 orion kernel: raid5: device hdc1 operational as raid disk 2
> > Nov 29 10:10:19 orion kernel: raid5: device hda1 operational as raid disk 0
> > Nov 29 10:10:19 orion kernel: raid5: not enough operational devices for md0 (2/6 failed)
> > Nov 29 10:10:19 orion kernel: RAID5 conf printout:
> > Nov 29 10:10:19 orion kernel:  --- rd:6 wd:4 fd:2
> > Nov 29 10:10:19 orion kernel:  disk 0, s:0, o:1, n:0 rd:0 us:1 dev:hda1
> > Nov 29 10:10:19 orion kernel:  disk 1, s:0, o:0, n:1 rd:1 us:1 dev:[dev 00:00]
> > Nov 29 10:10:19 orion kernel:  disk 2, s:0, o:1, n:2 rd:2 us:1 dev:hdc1
> > Nov 29 10:10:19 orion kernel:  disk 3, s:0, o:1, n:3 rd:3 us:1 dev:hdd1
> > Nov 29 10:10:19 orion kernel:  disk 4, s:0, o:1, n:4 rd:4 us:1 dev:hdi1
> > Nov 29 10:10:19 orion kernel:  disk 5, s:0, o:0, n:5 rd:5 us:1 dev:[dev 00:00]
> > Nov 29 10:10:19 orion kernel: raid5: failed to run raid set md0
> > Nov 29 10:10:19 orion kernel: md: pers->run() failed ...
> > Nov 29 10:10:19 orion kernel: md :do_md_run() returned -22
> > Nov 29 10:10:19 orion kernel: md: md0 stopped.
> > Nov 29 10:10:19 orion kernel: md: unbind<hdi1,3>
> > Nov 29 10:10:19 orion kernel: md: export_rdev(hdi1)
> > Nov 29 10:10:19 orion kernel: md: unbind<hdd1,2>
> > Nov 29 10:10:19 orion kernel: md: export_rdev(hdd1)
> > Nov 29 10:10:19 orion kernel: md: unbind<hdc1,1>
> > Nov 29 10:10:19 orion kernel: md: export_rdev(hdc1)
> > Nov 29 10:10:19 orion kernel: md: unbind<hda1,0>
> > Nov 29 10:10:19 orion kernel: md: export_rdev(hda1)
> > Nov 29 10:10:19 orion kernel: md: ... autorun DONE.
> 
> 