I believe the two flakey ones are Maxtors, so chalk up another Maxtor hater, I guess. Not sure how old they are; I will have to double-check their serials -- probably within warranty, I hope. Thanks, I will let you know how this all goes.

bjz

----- Original Message -----
From: Guy <bugzilla@xxxxxxxxxxxxxxxx>
Date: Monday, November 29, 2004 4:29 pm
Subject: RE: RE: RAID5 Not coming back up after crash

> If you are sure you can overwrite the correct bad sectors, then do it.
>
> mdadm is much better than raidtools. From what I have read, yes, it is
> compatible.
>
> The below info is not required.
> Who makes your 6 disk drives? And how old are they? Any bets, anyone?
>
> Guy
>
> -----Original Message-----
> From: BERNARD JOHN ZOLP [mailto:bjzolp@xxxxxxxxxxxxxxxxx]
> Sent: Monday, November 29, 2004 3:57 PM
> To: Guy
> Cc: linux-raid@xxxxxxxxxxxxxxx
> Subject: Re: RE: RAID5 Not coming back up after crash
>
> Just a few follow-up questions before I dive into this. Will mdadm work
> with a RAID setup created with the older raidtools package that came
> with my SuSE installation?
>
> Assuming the drive with bad blocks is not getting worse -- I don't think
> it is, but you never know -- could I map them out by writing to those
> sectors with dd and then running the command to bring the array back
> online? Or should I wait for the RMA of the flakey drive, dd_rescue to
> the new one, and bring that up?
>
> Thanks again,
> bjz
>
> ----- Original Message -----
> From: Guy <bugzilla@xxxxxxxxxxxxxxxx>
> Date: Monday, November 29, 2004 11:40 am
> Subject: RE: RAID5 Not coming back up after crash
>
> > You can recover, but not with bad blocks.
> >
> > This command should get your array back on-line:
> > mdadm -A /dev/md0 --force /dev/hda1 /dev/hdc1 /dev/hdd1 /dev/hdi1 /dev/hdj1
> > But as soon as md reads a bad block, it will fail the disk and your
> > array will be off-line.
> >
> > If you have an extra disk, you could attempt to copy the disk first,
> > then replace the disk with the read error with the copy.
> >
> > dd_rescue can copy a disk with read errors.
> >
> > Also, it is common for a disk to grow bad spots over time. These bad
> > spots (sectors) can be re-mapped by the drive to a spare sector. This
> > re-mapping will occur when an attempt is made to write to the bad
> > sector. So, you can repair your disk by writing to the bad sectors.
> > But be careful not to overwrite good data. I have done this using dd.
> > First I found the bad sector with dd, then I wrote to the 1 bad sector
> > with dd. I would need to refer to the man page to do it again, so I
> > can't explain it here at this time. It is not really hard, but 1 small
> > mistake, and "that's it man, game over man, game over".
> >
> > Guy
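(For reference, a minimal sketch of the dd approach Guy describes: read the suspect sector to confirm it is bad, then overwrite just that one sector so the drive can remap it. The sector number 1234567 below is only a placeholder, not taken from this thread; since badblocks uses 1024-byte blocks by default, a reported block N should correspond to 512-byte sectors 2N and 2N+1.)

# Confirm the suspect sector really is unreadable (expect an I/O error here):
dd if=/dev/hdb1 of=/dev/null bs=512 skip=1234567 count=1

# Overwrite only that one sector so the drive remaps it to a spare.
# The 512 bytes it held are lost, so double-check the sector number first.
dd if=/dev/zero of=/dev/hdb1 bs=512 seek=1234567 count=1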
> > -----Original Message-----
> > From: linux-raid-owner@xxxxxxxxxxxxxxx
> > [mailto:linux-raid-owner@xxxxxxxxxxxxxxx] On Behalf Of B. J. Zolp
> > Sent: Monday, November 29, 2004 11:33 AM
> > To: linux-raid@xxxxxxxxxxxxxxx
> > Subject: RAID5 Not coming back up after crash
> >
> > I have a RAID5 setup on my fileserver using disks hda1, hdb1, hdc1,
> > hdd1, hdi1 and hdj1. Yesterday I started moving a large chunk of files,
> > ~80GB, from this array to a stand-alone drive in the system, and about
> > halfway through the mv I got a ton of PERMISSION DENIED errors on some
> > of the remaining files left to be moved, and the move process quit. I
> > did an ls of the raid directory and got PERMISSION DENIED on the same
> > files that errored out on the mv, while some of the other files looked
> > fine. I figured it might be a good idea to take the raid down and back
> > up again (probably a mistake), and I could not reboot the machine
> > without physically turning it off, as some processes were hung. Upon
> > booting back up, the raid did not come online, stating that hdj1 was
> > kicked due to inconsistency. Additionally, hdb1 is listed as offline
> > too. So I have 2 drives that are not cooperating. I have a hunch hdb1
> > might not have been working for some time.
> >
> > I found some info stating that if I mark the drive that failed first
> > as "failed-disk" and try a "mkraid --force --dangerous-no-resync
> > /dev/md0" then I might have some luck getting my files back. From my
> > logs I can see that all the working drives have event counter 00000022,
> > hdj1 has event counter 00000021, and hdb1 has event counter 00000001.
> > Does this mean that hdb1 failed a long time ago, or is this difference
> > in event counters likely within a few minutes of each other? I just
> > ran badblocks on both hdb1 and hdj1 and found 1 bad block on hdb1 and
> > about 15 on hdj1; would that be enough to cause my raid to get this out
> > of whack? In any case I plan to replace those drives, but would the
> > method above be the best route, once I have copied the raw data to the
> > new drives, to bring my raid back up?
> >
> > Thanks,
> >
> > bjz
> >
> > Here is my log from when I run raidstart /dev/md0:
> >
> > Nov 29 10:10:19 orion kernel: [events: 00000022]
> > Nov 29 10:10:19 orion last message repeated 3 times
> > Nov 29 10:10:19 orion kernel: [events: 00000021]
> > Nov 29 10:10:19 orion kernel: md: autorun ...
> > Nov 29 10:10:19 orion kernel: md: considering hdj1 ...
> > Nov 29 10:10:19 orion kernel: md: adding hdj1 ...
> > Nov 29 10:10:19 orion kernel: md: adding hdi1 ...
> > Nov 29 10:10:19 orion kernel: md: adding hdd1 ...
> > Nov 29 10:10:19 orion kernel: md: adding hdc1 ...
> > Nov 29 10:10:19 orion kernel: md: adding hda1 ...
> > Nov 29 10:10:19 orion kernel: md: created md0
> > Nov 29 10:10:19 orion kernel: md: bind<hda1,1>
> > Nov 29 10:10:19 orion kernel: md: bind<hdc1,2>
> > Nov 29 10:10:19 orion kernel: md: bind<hdd1,3>
> > Nov 29 10:10:19 orion kernel: md: bind<hdi1,4>
> > Nov 29 10:10:19 orion kernel: md: bind<hdj1,5>
> > Nov 29 10:10:19 orion kernel: md: running: <hdj1><hdi1><hdd1><hdc1><hda1>
> > Nov 29 10:10:19 orion kernel: md: hdj1's event counter: 00000021
> > Nov 29 10:10:19 orion kernel: md: hdi1's event counter: 00000022
> > Nov 29 10:10:19 orion kernel: md: hdd1's event counter: 00000022
> > Nov 29 10:10:19 orion kernel: md: hdc1's event counter: 00000022
> > Nov 29 10:10:19 orion kernel: md: hda1's event counter: 00000022
> > Nov 29 10:10:19 orion kernel: md: superblock update time inconsistency -- using the most recent one
> > Nov 29 10:10:19 orion kernel: md: freshest: hdi1
> > Nov 29 10:10:19 orion kernel: md0: kicking faulty hdj1!
> > Nov 29 10:10:19 orion kernel: md: unbind<hdj1,4>
> > Nov 29 10:10:19 orion kernel: md: export_rdev(hdj1)
> > Nov 29 10:10:19 orion kernel: md: md0: raid array is not clean -- starting background reconstruction
> > Nov 29 10:10:19 orion kernel: md0: max total readahead window set to 2560k
> > Nov 29 10:10:19 orion kernel: md0: 5 data-disks, max readahead per data-disk: 512k
> > Nov 29 10:10:19 orion kernel: raid5: device hdi1 operational as raid disk 4
> > Nov 29 10:10:19 orion kernel: raid5: device hdd1 operational as raid disk 3
> > Nov 29 10:10:19 orion kernel: raid5: device hdc1 operational as raid disk 2
> > Nov 29 10:10:19 orion kernel: raid5: device hda1 operational as raid disk 0
> > Nov 29 10:10:19 orion kernel: raid5: not enough operational devices for md0 (2/6 failed)
> > Nov 29 10:10:19 orion kernel: RAID5 conf printout:
> > Nov 29 10:10:19 orion kernel:  --- rd:6 wd:4 fd:2
> > Nov 29 10:10:19 orion kernel:  disk 0, s:0, o:1, n:0 rd:0 us:1 dev:hda1
> > Nov 29 10:10:19 orion kernel:  disk 1, s:0, o:0, n:1 rd:1 us:1 dev:[dev 00:00]
> > Nov 29 10:10:19 orion kernel:  disk 2, s:0, o:1, n:2 rd:2 us:1 dev:hdc1
> > Nov 29 10:10:19 orion kernel:  disk 3, s:0, o:1, n:3 rd:3 us:1 dev:hdd1
> > Nov 29 10:10:19 orion kernel:  disk 4, s:0, o:1, n:4 rd:4 us:1 dev:hdi1
> > Nov 29 10:10:19 orion kernel:  disk 5, s:0, o:0, n:5 rd:5 us:1 dev:[dev 00:00]
> > Nov 29 10:10:19 orion kernel: raid5: failed to run raid set md0
> > Nov 29 10:10:19 orion kernel: md: pers->run() failed ...
> > Nov 29 10:10:19 orion kernel: md: do_md_run() returned -22
> > Nov 29 10:10:19 orion kernel: md: md0 stopped.
> > Nov 29 10:10:19 orion kernel: md: unbind<hdi1,3>
> > Nov 29 10:10:19 orion kernel: md: export_rdev(hdi1)
> > Nov 29 10:10:19 orion kernel: md: unbind<hdd1,2>
> > Nov 29 10:10:19 orion kernel: md: export_rdev(hdd1)
> > Nov 29 10:10:19 orion kernel: md: unbind<hdc1,1>
> > Nov 29 10:10:19 orion kernel: md: export_rdev(hdc1)
> > Nov 29 10:10:19 orion kernel: md: unbind<hda1,0>
> > Nov 29 10:10:19 orion kernel: md: export_rdev(hda1)
> > Nov 29 10:10:19 orion kernel: md: ... autorun DONE.
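(Putting the thread's suggestions together, the copy-then-force-assemble route might look like the sketch below. The replacement device name /dev/hdk1 is only a placeholder for wherever the new disk shows up; the assemble command is the one Guy gave, with the copy standing in for the failing hdj1, and hdb1 left out until it is replaced and re-added.)

# Copy the failing member onto the replacement partition; unlike plain dd,
# dd_rescue keeps going when it hits read errors instead of aborting:
dd_rescue /dev/hdj1 /dev/hdk1

# Force-assemble the degraded array from the five usable members,
# with the copy taking hdj1's place (RAID5 can run with one disk missing):
mdadm -A /dev/md0 --force /dev/hda1 /dev/hdc1 /dev/hdd1 /dev/hdi1 /dev/hdk1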