Re: Failed RAID-5 with 4 disks

Tyler <pml@xxxxxxxx> · Tue, 26 Jul 2005 16:12:02 -0700

I made a mistake near the end of my original email. Where I wrote "was 
fubar... then I would try the above steps, but force the assemble with 
the original failed disk", I actually meant to say "but force the 
assemble with the newly DD'd copy of the original/first drive that failed."

Tyler.

Tyler wrote:

My suggestion would be to buy two new drives and DD (or dd_rescue) 
the two bad drives onto them. Then plug in the new drive that holds 
the copy of the most recent failure (HDG?) and try running a forced 
assemble including that copy. Next, in readonly mode, run an fsck to 
check the file system and see if it thinks most things are okay.  
*IF* it checks out okay (for the most part.. you will probably lose 
some data), plug the second new disk in and add it to the array as a 
spare; it would then start a resync of the array.  
Otherwise, if the fsck found that the entire filesystem was fubar... 
then I would try the above steps, but force the assemble with the 
original failed disk.. but depending on how long it's been between the 
two failures, and whether any data was written to the array after the 
first failure, this is probably not going to be a good thing.. but it 
could still be useful if you were trying to recover specific files 
that were not touched in between the two failures.
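
As a rough sketch of the order of operations above (not a tested recovery procedure), the Python snippet below only builds and prints the commands; it runs nothing. It assumes mdadm is used rather than raidtools, that /dev/hdi and /dev/hdk are the two good members, and that /dev/hdm and /dev/hdo are the two new drives. Those last two names are hypothetical placeholders, and the function name recovery_plan is made up for illustration.

```python
# Dry-run sketch: returns the commands in the order described above.
# Nothing is executed. /dev/hdm and /dev/hdo are hypothetical names for
# the two new drives; adjust every device name before using any of this.

def recovery_plan(md, good, newest_failed, oldest_failed,
                  copy_newest, copy_oldest):
    """Return the recovery commands, in order, as argument lists."""
    return [
        # 1. Copy both failing drives. conv=noerror,sync keeps dd going
        #    past read errors and pads unreadable blocks with zeros.
        ["dd", f"if={newest_failed}", f"of={copy_newest}",
         "bs=4k", "conv=noerror,sync"],
        ["dd", f"if={oldest_failed}", f"of={copy_oldest}",
         "bs=4k", "conv=noerror,sync"],
        # 2. Forced assemble from the good members plus the copy of the
        #    most recently failed drive (degraded, 3 of 4 devices).
        ["mdadm", "--assemble", "--force", "--run", md,
         *good, copy_newest],
        # 3. Check only, change nothing (-n is handed on to the
        #    filesystem checker; assumes a checker that accepts it).
        ["fsck", "-n", md],
        # 4. ONLY if step 3 looks sane: add the second new disk so the
        #    array resyncs onto it.
        ["mdadm", md, "--add", copy_oldest],
    ]

plan = recovery_plan("/dev/md0", ["/dev/hdi", "/dev/hdk"],
                     "/dev/hdg", "/dev/hde", "/dev/hdm", "/dev/hdo")
for cmd in plan:
    print(" ".join(cmd))
```

Keeping the plan as data makes it easy to read through every command before typing any of them by hand, which seems prudent when the source drives are already failing.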

I would also suggest googling "raid manual recovery procedures". Some 
of the info is outdated, but some of it describes what I just described above.

Tyler.

Frank Blendinger wrote:

Hi,

I have a RAID-5 set up with the following raidtab:

raiddev /dev/md0
       raid-level              5
       nr-raid-disks           4
       nr-spare-disks          0
       persistent-superblock   1
       parity-algorithm        left-symmetric
       chunk-size              256
       device                  /dev/hde
       raid-disk               0
       device                  /dev/hdg
       raid-disk               1
       device                  /dev/hdi
       raid-disk               2
       device                  /dev/hdk
       raid-disk               3

My hde failed some time ago, leaving some

    hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
    hde: dma_intr: error=0x84 { DriveStatusError BadCRC }

messages in the syslog.

I wanted to make sure it really was damaged, so I ran a badblocks
(read-only) scan on /dev/hde. It did find a bad sector on the
disk.

I wanted to take the disk out and replace it with a new one, but
unfortunately my hdg now seems to have run into trouble too. I have some
SeekComplete/BadCRC errors in my log for that disk as well.

Furthermore, I got this:

Jul 25 10:35:49 blackbox kernel: ide: failed opcode was: unknown
Jul 25 10:35:49 blackbox kernel: hdg: DMA disabled
Jul 25 10:35:49 blackbox kernel: PDC202XX: Secondary channel reset.
Jul 25 10:35:49 blackbox kernel: PDC202XX: Primary channel reset.
Jul 25 10:35:49 blackbox kernel: hde: lost interrupt
Jul 25 10:35:49 blackbox kernel: ide3: reset: master: error (0x00?)
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 488396928
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159368976
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159368984
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159368992
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159369000
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159369008
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159369016
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159369024
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159369032
Jul 25 10:35:49 blackbox kernel: md: write_disk_sb failed for device hdg
Jul 25 10:35:49 blackbox kernel: lost page write due to I/O error on md0
Jul 25 10:35:49 blackbox kernel: lost page write due to I/O error on md0
Jul 25 10:35:49 blackbox kernel: RAID5 conf printout:
Jul 25 10:35:49 blackbox kernel:  --- rd:4 wd:2 fd:2
Jul 25 10:35:49 blackbox kernel:  disk 0, o:1, dev:hdk
Jul 25 10:35:49 blackbox kernel:  disk 1, o:1, dev:hdi
Jul 25 10:35:49 blackbox kernel:  disk 2, o:0, dev:hdg
Jul 25 10:35:49 blackbox kernel: RAID5 conf printout:
Jul 25 10:35:49 blackbox kernel:  --- rd:4 wd:2 fd:2
Jul 25 10:35:49 blackbox kernel:  disk 0, o:1, dev:hdk
Jul 25 10:35:49 blackbox kernel:  disk 1, o:1, dev:hdi
Jul 25 10:35:49 blackbox kernel: lost page write due to I/O error on md0

Well, now it seems I have two failed disks in my RAID-5, which of course
would be fatal. I am still hoping to rescue the data on the array
somehow, but I am not sure what the best approach would be. I don't
want to cause any more damage.
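
To illustrate why a second failure is fatal for RAID-5, here is a simplified Python sketch of XOR parity on a single stripe of a 4-disk array (three data chunks plus one parity chunk). It ignores the left-symmetric parity rotation and the 256k chunk size from the raidtab, and the 4-byte chunks are made-up sample data.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*blocks))

# One stripe on a 4-disk RAID-5: three data chunks and one parity chunk.
data = [b"AAAA", b"BBBB", b"CCCC"]      # made-up sample chunks
parity = xor_blocks(data)

# One disk lost (the one holding data[1]): XOR of the survivors
# gives the missing chunk back.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])

# Two disks lost (data[1] and data[2]): the survivors only yield the
# XOR of the two missing chunks, which cannot be split apart again.
leftover = xor_blocks([data[0], parity])
print(leftover == xor_blocks([data[1], data[2]]))
```

This is why the suggested approach relies on copying as much as possible off a failed member and forcing it back in, rather than on parity alone.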

When booting my system with all four disks connected, hde and hdg as
expected won't get added:

Jul 26 18:07:59 blackbox kernel: md: hdg has invalid sb, not importing!
Jul 26 18:07:59 blackbox kernel: md: autorun ...
Jul 26 18:07:59 blackbox kernel: md: considering hdi ...
Jul 26 18:07:59 blackbox kernel: md:  adding hdi ...
Jul 26 18:07:59 blackbox kernel: md:  adding hdk ...
Jul 26 18:07:59 blackbox kernel: md:  adding hde ...
Jul 26 18:07:59 blackbox kernel: md: created md0
Jul 26 18:07:59 blackbox kernel: md: bind<hde>
Jul 26 18:07:59 blackbox kernel: md: bind<hdk>
Jul 26 18:07:59 blackbox kernel: md: bind<hdi>
Jul 26 18:07:59 blackbox kernel: md: running: <hdi><hdk><hde>
Jul 26 18:07:59 blackbox kernel: md: kicking non-fresh hde from array!
Jul 26 18:07:59 blackbox kernel: md: unbind<hde>
Jul 26 18:07:59 blackbox kernel: md: export_rdev(hde)
Jul 26 18:07:59 blackbox kernel: raid5: device hdi operational as raid disk 1
Jul 26 18:07:59 blackbox kernel: raid5: device hdk operational as raid disk 0
Jul 26 18:07:59 blackbox kernel: RAID5 conf printout:
Jul 26 18:07:59 blackbox kernel:  --- rd:4 wd:2 fd:2
Jul 26 18:07:59 blackbox kernel:  disk 0, o:1, dev:hdk
Jul 26 18:07:59 blackbox kernel:  disk 1, o:1, dev:hdi
Jul 26 18:07:59 blackbox kernel: md: do_md_run() returned -22
Jul 26 18:07:59 blackbox kernel: md: md0 stopped.
Jul 26 18:07:59 blackbox kernel: md: unbind<hdi>
Jul 26 18:07:59 blackbox kernel: md: export_rdev(hdi)
Jul 26 18:07:59 blackbox kernel: md: unbind<hdk>
Jul 26 18:07:59 blackbox kernel: md: export_rdev(hdk)
Jul 26 18:07:59 blackbox kernel: md: ... autorun DONE.

So hde is not fresh (it has been removed from the array for quite some
time now) and hdg has an invalid superblock.
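
The "rd:4 wd:2 fd:2" summary in the RAID5 conf printout gives the number of raid disks, working disks and failed disks. A small Python sketch for checking such a line, assuming the usual RAID-5 rule that at most one member may be missing (the function name raid5_can_start is made up for illustration):

```python
import re

# Matches the summary in lines such as " --- rd:4 wd:2 fd:2".
CONF_RE = re.compile(r"rd:(\d+) wd:(\d+) fd:(\d+)")

def raid5_can_start(line):
    """Return True if the printout shows enough working disks for RAID-5."""
    match = CONF_RE.search(line)
    if match is None:
        raise ValueError("no rd/wd/fd summary in line")
    raid_disks, working, _failed = map(int, match.groups())
    # RAID-5 survives the loss of one member, not two.
    return working >= raid_disks - 1

line = "Jul 26 18:07:59 blackbox kernel:  --- rd:4 wd:2 fd:2"
print(raid5_can_start(line))
```

With only two of four disks working, the check fails, which matches the array refusing to run in the log above.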

Any advice on what I should do now? Would it be better to try to rebuild
the array with hde or with hdg?

Greetings,
Frank
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
