Neil Brown wrote:
> On Sunday December 4, a_a@xxxxxx wrote:
> > Hi,
> > I have a RAID5 array consisting of 4 disks:
> > /dev/hda3
> > /dev/hdc3
> > /dev/hde3
> > /dev/hdg3
> > and the Linux machine that this system was running on crashed yesterday
> > due to a faulty kernel driver (i.e. the machine just halted).
> > So I reset it, but it didn't come up again.
> > I started the machine with a Knoppix CD and found out that the array had
> > been running in degraded mode for about two months (/dev/hda3 dropped
> > out then).
> > Here is a short snippet of the syslog:
> > --------------------------------------
> > Oct 22 15:30:07 omega kernel: hda: dma_intr: status=0x51 { DriveReady
> > SeekComplete Error }
> > Oct 22 15:30:07 omega kernel: hda: dma_intr: error=0x40 {
> > UncorrectableError }, LBAsect=454088, sector=4264
> > Oct 22 15:30:07 omega kernel: end_request: I/O error, dev 03:03 (hda),
> > sector 4264
> > Oct 22 15:30:07 omega kernel: raid5: Disk failure on hda3, disabling
> > device. Operation continuing on 3 devices
> > Oct 22 15:30:07 omega kernel: md: updating md0 RAID superblock on device
> > Oct 22 15:30:07 omega kernel: md: hda3 (skipping faulty)
> > Oct 22 15:30:07 omega kernel: md: hdc3 [events: 00000137]
> > Oct 22 15:30:07 omega kernel: (write) hdc3's sb offset: 119834496
> > Oct 22 15:30:07 omega kernel: md: recovery thread got woken up ...
> > Oct 22 15:30:07 omega kernel: md: hde3 [events: 00000137]
> > Oct 22 15:30:07 omega kernel: (write) hde3's sb offset: 119834496
> > Oct 22 15:30:07 omega kernel: md: hdg3 [events: 00000137]
> > Oct 22 15:30:07 omega kernel: (write) hdg3's sb offset: 119834496
> > Oct 22 15:30:07 omega kernel: md0: no spare disk to reconstruct array!
> > -- continuing in degraded mode
> > Oct 22 15:30:07 omega kernel: md: recovery thread finished ...
> You want to be running "mdadm --monitor". You really really do!
> Anyone out there who is listening: if you have any md/raid arrays
> (other than linear/raid0) and are not running "mdadm --monitor",
> please do so. Now.
> Also run "mdadm --monitor --oneshot --scan" (or similar) from a
> nightly cron job, so it will nag you about degraded arrays.
> Please!
Yes, you are absolutely right! It was my first thought when I saw the
broken array: "There _must_ be a program that monitors the array
automatically for me and gives an alert if something goes wrong!"
And it will be the first thing I do after the array is running again!
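For anyone else setting this up, a minimal sketch along the lines Neil suggests might look like the following. The mail address, config path, and cron schedule are placeholders, not from this thread:

```shell
# /etc/mdadm.conf (on some distributions: /etc/mdadm/mdadm.conf)
# A MAILADDR line tells the monitor where to send alert mail, e.g.:
#   MAILADDR admin@example.com

# Run the monitor as a daemon, re-checking all configured arrays
# every 1800 seconds:
mdadm --monitor --scan --daemonise --delay=1800

# Nightly cron entry (e.g. in /etc/crontab) that re-reports any
# degraded arrays, so the problem keeps nagging you:
# 30 3 * * * root mdadm --monitor --oneshot --scan
```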
But why do you think that hda3 dropped out of the array 2 months ago?
The update time reported by mdadm --examine is
Update Time : Sat Dec 3 18:56:59 2005
This comes from an attempt to assemble the array from hda3, hde3 and
hdg3. The first "mdadm --examine" printed an update time for hda3
of sometime in October...
> The superblock from hda3 seems to suggest that it was hdc3 that was
> the problem.... odd.
> > "Pass 1: Checking inodes, blocks, and sizes
> > Error reading block 131460 (Attempt to read block from filesystem
> > resulted in short read) while doing inode scan. Ignore error?"
> This strongly suggests there is a problem with one of the drives - it
> is returning read errors. Are there any informative kernel logs?
> If it is hdc that is reporting errors, try to re-assemble the array
> from hda3, hde3, hdg3.
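A re-assembly along those lines might be sketched as follows. This is a hedged sketch, not a command sequence from the thread; in particular, whether --force is needed depends on how stale the event counts on the members are:

```shell
# Stop any half-assembled array first:
mdadm --stop /dev/md0

# Try to assemble from the three suggested members; --force lets mdadm
# accept a member with a slightly stale event count if necessary:
mdadm --assemble --force /dev/md0 /dev/hda3 /dev/hde3 /dev/hdg3

# Inspect the result before mounting or fsck'ing anything:
cat /proc/mdstat
mdadm --detail /dev/md0
```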
That is what I already tried, but it didn't succeed. So I tried it with
hd[ceg]3 and could even mount the array, and the data seem to be OK at
first glance. What I could certainly do is plug in an external USB
hard drive and copy as much data as possible to it, but the problem
is that the array consists of 4x120GB disks, resulting in about 360GB
of data. So I hope I can reconstruct it without copying...
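If the backup route is taken after all, mounting read-only before copying keeps the degraded array from being written to. The mount point and target path below are made up for the example:

```shell
# Mount the degraded array read-only so nothing touches it:
mount -o ro /dev/md0 /mnt/raid

# Copy everything to the USB drive, preserving permissions,
# ownership, and timestamps:
rsync -a /mnt/raid/ /media/usb/raid-backup/
```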
But the really strange thing to me is that I can mount the array and
the data seem to be OK, yet "fsck" produces so many errors....
The other question is why /dev/hdg3 appears _twice_ and /dev/hda3
_not_at_all_ when I type
mdadm --create /dev/md0 -c32 -l5 -n4 missing /dev/hdc3 /dev/hde3 /dev/hdg3
mdadm: /dev/hdc3 appears to be part of a raid array:
level=5 devices=4 ctime=Fri May 30 14:25:47 2003
mdadm: /dev/hde3 appears to be part of a raid array:
level=5 devices=4 ctime=Fri May 30 14:25:47 2003
mdadm: /dev/hdg3 appears to contain an ext2fs file system
size=493736704K mtime=Tue Jan 3 04:48:21 2006
mdadm: /dev/hdg3 appears to be part of a raid array:
level=5 devices=4 ctime=Fri May 30 14:25:47 2003
Continue creating array? no
mdadm: create aborted.
Thanks in advance
Alfons
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html