On 18.07.2006 15:46:53, Neil Brown wrote:
> On Monday July 17, blindcoder@xxxxxxxxxxxxxxxxxxxx wrote:
> >
> > /dev/md/0 on /boot type ext2 (rw,nogrpid)
> > /dev/md/1 on / type reiserfs (rw)
> > /dev/md/2 on /var type reiserfs (rw)
> > /dev/md/3 on /opt type reiserfs (rw)
> > /dev/md/4 on /usr type reiserfs (rw)
> > /dev/md/5 on /data type reiserfs (rw)
> >
> > I'm running the following kernel:
> > Linux ceres 2.6.16.18-rock #1 SMP PREEMPT Sun Jun 25 10:47:51 CEST 2006 i686 GNU/Linux
> >
> > and mdadm 2.4.
> > Now, hdb seems to be broken, even though smart says everything's fine.
> > After a day or two, hdb would fail:
> >
> > Jul 16 16:58:41 ceres kernel: raid5: Disk failure on hdb3, disabling device. Operation continuing on 2 devices
> > Jul 16 16:58:41 ceres kernel: raid5: Disk failure on hdb5, disabling device. Operation continuing on 2 devices
> > Jul 16 16:59:06 ceres kernel: raid5: Disk failure on hdb7, disabling device. Operation continuing on 2 devices
> > Jul 16 16:59:37 ceres kernel: raid5: Disk failure on hdb8, disabling device. Operation continuing on 2 devices
> > Jul 16 17:02:22 ceres kernel: raid5: Disk failure on hdb6, disabling device. Operation continuing on 2 devices
>
> Very odd... no other message from the kernel? You would expect
> something if there was a real error.

This was the only output on the console. But I just checked
/var/log/messages now... ouch...

---
Jul 16 16:59:36 ceres kernel: hdb: status error: status=0x00 { }
Jul 16 16:59:36 ceres kernel: ide: failed opcode was: 0xea
Jul 16 16:59:36 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:36 ceres kernel: hdb: status error: status=0x10 { SeekComplete }
Jul 16 16:59:36 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:36 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:36 ceres kernel: hdb: status error: status=0x10 { SeekComplete }
Jul 16 16:59:36 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:36 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:36 ceres kernel: hdb: status error: status=0x10 { SeekComplete }
Jul 16 16:59:36 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:36 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:36 ceres kernel: hdb: status error: status=0x10 { SeekComplete }
Jul 16 16:59:36 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:37 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:37 ceres kernel: ide0: reset: success
Jul 16 16:59:37 ceres kernel: hdb: status error: status=0x10 { SeekComplete }
Jul 16 16:59:37 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:37 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:37 ceres kernel: hdb: status error: status=0x00 { }
Jul 16 16:59:37 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:37 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:37 ceres kernel: hdb: status error: status=0x10 { SeekComplete }
Jul 16 16:59:37 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:37 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:37 ceres kernel: hdb: status error: status=0x10 { SeekComplete }
Jul 16 16:59:37 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:37 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:37 ceres kernel: ide0: reset: success
Jul 16 16:59:37 ceres kernel: hdb: status error: status=0x00 { }
Jul 16 16:59:37 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:37 ceres kernel: end_request: I/O error, dev hdb, sector 488391932
Jul 16 16:59:37 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:37 ceres kernel: hdb: status error: status=0x10 { SeekComplete }
Jul 16 16:59:37 ceres kernel: ide: failed opcode was: 0xea
Jul 16 16:59:37 ceres kernel: raid5: Disk failure on hdb8, disabling device. Operation continuing on 2 devices
Jul 16 16:59:37 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:37 ceres kernel: RAID5 conf printout:
Jul 16 16:59:37 ceres kernel: --- rd:3 wd:2 fd:1
Jul 16 16:59:37 ceres kernel: disk 0, o:0, dev:hdb8
Jul 16 16:59:37 ceres kernel: disk 1, o:1, dev:hda8
Jul 16 16:59:37 ceres kernel: disk 2, o:1, dev:hdc8
Jul 16 16:59:37 ceres kernel: RAID5 conf printout:
Jul 16 16:59:37 ceres kernel: --- rd:3 wd:2 fd:1
Jul 16 16:59:37 ceres kernel: disk 1, o:1, dev:hda8
Jul 16 16:59:37 ceres kernel: disk 2, o:1, dev:hdc8
---

Now, is this a broken IDE controller or harddisk? Because smartctl
claims that everything is fine.
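
(For what it's worth, "fine" so far just means what smartctl reports at
a glance. A rough sketch of a deeper check that might help tell disk
from controller, assuming the usual smartmontools attribute names:

# start a long offline self-test on the suspect drive
smartctl -t long /dev/hdb
# once it has finished: self-test log and raw attribute counters
smartctl -l selftest /dev/hdb
smartctl -A /dev/hdb

Growing Reallocated_Sector_Ct or Current_Pending_Sector counts would
point at the disk itself, while a rising UDMA_CRC_Error_Count is more
typical of a cable or controller problem, as far as I understand it.)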
> > The problem now is, the machine hangs after the last message and I can only
> > turn it off by physically removing the power plug.
>
> alt-sysrq-P or alt-sysrq-T give anything useful?

I tried alt-sysrq-o and -b, to no avail. Support for it is in my kernel
and it works (tested earlier).

> > When I now reboot the machine, `mdadm -A /dev/md[1-5]' will not start the
> > arrays cleanly. They will all be lacking the hdb device and be 'inactive'.
> > `mdadm -R' will not start them in this state. According to
> > `mdadm --manage --help' using `mdadm --manage /dev/md/3 -a /dev/hdb6'
> > should add /dev/hdb6 to /dev/md/3, but nothing really happens.
> > After some trying, I realised that `mdadm /dev/md/3 -a /dev/hdb6' actually
> > works. So where's the problem? The help message? The parameter parsing code?
> > My understanding?
>
> I don't understand. 'mdadm --manage /dev/md/3 -a /dev/hdb6' is
> exactly the same command as without the --manage. Maybe if you
> provide a log of exactly what you did, exactly what the messages were,
> and exactly what the result (e.g. in /proc/mdstat) was.

I don't have a script log or something, but here's what I did from an
initrd with init=/bin/bash:

# < mount /dev /proc /sys /tmp >
# < start udevd udevtrigger udevsettle >
while read a dev c ; do
    [ "$a" != "ARRAY" ] && continue
    [ -e /dev/md/${dev##*/} ] || /bin/mknod $dev b 9 ${dev##*/}
    /sbin/mdadm -A ${dev}
done < /etc/mdadm.conf

This is the mdadm.conf:

DEVICE partitions
ARRAY /dev/md/0 level=raid1 num-devices=3 UUID=3559ffcf:14eb9889:3826d6c2:c13731d7
ARRAY /dev/md/1 level=raid5 num-devices=3 UUID=649fc7cc:d4b52c31:240fce2c:c64686e7
ARRAY /dev/md/2 level=raid5 num-devices=3 UUID=9a3bf634:58f39e44:27ba8087:d5189766 spares=1
ARRAY /dev/md/3 level=raid5 num-devices=3 UUID=29ff75f4:66f2639c:976cbcfe:1bd9a1b4 spares=1
ARRAY /dev/md/4 level=raid5 num-devices=3 UUID=d4799be3:5b157884:e38718c2:c05ab840 spares=1
ARRAY /dev/md/5 level=raid5 num-devices=3 UUID=ca4a6110:4533d8d5:0e2ed4e1:2f5805b2 spares=1
MAIL root@localhost

At this moment, only /dev/md/0 was active.
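
(Side note: the ARRAY lines can be cross-checked against what is
actually recorded in the component superblocks, e.g.:

# print ARRAY lines as derived from the on-disk superblocks; the UUIDs
# should match the ones in /etc/mdadm.conf above
mdadm --examine --scan
# with -v it also lists which partitions each UUID was found on
mdadm --examine --scan -v

This is only a sketch of how such a config could be verified, not
output captured from this machine.)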
Reconstructed, /proc/mdstat looked something like this:

Personalities : [linear] [raid0] [raid1] [raid5] [raid4]
md5 : inactive raid5 hda8[1] hdc8[2]
      451426304 blocks level 5, 64k chunk, algorithm 2 [2/3] [_UU]
md4 : inactive raid5 hda7[1] hdc7[2]
      13992320 blocks level 5, 64k chunk, algorithm 2 [2/3] [_UU]
md3 : inactive raid5 hdc6[1] hda6[0]
      8000128 blocks level 5, 64k chunk, algorithm 2 [2/3] [_UU]
md2 : inactive raid5 hda5[1] hdc5[2]
      5991936 blocks level 5, 64k chunk, algorithm 2 [2/3] [_UU]
md1 : inactive raid5 hda3[1] hdc3[2]
      5992064 blocks level 5, 64k chunk, algorithm 2 [2/3] [_UU]
md0 : active raid1 hdb1[0] hdc1[2] hda1[1]
      497856 blocks [3/3] [UUU]
unused devices: <none>

I'm not sure about the line containing blocks, level and such, but I'm
sure about the first line of each mdX.

At this point, doing anything to /dev/md/[1-5] would give me an
Input/Output error. Running

# mdadm -R /dev/md/1

would give me this error:

Jul 16 17:17:42 ceres kernel: raid5: cannot start dirty degraded array for md1

# mdadm --manage /dev/md/1 --add /dev/hdb3

would do nothing. No message on stdout, stderr, kernel or anything. It
would just do nothing.

# mdadm /dev/md/1 --add /dev/hdb3

would in turn add hdb3 to md/1 and then I was able to

# mdadm -R /dev/md/1

and the resync would start.

Right now I'm quite sure the problem will arise again (see messages
above). I'll try to create a script log of what happens when I
encounter the problem again.
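To have the workaround in one place (md1 shown, device names as above):

# optional sanity check before touching anything: the kicked-out member
# should show an older event count than the two surviving ones
mdadm --examine /dev/hda3 /dev/hdb3 /dev/hdc3 | grep -E 'Events|State'
# add the kicked-out member back, then start the array; resync begins
mdadm /dev/md/1 --add /dev/hdb3
mdadm -R /dev/md/1
cat /proc/mdstat

The --examine line is only what I would expect to look at next time,
not output I captured during the incident; the --add/-R pair is what
actually got the arrays resyncing here.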
Greetings,
	Benjamin

--
Today, memory either forgets things when you don't want it to, or
remembers things long after they're better forgotten.