On 18.07.2006 15:46:53, Neil Brown wrote:
> On Monday July 17, blindcoder@xxxxxxxxxxxxxxxxxxxx wrote:
> >
> > /dev/md/0 on /boot type ext2 (rw,nogrpid)
> > /dev/md/1 on / type reiserfs (rw)
> > /dev/md/2 on /var type reiserfs (rw)
> > /dev/md/3 on /opt type reiserfs (rw)
> > /dev/md/4 on /usr type reiserfs (rw)
> > /dev/md/5 on /data type reiserfs (rw)
> >
> > I'm running the following kernel:
> > Linux ceres 2.6.16.18-rock #1 SMP PREEMPT Sun Jun 25 10:47:51 CEST 2006 i686 GNU/Linux
> >
> > and mdadm 2.4.
> > Now, hdb seems to be broken, even though smart says everything's fine.
> > After a day or two, hdb would fail:
> >
> > Jul 16 16:58:41 ceres kernel: raid5: Disk failure on hdb3, disabling device. Operation continuing on 2 devices
> > Jul 16 16:58:41 ceres kernel: raid5: Disk failure on hdb5, disabling device. Operation continuing on 2 devices
> > Jul 16 16:59:06 ceres kernel: raid5: Disk failure on hdb7, disabling device. Operation continuing on 2 devices
> > Jul 16 16:59:37 ceres kernel: raid5: Disk failure on hdb8, disabling device. Operation continuing on 2 devices
> > Jul 16 17:02:22 ceres kernel: raid5: Disk failure on hdb6, disabling device. Operation continuing on 2 devices
>
> Very odd... no other message from the kernel? You would expect
> something if there was a real error.

This was the only output on the console. But I just checked
/var/log/messages now... ouch...

---
Jul 16 16:59:36 ceres kernel: hdb: status error: status=0x00 { }
Jul 16 16:59:36 ceres kernel: ide: failed opcode was: 0xea
Jul 16 16:59:36 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:36 ceres kernel: hdb: status error: status=0x10 { SeekComplete }
Jul 16 16:59:36 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:36 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:36 ceres kernel: hdb: status error: status=0x10 { SeekComplete }
Jul 16 16:59:36 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:36 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:36 ceres kernel: hdb: status error: status=0x10 { SeekComplete }
Jul 16 16:59:36 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:36 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:36 ceres kernel: hdb: status error: status=0x10 { SeekComplete }
Jul 16 16:59:36 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:37 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:37 ceres kernel: ide0: reset: success
Jul 16 16:59:37 ceres kernel: hdb: status error: status=0x10 { SeekComplete }
Jul 16 16:59:37 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:37 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:37 ceres kernel: hdb: status error: status=0x00 { }
Jul 16 16:59:37 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:37 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:37 ceres kernel: hdb: status error: status=0x10 { SeekComplete }
Jul 16 16:59:37 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:37 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:37 ceres kernel: hdb: status error: status=0x10 { SeekComplete }
Jul 16 16:59:37 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:37 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:37 ceres kernel: ide0: reset: success
Jul 16 16:59:37 ceres kernel: hdb: status error: status=0x00 { }
Jul 16 16:59:37 ceres kernel: ide: failed opcode was: unknown
Jul 16 16:59:37 ceres kernel: end_request: I/O error, dev hdb, sector 488391932
Jul 16 16:59:37 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:37 ceres kernel: hdb: status error: status=0x10 { SeekComplete }
Jul 16 16:59:37 ceres kernel: ide: failed opcode was: 0xea
Jul 16 16:59:37 ceres kernel: raid5: Disk failure on hdb8, disabling device. Operation continuing on 2 devices
Jul 16 16:59:37 ceres kernel: hdb: drive not ready for command
Jul 16 16:59:37 ceres kernel: RAID5 conf printout:
Jul 16 16:59:37 ceres kernel: --- rd:3 wd:2 fd:1
Jul 16 16:59:37 ceres kernel: disk 0, o:0, dev:hdb8
Jul 16 16:59:37 ceres kernel: disk 1, o:1, dev:hda8
Jul 16 16:59:37 ceres kernel: disk 2, o:1, dev:hdc8
Jul 16 16:59:37 ceres kernel: RAID5 conf printout:
Jul 16 16:59:37 ceres kernel: --- rd:3 wd:2 fd:1
Jul 16 16:59:37 ceres kernel: disk 1, o:1, dev:hda8
Jul 16 16:59:37 ceres kernel: disk 2, o:1, dev:hdc8
---

Now, is this a broken IDE controller or harddisk? Because smartctl
claims that everything is fine.
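
(For what it's worth, "fine" so far just means what smartctl reports at
a glance. A rough sketch of a deeper check that might help tell disk
from controller, assuming the usual smartmontools attribute names:

# start a long offline self-test on the suspect drive
smartctl -t long /dev/hdb
# once it has finished: self-test log and raw attribute counters
smartctl -l selftest /dev/hdb
smartctl -A /dev/hdb

Growing Reallocated_Sector_Ct or Current_Pending_Sector counts would
point at the disk itself, while a rising UDMA_CRC_Error_Count is more
typical of a cable or controller problem, as far as I understand it.)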
> > The problem now is, the machine hangs after the last message and I can only
> > turn it off by physically removing the power plug.
>
> alt-sysrq-P or alt-sysrq-T give anything useful?

I tried alt-sysrq-o and -b, to no avail. Support for it is in my kernel
and it works (tested earlier).

> > When I now reboot the machine, `mdadm -A /dev/md[1-5]' will not start the
> > arrays cleanly. They will all be lacking the hdb device and be 'inactive'.
> > `mdadm -R' will not start them in this state. According to
> > `mdadm --manage --help' using `mdadm --manage /dev/md/3 -a /dev/hdb6'
> > should add /dev/hdb6 to /dev/md/3, but nothing really happens.
> > After some trying, I realised that `mdadm /dev/md/3 -a /dev/hdb6' actually
> > works. So where's the problem? The help message? The parameter parsing code?
> > My understanding?
>
> I don't understand. 'mdadm --manage /dev/md/3 -a /dev/hdb6' is
> exactly the same command as without the --manage. Maybe if you
> provide a log of exactly what you did, exactly what the messages were,
> and exactly what the result (e.g. in /proc/mdstat) was.

I don't have a script log or something, but here's what I did from an
initrd with init=/bin/bash:

# < mount /dev /proc /sys /tmp >
# < start udevd udevtrigger udevsettle >
while read a dev c ; do
    [ "$a" != "ARRAY" ] && continue
    [ -e /dev/md/${dev##*/} ] || /bin/mknod $dev b 9 ${dev##*/}
    /sbin/mdadm -A ${dev}
done < /etc/mdadm.conf

This is the mdadm.conf:

DEVICE partitions
ARRAY /dev/md/0 level=raid1 num-devices=3 UUID=3559ffcf:14eb9889:3826d6c2:c13731d7
ARRAY /dev/md/1 level=raid5 num-devices=3 UUID=649fc7cc:d4b52c31:240fce2c:c64686e7
ARRAY /dev/md/2 level=raid5 num-devices=3 UUID=9a3bf634:58f39e44:27ba8087:d5189766 spares=1
ARRAY /dev/md/3 level=raid5 num-devices=3 UUID=29ff75f4:66f2639c:976cbcfe:1bd9a1b4 spares=1
ARRAY /dev/md/4 level=raid5 num-devices=3 UUID=d4799be3:5b157884:e38718c2:c05ab840 spares=1
ARRAY /dev/md/5 level=raid5 num-devices=3 UUID=ca4a6110:4533d8d5:0e2ed4e1:2f5805b2 spares=1
MAIL root@localhost

At this moment, only /dev/md/0 was active.
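
(Side note: the ARRAY lines can be cross-checked against what is
actually recorded in the component superblocks, e.g.:

# print ARRAY lines as derived from the on-disk superblocks; the UUIDs
# should match the ones in /etc/mdadm.conf above
mdadm --examine --scan
# with -v it also lists which partitions each UUID was found on
mdadm --examine --scan -v

This is only a sketch of how such a config could be verified, not
output captured from this machine.)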
Reconstructed, /proc/mdstat looked something like this:

Personalities : [linear] [raid0] [raid1] [raid5] [raid4]
md5 : inactive raid5 hda8[1] hdc8[2]
      451426304 blocks level 5, 64k chunk, algorithm 2 [2/3] [_UU]
md4 : inactive raid5 hda7[1] hdc7[2]
      13992320 blocks level 5, 64k chunk, algorithm 2 [2/3] [_UU]
md3 : inactive raid5 hdc6[1] hda6[0]
      8000128 blocks level 5, 64k chunk, algorithm 2 [2/3] [_UU]
md2 : inactive raid5 hda5[1] hdc5[2]
      5991936 blocks level 5, 64k chunk, algorithm 2 [2/3] [_UU]
md1 : inactive raid5 hda3[1] hdc3[2]
      5992064 blocks level 5, 64k chunk, algorithm 2 [2/3] [_UU]
md0 : active raid1 hdb1[0] hdc1[2] hda1[1]
      497856 blocks [3/3] [UUU]
unused devices: <none>

I'm not sure about the line containing blocks, level and such, but I'm
sure about the first line of each mdX.

At this point, doing anything to /dev/md/[1-5] would give me an
Input/Output error. Running

# mdadm -R /dev/md/1

would give me this error:

Jul 16 17:17:42 ceres kernel: raid5: cannot start dirty degraded array for md1

# mdadm --manage /dev/md/1 --add /dev/hdb3

would do nothing. No message on stdout, stderr, kernel or anything. It
would just do nothing.

# mdadm /dev/md/1 --add /dev/hdb3

would in turn add hdb3 to md/1 and then I was able to

# mdadm -R /dev/md/1

and the resync would start.

Right now I'm quite sure the problem will arise again (see messages
above). I'll try to create a script log of what happens when I
encounter the problem again.
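To have the workaround in one place (md1 shown, device names as above):

# optional sanity check before touching anything: the kicked-out member
# should show an older event count than the two surviving ones
mdadm --examine /dev/hda3 /dev/hdb3 /dev/hdc3 | grep -E 'Events|State'
# add the kicked-out member back, then start the array; resync begins
mdadm /dev/md/1 --add /dev/hdb3
mdadm -R /dev/md/1
cat /proc/mdstat

The --examine line is only what I would expect to look at next time,
not output I captured during the incident; the --add/-R pair is what
actually got the arrays resyncing here.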
Greetings,
	Benjamin

--
Today, memory either forgets things when you don't want it to, or
remembers things long after they're better forgotten.