Re: Problem with --manage

Neil Brown <neilb@xxxxxxx> · Tue, 18 Jul 2006 15:46:53 +1000

On Monday July 17, blindcoder@xxxxxxxxxxxxxxxxxxxx wrote:
> 
> /dev/md/0 on /boot type ext2 (rw,nogrpid)
> /dev/md/1 on / type reiserfs (rw)
> /dev/md/2 on /var type reiserfs (rw)
> /dev/md/3 on /opt type reiserfs (rw)
> /dev/md/4 on /usr type reiserfs (rw)
> /dev/md/5 on /data type reiserfs (rw)
> 
> I'm running the following kernel:
> Linux ceres 2.6.16.18-rock #1 SMP PREEMPT Sun Jun 25 10:47:51 CEST 2006 i686 GNU/Linux
> 
> and mdadm 2.4.
> Now, hdb seems to be broken, even though SMART says everything's fine.
> After a day or two, hdb would fail:
> 
> Jul 16 16:58:41 ceres kernel: raid5: Disk failure on hdb3, disabling device. Operation continuing on 2 devices
> Jul 16 16:58:41 ceres kernel: raid5: Disk failure on hdb5, disabling device. Operation continuing on 2 devices
> Jul 16 16:59:06 ceres kernel: raid5: Disk failure on hdb7, disabling device. Operation continuing on 2 devices
> Jul 16 16:59:37 ceres kernel: raid5: Disk failure on hdb8, disabling device. Operation continuing on 2 devices
> Jul 16 17:02:22 ceres kernel: raid5: Disk failure on hdb6, disabling device. Operation continuing on 2 devices

Very odd... are there no other messages from the kernel?  You would
expect an error message if there were a real error.

> 
> Strangely enough, this never happens to the raid1 md0 device, so maybe I'm
> totally wrong about hdb failing after all.

Not too surprising - md0 is /boot and it is likely that you are doing
no IO to that filesystem.

> 
> The problem now is, the machine hangs after the last message and I can only
> turn it off by physically removing the power plug.

Do alt-sysrq-P or alt-sysrq-T give anything useful?
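
If the key combination produces nothing at all, the magic SysRq key may
simply be disabled.  A minimal sketch for checking that before the next
hang, assuming the kernel was built with CONFIG_MAGIC_SYSRQ and exposes
/proc/sys/kernel/sysrq (the function name sysrq_state is made up for
illustration):

```shell
# sysrq_state [FILE]: report whether the magic SysRq key is enabled.
# FILE defaults to /proc/sys/kernel/sysrq; the argument exists only so
# the function can be tried against an ordinary file (an assumption of
# this sketch, not something the kernel requires).
sysrq_state() {
    file=${1:-/proc/sys/kernel/sysrq}
    if [ ! -r "$file" ]; then
        echo "unknown"
        return 2
    fi
    value=$(cat "$file")
    case "$value" in
        0) echo "disabled"; return 1 ;;
        1) echo "enabled" ;;
        # Any other value: kernels that support it treat this as a
        # bitmask of permitted SysRq functions.
        *) echo "other setting ($value)" ;;
    esac
}
```

If it reports "disabled", running `echo 1 > /proc/sys/kernel/sysrq' as
root should turn the key on until the next reboot.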

Sounds like a device driver problem.

> 
> When I now reboot the machine, `mdadm -A /dev/md[1-5]' will not start the
> arrays cleanly. They will all be lacking the hdb device and be 'inactive'.
> `mdadm -R' will not start them in this state. According to
> `mdadm --manage --help' using `mdadm --manage /dev/md/3 -a /dev/hdb6'
> should add /dev/hdb6 to /dev/md/3, but nothing really happens.
> After some trying, I realised that `mdadm /dev/md/3 -a /dev/hdb6' actually
> works. So where's the problem? The help message? The parameter parsing code?
> My understanding?

I don't understand.  'mdadm --manage /dev/md/3 -a /dev/hdb6' is
exactly the same command as the one without --manage.  Could you
provide a log of exactly what you did, exactly what the messages were,
and exactly what the result (e.g. in /proc/mdstat) was?
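
One way to capture such a log is a small wrapper that records each
command next to its output and exit status.  This is only a sketch; the
helper name log_step is hypothetical and not part of mdadm:

```shell
# log_step CMD [ARGS...]: print the command line, run it with stderr
# merged into stdout, then print and return its exit status.
log_step() {
    printf '$ %s\n' "$*"
    "$@" 2>&1
    status=$?
    printf '[exit status: %s]\n' "$status"
    return "$status"
}
```

With that defined, something like the following, run as root with the
output redirected to a file, would show the state before and after each
form of the command:

    log_step cat /proc/mdstat
    log_step mdadm --manage /dev/md/3 -a /dev/hdb6
    log_step cat /proc/mdstat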

NeilBrown