Re: more software raid questions


On Tue, Oct 19, 2010 at 07:34:19PM -0700, Nataraj wrote:
> fred smith wrote:
> > hi all!
> >
> > back in Aug several of you assisted me in solving a problem where one
> > of my drives had dropped out of (or been kicked out of) the raid1 array.
> >
> > something vaguely similar appears to have happened just a few mins ago,
> > upon rebooting after a small update. I received four emails like this,
> > one for /dev/md0, one for /dev/md1, one for /dev/md125 and one for
> > /dev/md126:
> >
> > 	Subject: DegradedArray event on /dev/md125:fcshome.stoneham.ma.us
> > 	X-Spambayes-Classification: unsure; 0.24
> > 	Status: RO
> > 	Content-Length: 564
> > 	Lines: 23
> >
> > 	This is an automatically generated mail message from mdadm
> > 	running on fcshome.stoneham.ma.us
> >
> > 	A DegradedArray event had been detected on md device /dev/md125.
> >
> > 	Faithfully yours, etc.
> >
> > 	P.S. The /proc/mdstat file currently contains the following:
> >
> > 	Personalities : [raid1] 
> > 	md0 : active raid1 sda1[0]
> > 	      104320 blocks [2/1] [U_]
> > 	      
> > 	md126 : active raid1 sdb1[1]
> > 	      104320 blocks [2/1] [_U]
> > 	      
> > 	md125 : active raid1 sdb2[1]
> > 	      312464128 blocks [2/1] [_U]
> > 	      
> > 	md1 : active raid1 sda2[0]
> > 	      312464128 blocks [2/1] [U_]
> > 	      
> > 	unused devices: <none>
> >
> > Firstly, what the heck are md125 and md126? Previously there were only
> > md0 and md1.
> >
> > Secondly, I'm not sure what it's trying to tell me: it says there was a
> > "DegradedArray event", but at the bottom it says there are no unused devices.
> >
> > there are also some messages in /var/log/messages from the time of the
> > boot earlier today, but they do NOT say anything about "kicking out"
> > any of the md member devices (as they did in the event back in August):
> >
> > 	Oct 19 18:29:41 fcshome kernel: device-mapper: dm-raid45: initialized v0.2594l
> > 	Oct 19 18:29:41 fcshome kernel: md: Autodetecting RAID arrays.
> > 	Oct 19 18:29:41 fcshome kernel: md: autorun ...
> > 	Oct 19 18:29:41 fcshome kernel: md: considering sdb2 ...
> > 	Oct 19 18:29:41 fcshome kernel: md:  adding sdb2 ...
> > 	Oct 19 18:29:41 fcshome kernel: md: sdb1 has different UUID to sdb2
> > 	Oct 19 18:29:41 fcshome kernel: md: sda2 has same UUID but different superblock to sdb2
> > 	Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sdb2
> > 	Oct 19 18:29:41 fcshome kernel: md: created md125
> > 	Oct 19 18:29:41 fcshome kernel: md: bind<sdb2>
> > 	Oct 19 18:29:41 fcshome kernel: md: running: <sdb2>
> > 	Oct 19 18:29:41 fcshome kernel: raid1: raid set md125 active with 1 out of 2 mirrors
> > 	Oct 19 18:29:41 fcshome kernel: md: considering sdb1 ...
> > 	Oct 19 18:29:41 fcshome kernel: md:  adding sdb1 ...
> > 	Oct 19 18:29:41 fcshome kernel: md: sda2 has different UUID to sdb1
> > 	Oct 19 18:29:41 fcshome kernel: md: sda1 has same UUID but different superblock to sdb1
> > 	Oct 19 18:29:41 fcshome kernel: md: created md126
> > 	Oct 19 18:29:41 fcshome kernel: md: bind<sdb1>
> > 	Oct 19 18:29:41 fcshome kernel: md: running: <sdb1>
> > 	Oct 19 18:29:41 fcshome kernel: raid1: raid set md126 active with 1 out of 2 mirrors
> > 	Oct 19 18:29:41 fcshome kernel: md: considering sda2 ...
> > 	Oct 19 18:29:41 fcshome kernel: md:  adding sda2 ...
> > 	Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sda2
> > 	Oct 19 18:29:41 fcshome kernel: md: created md1
> > 	Oct 19 18:29:41 fcshome kernel: md: bind<sda2>
> > 	Oct 19 18:29:41 fcshome kernel: md: running: <sda2>
> > 	Oct 19 18:29:41 fcshome kernel: raid1: raid set md1 active with 1 out of 2 mirrors
> > 	Oct 19 18:29:41 fcshome kernel: md: considering sda1 ...
> > 	Oct 19 18:29:41 fcshome kernel: md:  adding sda1 ...
> > 	Oct 19 18:29:41 fcshome kernel: md: created md0
> > 	Oct 19 18:29:41 fcshome kernel: md: bind<sda1>
> > 	Oct 19 18:29:41 fcshome kernel: md: running: <sda1>
> > 	Oct 19 18:29:41 fcshome kernel: raid1: raid set md0 active with 1 out of 2 mirrors
> > 	Oct 19 18:29:41 fcshome kernel: md: ... autorun DONE.
> >
> > and here's /etc/mdadm.conf:
> >
> > 	# cat /etc/mdadm.conf
> >
> > 	# mdadm.conf written out by anaconda
> > 	DEVICE partitions
> > 	MAILADDR fredex
> > 	ARRAY /dev/md0 level=raid1 num-devices=2 uuid=4eb13e45:b5228982:f03cd503:f935bd69
> > 	ARRAY /dev/md1 level=raid1 num-devices=2 uuid=5c79b138:e36d4286:df9cf6f6:62ae1f12
> >
> > which doesn't say anything about md125 or md126... might they be some kind
> > of detritus left over from whatever failure caused the array to become
> > degraded?
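> >
> > I assume I could at least poke at them with something like:
> >
> > 	mdadm --detail /dev/md125
> > 	mdadm --examine /dev/sdb2 | grep UUID
> >
> > but I'm not sure what I'd be looking for.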
> >
> > do ya suppose a boot from power-off might somehow give it a whack upside the head so
> > it'll reassemble itself according to mdadm.conf?
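> >
> > (Or is there a way to do that without rebooting? I'm guessing something
> > like stopping the extra arrays and then running
> >
> > 	mdadm --assemble --scan
> >
> > would re-read mdadm.conf, but I'd rather not guess on a live system.)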
> >
> > I'm not sure which devices need to be failed and re-added to make it clean again (which
> > is all I had to do when I had the aforementioned earlier problem.)
> >
> > Thanks in advance for any advice!
> >
> > Fred
> >
> >   
> I've seen this kind of thing happen when the autodetection stuff 
> misbehaves. I'm not sure why it does this or how to prevent it. Anyway, 
> to recover, I would use something like:
> 
> mdadm --stop /dev/md125
> mdadm --stop /dev/md126
> 
> If for some reason the above commands fail, check that the file systems
> from md125 and md126 haven't been automounted. Hopefully that won't be
> the case.
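> 
> Something like this should show whether anything from them is mounted
> before you stop them:
> mount | grep md12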
> 
> Then use:
> mdadm /dev/md0 -a /dev/sdXX
> to add back the drive that belongs in md0, and do the same for md1. In
> general it won't let you add the wrong drive, but if you want to check, use:
> mdadm --examine /dev/sda1 | grep UUID
> and so on for each of your drives, matching up the ones with the same UUID.
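> 
> Putting it together for your case, and assuming sdb1 is the missing half
> of md0 and sdb2 the missing half of md1 (the --examine output should
> confirm that), the whole sequence would look roughly like:
> mdadm --stop /dev/md126
> mdadm /dev/md0 -a /dev/sdb1
> mdadm --stop /dev/md125
> mdadm /dev/md1 -a /dev/sdb2
> Then keep an eye on /proc/mdstat while the mirrors resync.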

Well, I've already tried to use --fail and --remove on md125 and md126
but I'm told the members are still active.

mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md125 --fail /dev/sdb2 --remove /dev/sdb2

	mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
	mdadm: set /dev/sdb1 faulty in /dev/md126
	mdadm: hot remove failed for /dev/sdb1: Device or resource busy

with the intention of then re-adding them to md0 and md1.

So I tried:

mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
and got a similar message.

At which point I knew I was in over my head.
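
I'm guessing I need to --stop md125 and md126 rather than fail/remove their
members, as you suggest, but before I touch anything else I figure I should
look at something like:

mdadm --detail /dev/md125
mdadm --detail /dev/md126
mount | grep md12

to see whether anything is actually using them.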

> 
> When I create my RAID arrays, I always use the option --bitmap=internal.
> With this option set, a bitmap keeps track of which regions of the drive
> are out of date, so after an event like this only the out-of-date regions
> are resynced instead of recopying the whole drive. I once added a bitmap
> to an existing raid1 array; if I remember right, the command is:
> mdadm --grow /dev/mdN --bitmap=internal
> 
> Adding the bitmap is well worth it: it saves time and reduces the risk of
> data loss by not having to recopy the whole partition.
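> 
> You can tell whether an array already has one by looking at /proc/mdstat
> (arrays with an internal bitmap show a "bitmap:" line), or with something
> like:
> mdadm --detail /dev/md1 | grep -i bitmap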
> 
> Nataraj
> _______________________________________________
> CentOS mailing list
> CentOS@xxxxxxxxxx
> http://lists.centos.org/mailman/listinfo/centos

-- 
-------------------------------------------------------------------------------
 .----    Fred Smith   /              
( /__  ,__.   __   __ /  __   : /     
 /    /  /   /__) /  /  /__) .+'           Home: fredex@xxxxxxxxxxxxxxxxxxxxxx 
/    /  (__ (___ (__(_ (___ / :__                                 781-438-5471 
-------------------------------- Jude 1:24,25 ---------------------------------
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos

