On Tue, Oct 19, 2010 at 07:34:19PM -0700, Nataraj wrote:
> fred smith wrote:
> > hi all!
> >
> > back in Aug several of you assisted me in solving a problem where one
> > of my drives had dropped out of (or been kicked out of) the raid1 array.
> >
> > something vaguely similar appears to have happened just a few mins ago,
> > upon rebooting after a small update. I received four emails like this,
> > one for /dev/md0, one for /dev/md1, one for /dev/md125 and one for
> > /dev/md126:
> >
> > Subject: DegradedArray event on /dev/md125:fcshome.stoneham.ma.us
> > X-Spambayes-Classification: unsure; 0.24
> > Status: RO
> > Content-Length: 564
> > Lines: 23
> >
> > This is an automatically generated mail message from mdadm
> > running on fcshome.stoneham.ma.us
> >
> > A DegradedArray event had been detected on md device /dev/md125.
> >
> > Faithfully yours, etc.
> >
> > P.S. The /proc/mdstat file currently contains the following:
> >
> > Personalities : [raid1]
> > md0 : active raid1 sda1[0]
> >       104320 blocks [2/1] [U_]
> >
> > md126 : active raid1 sdb1[1]
> >       104320 blocks [2/1] [_U]
> >
> > md125 : active raid1 sdb2[1]
> >       312464128 blocks [2/1] [_U]
> >
> > md1 : active raid1 sda2[0]
> >       312464128 blocks [2/1] [U_]
> >
> > unused devices: <none>
> >
> > firstly, what the heck are md125 and md126? previously there was
> > only md0 and md1.... ????
> >
> > secondly, I'm not sure what it's trying to tell me. it says there was a
> > "degradedarray event" but at the bottom it says there are no unused devices.
> >
> > there are also some messages in /var/log/messages from the time of the
> > boot earlier today, but they do NOT say anything about "kicking out"
> > any of the md member devices (as they did in the event back in August):
> >
> > Oct 19 18:29:41 fcshome kernel: device-mapper: dm-raid45: initialized v0.2594l
> > Oct 19 18:29:41 fcshome kernel: md: Autodetecting RAID arrays.
> > Oct 19 18:29:41 fcshome kernel: md: autorun ...
> > Oct 19 18:29:41 fcshome kernel: md: considering sdb2 ...
> > Oct 19 18:29:41 fcshome kernel: md: adding sdb2 ...
> > Oct 19 18:29:41 fcshome kernel: md: sdb1 has different UUID to sdb2
> > Oct 19 18:29:41 fcshome kernel: md: sda2 has same UUID but different superblock to sdb2
> > Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sdb2
> > Oct 19 18:29:41 fcshome kernel: md: created md125
> > Oct 19 18:29:41 fcshome kernel: md: bind<sdb2>
> > Oct 19 18:29:41 fcshome kernel: md: running: <sdb2>
> > Oct 19 18:29:41 fcshome kernel: raid1: raid set md125 active with 1 out of 2 mirrors
> > Oct 19 18:29:41 fcshome kernel: md: considering sdb1 ...
> > Oct 19 18:29:41 fcshome kernel: md: adding sdb1 ...
> > Oct 19 18:29:41 fcshome kernel: md: sda2 has different UUID to sdb1
> > Oct 19 18:29:41 fcshome kernel: md: sda1 has same UUID but different superblock to sdb1
> > Oct 19 18:29:41 fcshome kernel: md: created md126
> > Oct 19 18:29:41 fcshome kernel: md: bind<sdb1>
> > Oct 19 18:29:41 fcshome kernel: md: running: <sdb1>
> > Oct 19 18:29:41 fcshome kernel: raid1: raid set md126 active with 1 out of 2 mirrors
> > Oct 19 18:29:41 fcshome kernel: md: considering sda2 ...
> > Oct 19 18:29:41 fcshome kernel: md: adding sda2 ...
> > Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sda2
> > Oct 19 18:29:41 fcshome kernel: md: created md1
> > Oct 19 18:29:41 fcshome kernel: md: bind<sda2>
> > Oct 19 18:29:41 fcshome kernel: md: running: <sda2>
> > Oct 19 18:29:41 fcshome kernel: raid1: raid set md1 active with 1 out of 2 mirrors
> > Oct 19 18:29:41 fcshome kernel: md: considering sda1 ...
> > Oct 19 18:29:41 fcshome kernel: md: adding sda1 ...
> > Oct 19 18:29:41 fcshome kernel: md: created md0
> > Oct 19 18:29:41 fcshome kernel: md: bind<sda1>
> > Oct 19 18:29:41 fcshome kernel: md: running: <sda1>
> > Oct 19 18:29:41 fcshome kernel: raid1: raid set md0 active with 1 out of 2 mirrors
> > Oct 19 18:29:41 fcshome kernel: md: ... autorun DONE.
> >
> > and here's /etc/mdadm.conf:
> >
> > # cat /etc/mdadm.conf
> >
> > # mdadm.conf written out by anaconda
> > DEVICE partitions
> > MAILADDR fredex
> > ARRAY /dev/md0 level=raid1 num-devices=2 uuid=4eb13e45:b5228982:f03cd503:f935bd69
> > ARRAY /dev/md1 level=raid1 num-devices=2 uuid=5c79b138:e36d4286:df9cf6f6:62ae1f12
> >
> > which doesn't say anything about md125 or md126,... might they be some kind of
> > detritus or fragments left over from whatever kind of failure caused the array
> > to become degraded?
> >
> > do ya suppose a boot from power-off might somehow give it a whack upside the
> > head so it'll reassemble itself according to mdadm.conf?
> >
> > I'm not sure which devices need to be failed and re-added to make it clean
> > again (which is all I had to do when I had the aforementioned earlier problem.)
> >
> > Thanks in advance for any advice!
> >
> > Fred
> >
>
> I've seen this kind of thing happen when the autodetection stuff
> misbehaves. I'm not sure why it does this or how to prevent it. Anyway,
> to recover, I would use something like:
>
> mdadm --stop /dev/md125
> mdadm --stop /dev/md126
>
> If for some reason the above commands fail, check and make sure it has
> not automounted the file systems from md125 and md126. Hopefully this
> won't happen.
>
> Then use:
> mdadm /dev/md0 -a /dev/sdXX
> to add back the drive which belongs in md0, and similar for md1. In
> general, it won't let you add the wrong drive, but if you want to check use:
> mdadm --examine /dev/sda1 | grep UUID
> and so forth for all your drives and find the ones with the same UUID.

Well, I've already tried to use --fail and --remove on md125 and md126,
but I'm told the members are still active:

mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md125 --fail /dev/sdb2 --remove /dev/sdb2

mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md126
mdadm: hot remove failed for /dev/sdb1: Device or resource busy

with the intention of then re-adding them to md0 and md1. so I tried:

mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1

and got a similar message. at which point I knew I was in over my head.

> When I create my Raid arrays, I always use the option --bitmap=internal.
> With this option set, a bitmap is used to keep track of which pages on
> the drive are out of date and then you only resync pages which need
> updating instead of recopying the whole drive when this happens. In the
> past I once added a bitmap to an existing raid1 array using something
> like this. This may not be the exact command, but I know it can be done:
> mdadm /dev/mdN --bitmap=internal
>
> Adding the bitmap is very worthwhile and saves time and risk of data
> loss by not having to recopy the whole partition.
>
> Nataraj
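To see where md125 and md126 came from, it's worth comparing what the on-disk
superblocks claim against the two arrays mdadm.conf actually defines. Something
along these lines should show it (just a sketch, not output from the box above):

  # what arrays do the superblocks on each partition think they belong to?
  mdadm --examine --scan

  # what do the kernel's half-assembled arrays actually contain?
  mdadm --detail /dev/md125
  mdadm --detail /dev/md126

  # versus the two arrays the config file defines (md0 and md1 only)
  cat /etc/mdadm.conf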
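Putting Nataraj's suggestion together with the mdstat output above, the recovery
would go roughly like this: sdb1 and sdb2 are the members currently sitting in
md126 and md125, so they presumably belong back in md0 and md1 respectively.
This is a sketch only, so double-check the UUIDs before re-adding anything:

  # the spurious arrays keep sdb1/sdb2 busy, which is why --fail/--remove
  # was refused; stop them first
  mdadm --stop /dev/md126
  mdadm --stop /dev/md125

  # confirm which array each partition really belongs to
  mdadm --examine /dev/sdb1 | grep UUID   # should match md0's uuid in mdadm.conf
  mdadm --examine /dev/sdb2 | grep UUID   # should match md1's uuid in mdadm.conf

  # re-add the freed members to the real arrays and watch the resync
  mdadm /dev/md0 -a /dev/sdb1
  mdadm /dev/md1 -a /dev/sdb2
  cat /proc/mdstat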
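On the bitmap: the usual way to add an internal write-intent bitmap to an array
that already exists is through mdadm's grow mode, though it's worth checking the
man page for the mdadm version on the box. Another sketch, with /dev/mdN and the
member names as placeholders:

  # add a write-intent bitmap to an existing array
  mdadm --grow /dev/mdN --bitmap=internal

  # or specify it up front when creating a brand-new array, as Nataraj does
  mdadm --create /dev/mdN --level=1 --raid-devices=2 --bitmap=internal /dev/sdX1 /dev/sdY1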
-- 
Fred Smith
Home: fredex@xxxxxxxxxxxxxxxxxxxxxx / 781-438-5471
Jude 1:24,25