On Mon, 27 Apr 2015 11:35:09 +0200 Peter van Es <vanes.peter@xxxxxxxxx> wrote:

> Sorry for the long post...
>
> I am running Ubuntu LTS 14.04.02 Server edition, 64 bits, with 4x 2.0TB
> drives in a raid-5 array.
>
> The 4th drive was beginning to show read errors. Because it was the
> weekend, I could not go out and buy a spare 2TB drive to replace the one
> that was beginning to fail.
>
> I first got a fail event:
>
> This is an automatically generated mail message from mdadm
> running on bali
>
> A Fail event had been detected on md device /dev/md/1.
>
> It could be related to component device /dev/sdd2.
>
> Faithfully yours, etc.
>
> P.S. The /proc/mdstat file currently contains the following:
>
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
> md1 : active raid5 sdc2[2] sdb2[1] sda2[0] sdd2[3](F)
>       5854290432 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
>
> md0 : active raid5 sdc1[2] sdd1[3] sdb1[1] sda1[0]
>       5850624 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
>
> unused devices: <none>
>
> And then subsequently, around 18 hours later:
>
> This is an automatically generated mail message from mdadm
> running on bali
>
> A DegradedArray event had been detected on md device /dev/md/1.

This isn't really reporting anything new. There is probably a daily cron job
which reports all degraded arrays. This message is reported by that job.

> Faithfully yours, etc.
>
> P.S. The /proc/mdstat file currently contains the following:
>
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
> md1 : active raid5 sdc2[2] sdb2[1] sda2[0] sdd2[3](F)
>       5854290432 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
>
> md0 : active raid5 sdc1[2] sdd1[3] sdb1[1] sda1[0]
>       5850624 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
>
> unused devices: <none>
>
> The server had taken the array off line at that point.

Why do you think the array is off-line? The above message doesn't suggest
that.

> Needless to say, I can't boot the system anymore as the boot drive is
> /dev/md0, and GRUB can't get at it. I do need to recover data (I know, but
> there's stuff on there I have no backup for--yet).

You boot off a RAID5? Does grub support that? I didn't know.
But md0 hasn't failed, has it? Confused.

> I booted Linux from a USB stick (which is on /dev/sdc1, hence changing the
> numbering), in recovery mode. Below is the output of /proc/mdstat and
> mdadm --examine. It looks like somehow the /dev/sdd2 and /dev/sde2 drives
> took on the super block of the /dev/md127 device (my swap file). May that
> have been done by the boot from the Ubuntu USB stick?

There is something VERY sick here. I suggest that you tread very carefully.

All your '1' partitions should be about 2GB and the '2' partitions about
2TB, but the --examine output suggests sda2 and sdb2 are 2TB, while sdd2 and
sde2 are 2GB. That really, really shouldn't happen. Maybe check your
partition table (fdisk). I really cannot see how this would happen.

> My plan... assemble a degraded array, with /dev/sde2 (the 4th drive,
> formerly known as /dev/sdd2) not in it. Because the fail event put the
> file system in RO mode, I expect /dev/sdd2 (formerly /dev/sdc2) to be ok.
> Then insert new 2TB drive in slot 4. Let system resync and recover.
>
> I'm running xfs on the /dev/md1 device.
>
> Questions:
>
> 1. is this the wise course of action ?
> 2. how exactly do I reassemble the array (/etc/mdadm.conf is inaccessible
>    in recovery mode)
> 3. what command line options do I use exactly from the --examine output
>    below without screwing things up
>
> Any help or pointers gratefully accepted.

Can you run
  mdadm -Ss
to stop all the arrays, then
  fdisk -l /dev/sd?
then
  mdadm -Esvv
and post all of that? Hopefully some of it will make sense.

NeilBrown
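For reference, the degraded state is visible directly in the /proc/mdstat
lines quoted above: the [n/m] field reports total versus active devices, and
md1's "[4/3] [UUU_]" means 4 slots with only 3 active (sdd2 failed). The
helper below is my own illustration, not part of mdadm; it flags any array
whose active count is below its slot count, fed here with the sample from
the mail so it can be tried anywhere:

```shell
#!/bin/sh
# Hypothetical check (not from the original mail): flag degraded md arrays
# from /proc/mdstat-format input. A degraded array reports [n/m] with m < n.
check_degraded() {
    awk '
        /^md/ { dev = $1 }                  # remember the current array name
        match($0, /\[[0-9]+\/[0-9]+\]/) {   # the [total/active] field
            split(substr($0, RSTART + 1, RLENGTH - 2), c, "/")
            if (c[1] != c[2])
                print dev " is degraded: " c[2] " of " c[1] " devices active"
        }'
}

# Sample taken from the mdstat output quoted in the mail; on the real
# machine you would run:  check_degraded < /proc/mdstat
check_degraded <<'EOF'
md1 : active raid5 sdc2[2] sdb2[1] sda2[0] sdd2[3](F)
      5854290432 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
md0 : active raid5 sdc1[2] sdd1[3] sdb1[1] sda1[0]
      5850624 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
EOF
# prints: md1 is degraded: 3 of 4 devices active
```

Note that md0's "[4/4] [UUUU]" is healthy, which is why only md1 is
reported.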
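The diagnostic sequence requested above could be collected in one script.
The dry-run wrapper is my own addition (only the three mdadm/fdisk commands
come from the mail), so that nothing runs by accident: set RUN=1 and execute
as root on the affected machine to actually run them.

```shell
#!/bin/sh
# Sketch of the requested diagnostics, guarded by a dry-run wrapper
# (the wrapper is illustrative, not part of the original advice).
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"                      # RUN=1: really execute (needs root)
    else
        echo "would run: $*"      # default: just show what would happen
    fi
}

run mdadm -Ss            # stop all md arrays so superblocks can be re-read
run fdisk -l /dev/sd?    # print each disk's partition table
run mdadm -Esvv          # scan all devices, examine superblocks, verbose
```

Posting the output of the last two commands is what lets the odd partition
sizes (2GB where 2TB is expected) be diagnosed before any assemble is
attempted.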