Re: RAID5 recovering

Thanks a lot!
The array seems to have started with only minor problems:

mdadm: forcing event count in /dev/sdd1(3) from 112333 upto 112358
mdadm: clearing FAULTY flag for device 1 in /dev/md0 for /dev/sdd1
mdadm: /dev/md0 has been started with 3 drives (out of 4).
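
Following Robin's advice, the next step will be to re-add /dev/sdb1 (whose
superblock I erased) and let it rebuild; something along these lines:

   mdadm /dev/md0 --add /dev/sdb1   # no superblock left, so it rebuilds from scratch
   cat /proc/mdstat                 # watch the rebuild progress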

The file systems are corrupted, but not too seriously.
I will look into RAID6 in the future.
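
From what I read, the conversion could apparently be done in place after
adding a fifth disk; a rough sketch (the /dev/sdf1 name is hypothetical):

   mdadm /dev/md0 --add /dev/sdf1
   mdadm --grow /dev/md0 --level=6 --raid-devices=5 \
       --backup-file=/root/md0-grow.backup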

Thanks again,
Pierre

Pierre MARTINEAU

Institut de Recherche en Cancérologie de Montpellier
Inserm U896 – Université Montpellier 1 – CRLC Val d’Aurelle
Campus Val d’Aurelle
208 Rue des Apothicaires
F-34298 Montpellier Cedex 5, France

Tel: +33 (0)4 67 61 37 43
Fax: +33 (0)4 67 61 37 87
E-mail: pierre.martineau@xxxxxxxxx
E-mail: pierre.martineau@xxxxxxxxxxxxxxxxxxxxxxxx
Site internet: http://www.ircm.fr

On 15/04/2013 17:19, Robin Hill wrote:
On Mon Apr 15, 2013 at 03:47:39PM +0200, Pierre Martineau wrote:

Dear Raid experts,

I have a RAID5 volume that recently crashed, and I need your advice
before doing anything irreversible.

Let me first summarize the past and current state.

1) I had a nicely running RAID5 volume with 3 x 1 TB disks (LVM on top,
with several LVM volumes in ext3 and ext4), but the volume had become a
bit too small, so I decided to add a new 1 TB disk.

Given the rebuild time for a 1 TB disk, I'd be wary of running RAID5 - if
you have the space, adding another disk and going to RAID6 will be much
safer.

2) I added a new disk and did not do anything for a couple of days (RAID
still running with 3 disks).

3) One of the old disks failed and was ejected from the RAID.

4) The ejected disk was not even present as /dev/sdX. I therefore checked
the connections, and the disk came back.

5) I re-synced the ejected disk and was back to my original 3-disk array.

6) I waited 2-3 days and everything was fine. I then added the new disk
and re-synced.

7) I now had a running 4-disk RAID5 array; I created a new volume and
started copying data onto it.
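
I did not save the exact commands, but the grow was presumably the usual
add-and-reshape sequence, roughly:

   mdadm /dev/md0 --add /dev/sdd1
   mdadm --grow /dev/md0 --raid-devices=4 --backup-file=/root/md0-grow.backup
   pvresize /dev/md0   # enlarge the LVM PV on top of the array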

8) During the weekend, 2 disks were ejected from the array: the newly
installed one and the same one as previously (step 3).

9) Again, the 2 disks were not present as /dev/sdX. I checked the
connections again, and the problem was a Molex connector: the two ejected
disks were on the same Molex cable, which explains why both were detected
as faulty.

Now, my list of newbie errors.

4) I did not save all the information before proceeding (mdadm
--examine, /etc/mdadm/mdadm.conf, syslog, ...).

5) I tried to assemble the disks with
   mdadm --assemble --scan
with no result.

6) I then tried (and this, I think, is my big error!):
   mdadm --assemble /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

I forgot /dev/md0 after --assemble in this command, so mdadm took
/dev/sdb1 as the array device rather than a member. Because of this, the
/dev/sdb1 superblock was removed, and "mdadm --examine /dev/sdb1" now
returns "No md superblock detected on /dev/sdb1".

I would now like to be more cautious. If some kind expert from the list
would be nice enough to tell me whether the method proposed below is the
right approach, I will be grateful for the rest of my life :-)

7) I read the RAID wiki and the list.

8) I saved
mdadm --examine /dev/sd[bcde]1
dmesg
syslog
/etc/mdadm/mdadm.conf
fdisk -lu /dev/sd[bcde]

I have put the contents of these files at the end of this message (except
dmesg and syslog, because they are very long).

9) /dev/sdd is the new disk. This is clear in the fdisk listing, since it
is a 4K-sector disk.
The normal order of the RAID is thus (see mdadm --examine /dev/sd[de]1):
sdb1 sdc1 sde1 sdd1
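
To cross-check the slot order and event counts from the surviving
superblocks, something like this works (the grep pattern is only
illustrative):

   mdadm --examine /dev/sd[cde]1 | grep -E '/dev/|Events|this'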

10) Events are
/dev/sdb1: no md superblock (see 6)
/dev/sdc1: Events : 112358
/dev/sdd1: Events : 112333
/dev/sde1: Events : 112358

It seems that sdd was the first disk removed.
Presumably sdb1 is in sync, since it was running with sdc1 when sdd1 and
sde1 were ejected from the array (see 8), but I can't be sure, since I
stupidly erased its superblock!

11) I propose to re-create the array with the --assume-clean option,
then check everything using "fsck -n" and "mount -o ro".
The command would be:

   mdadm --create /dev/md0 -e 0.90 --assume-clean --level=5 \
       --raid-devices=4 --chunk=64 --size=976759936 \
       /dev/sdb1 /dev/sdc1 /dev/sde1 /dev/sdd1
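
The checks afterwards would be something along these lines (the LVM names
below are made up; mine differ):

   vgchange -ay                      # activate the LVM volumes on /dev/md0
   fsck -n /dev/vg0/lvdata           # read-only filesystem check
   mount -o ro /dev/vg0/lvdata /mnt  # read-only mount for inspection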

<-- snip -->

Have you tried to force assemble the array first? Recreating the array
is a risky option, so should be avoided if possible. First try doing:
   mdadm -Af /dev/md0 /dev/sd[cde]1

If that works then you'll need to re-add (and rebuild) /dev/sdb1. If it
doesn't work, try rerunning (after making sure the array is stopped) and
adding "-vvv" for extra verbosity, then send through the output from
that and anything relevant from dmesg.
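
That is, something along these lines (a sketch):

   mdadm --stop /dev/md0
   mdadm -Afvvv /dev/md0 /dev/sd[cde]1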

HTH,
     Robin
