Re: Failed drive while converting raid5 to raid6, then a hard reboot

Thank you for the reply, Neil.
I was initially using the mdadm packaged in Debian stable (v3.1.4),
but after the repeated drive "failures" I upgraded to the latest
release (3.2.3).
My current theory is that the drives drop out either because they are
"green" drives whose power-saving features cause them to be
"disconnected", or because the SATA cables that came with the
motherboard aren't good enough. I'm not certain of either, but these
seem the most likely causes at the moment. It could also be
incompatible hardware or the kernel I'm running (Proxmox Debian
kernel: 2.6.32-11-pve).
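
To test the power-saving theory, this is roughly what I intend to run
(a sketch only: the device list is taken from my mdstat output, and
not every drive honours hdparm -B):

# A rapidly growing Load_Cycle_Count would point at aggressive
# head parking / power saving on the "green" drives.
for d in /dev/sd[a-e]; do
    smartctl -A "$d" | grep -E 'Load_Cycle_Count|Power-Off_Retract'
done

# Try to disable APM entirely (255 = off) so the drives stop
# spinning down on their own; some drives ignore this setting.
for d in /dev/sd[a-e]; do
    hdparm -B 255 "$d"
done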

I got the array assembled (thank you), but what about the raid5 to
raid6 conversion? Do I have to let it run to completion before the
array is usable, or will mdadm know what to do? Can I cancel (revert)
the conversion and get the array back to raid5?
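
In case it helps, this is how I've been trying to see how far the
reshape got (a sketch; I'm assuming the "Reshape pos'n" line in the
--examine output is the right thing to look at, and I'm not sure
--detail reports anything useful while the array sits read-only):

# Overall array state, including any reshape progress it will report.
mdadm --detail /dev/md0

# Per-device superblock view of the interrupted raid5 -> raid6 reshape.
mdadm --examine /dev/sd[a-e]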

/proc/mdstat contains:

root@axiom:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active (read-only) raid6 sdc[6] sdb[5] sda[4] sdd[7]
      5860540224 blocks super 1.2 level 6, 32k chunk, algorithm 18 [5/3] [_UUU_]

unused devices: <none>

If I try to mount a logical volume from the volume group on the array,
the kernel panics and the system hangs. Is that related to the
incomplete conversion?
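
Until I know more I'm going to stick to read-only inspection before
attempting another mount; something like the following (a sketch only:
the VG/LV names are placeholders, I'm assuming ext3/ext4 inside the
LV, and "noload" skips journal replay, which may or may not be wise
here):

# Activate the LVM volumes on top of md0 (which is read-only anyway).
vgscan
vgchange -ay <vgname>

# Check the filesystem without changing anything (-n = answer "no").
fsck.ext4 -n /dev/<vgname>/<lvname>

# Read-only mount, skipping journal replay, just to see whether the
# data is reachable.
mount -o ro,noload /dev/<vgname>/<lvname> /mnt/test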

Thanks,
--
Hákon G.



On 8 May 2012 20:48, NeilBrown <neilb@xxxxxxx> wrote:
>
> On Mon, 30 Apr 2012 13:59:56 +0000 Hákon Gíslason <hakon.gislason@xxxxxxxxx> wrote:
>
> > Hello,
> > I've been having frequent drive "failures": a drive is reported
> > failed/bad and mdadm sends me an email telling me things went wrong,
> > but after a reboot or two it is perfectly fine again. I'm not sure
> > what the cause is, but this server is quite new and I suspect there
> > is more behind it, perhaps bad memory or the motherboard (I've been
> > having other issues as well). I've had 4 drive "failures" this month,
> > all different drives except for one, which "failed" twice, and all
> > have been fixed with a reboot or rebuild (every drive reported bad
> > by mdadm passed an extensive SMART test).
> > Due to this, I decided to convert my raid5 array to a raid6 array
> > while I find the root cause of the problem.
> >
> > I started the conversion right after a drive failure & rebuild, but
> > when it had converted/reshaped approx. 4% (if I remember correctly;
> > it was going really slowly, ~7500 minutes to completion), it reported
> > another drive bad and the conversion to raid6 stopped (it said
> > "rebuilding", but the speed was 0K/sec and the time left was a few
> > million minutes).
> > After that happened, I tried to stop the array and reboot the server,
> > as I had done previously to get the reportedly "bad" drive working
> > again, but it wouldn't stop the array or reboot, nor could I unmount
> > it; it just hung whenever I tried to do anything with /dev/md0. After
> > trying to reboot a few times, I just killed the power and restarted
> > it. Admittedly this was probably not the best thing I could have done
> > at that point.
> >
> > I have a backup of roughly 80% of the data on there; it's been a
> > month since the last complete backup (I ran out of backup disk
> > space).
> >
> > So, the big question, can the array be activated, and can it complete
> > the conversion to raid6? And will I get my data back?
> > I hope the data can be rescued, and any help I can get would be much
> > appreciated!
> >
> > I'm fairly new to raid in general, and have been using mdadm for about
> > a month now.
> > Here's some data:
> >
> > root@axiom:~# mdadm --examine --scan
> > ARRAY /dev/md/0 metadata=1.2 UUID=cfedbfc1:feaee982:4e92ccf4:45e08ed1
> > name=axiom.is:0
> >
> >
> > root@axiom:~# cat /proc/mdstat
> > Personalities : [raid6] [raid5] [raid4]
> > md0 : inactive sdc[6] sde[7] sdb[5] sda[4]
> >       7814054240 blocks super 1.2
> >
> > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
> > mdadm: /dev/md0 is already in use.
> >
> > root@axiom:~# mdadm --stop /dev/md0
> > mdadm: stopped /dev/md0
> >
> > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
> > mdadm: Failed to restore critical section for reshape, sorry.
> >       Possibly you needed to specify the --backup-file
> >
> > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
> > --backup-file=/root/mdadm-backup-file
> > mdadm: Failed to restore critical section for reshape, sorry.
>
> What version of mdadm are you using?
>
> I suggest getting a newer one (I'm about to release 3.2.4, but 3.2.3
> should be fine) and if just that doesn't help, add the
> "--invalid-backup" option.
>
> However, I very strongly suggest you try to resolve the problem which
> is causing your drives to fail.  Until you resolve that it will keep
> happening, and having it happen repeatedly during the (slow) reshape
> process would not be good.
>
> Maybe plug the drives into another computer, or another controller,
> while the reshape runs?
>
> NeilBrown
>
>

