Re: Failed drive while converting raid5 to raid6, then a hard reboot

Never mind, I got it back up and running using --force.
I'm stopping the array, halting the server, and going to image all the disks first.
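
Roughly what I'm about to do (sdX and the backup paths are placeholders,
and ddrescue is just one way to image a drive):

  mdadm --stop /dev/md0                                             # stop the array
  ddrescue -f -n /dev/sdX /mnt/backup/sdX.img /mnt/backup/sdX.map   # once per disk
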
--
Hákon G.


On 9 May 2012 00:20, Hákon Gíslason <hakon.gislason@xxxxxxxxx> wrote:
> Hi again. I thought the drives would last long enough to complete the
> reshape, so I assembled the array and it started reshaping. I went for a
> shower and came back to this: http://pastebin.ubuntu.com/976993/
>
> The logs show the same as when the other drives failed:
> May  8 23:58:26 axiom kernel: ata4: hard resetting link
> May  8 23:58:32 axiom kernel: ata4: link is slow to respond, please be
> patient (ready=0)
> May  8 23:58:37 axiom kernel: ata4: hard resetting link
> May  8 23:58:42 axiom kernel: ata4: link is slow to respond, please be
> patient (ready=0)
> May  8 23:58:47 axiom kernel: ata4: hard resetting link
> May  8 23:58:52 axiom kernel: ata4: link is slow to respond, please be
> patient (ready=0)
> May  8 23:59:22 axiom kernel: ata4: limiting SATA link speed to 1.5 Gbps
> May  8 23:59:22 axiom kernel: ata4: hard resetting link
> May  8 23:59:27 axiom kernel: ata4.00: disabled
> May  8 23:59:27 axiom kernel: ata4: EH complete
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] Unhandled error code
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] CDB: Write(10): 2a 00
> 00 00 00 08 00 00 02 00
> May  8 23:59:27 axiom kernel: md: super_written gets error=-5, uptodate=0
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] Unhandled error code
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> May  8 23:59:27 axiom kernel: sd 3:0:0:0: [sdd] CDB: Read(10): 28 00
> 0a 9d cb 00 00 00 40 00
> May  8 23:59:27 axiom kernel: md: md0: reshape done.
>
> What course of action do you suggest I take now?
>
> --
> Hákon G.
>
>
> On 8 May 2012 23:55, Hákon Gíslason <hakon.gislason@xxxxxxxxx> wrote:
>> Thank you very much!
>> It's currently rebuilding; I'll attempt to mount the volume once the
>> rebuild completes. But before that, I'm going to image all the disks to
>> my friend's array, just to be safe. After that, I'll back up everything.
>> Again, thank you for your help!
>> --
>> Hákon G.
>>
>>
>> On 8 May 2012 23:21, NeilBrown <neilb@xxxxxxx> wrote:
>>> On Tue, 8 May 2012 22:19:49 +0000 Hákon Gíslason <hakon.gislason@xxxxxxxxx>
>>> wrote:
>>>
>>>> Thank you for the reply, Neil.
>>>> I was using mdadm from the Debian stable package manager at first
>>>> (v3.1.4), but after the constant drive failures I upgraded to the
>>>> latest version (3.2.3).
>>>> I've come to the conclusion that the drives are failing either because
>>>> they are "green" drives with power-saving features that cause them to
>>>> be "disconnected", or because the cables that came with the motherboard
>>>> aren't good enough. I'm not 100% sure about either, but at the moment
>>>> these seem the most likely causes. It could also be incompatible
>>>> hardware or the kernel I'm using (Proxmox Debian kernel:
>>>> 2.6.32-11-pve).
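>>>>
>>>> (If it is the power-saving, I'm guessing something along these lines
>>>> might help, assuming the drives actually accept these commands; I
>>>> haven't verified either yet:
>>>>   smartctl -l scterc,70,70 /dev/sdX   # 7-second error-recovery timeout
>>>>   hdparm -B 255 /dev/sdX              # disable APM / head parking
>>>> )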
>>>>
>>>> I got the array assembled (thank you), but what about the raid5 to
>>>> raid6 conversion? Do I have to complete it for this to work, or will
>>>> mdadm know what to do? Can I cancel (revert) the conversion and get
>>>> the array back to raid5?
>>>>
>>>> /proc/mdstat contains:
>>>>
>>>> root@axiom:~# cat /proc/mdstat
>>>> Personalities : [raid6] [raid5] [raid4]
>>>> md0 : active (read-only) raid6 sdc[6] sdb[5] sda[4] sdd[7]
>>>>       5860540224 blocks super 1.2 level 6, 32k chunk, algorithm 18 [5/3] [_UUU_]
>>>>
>>>> unused devices: <none>
>>>>
>>>> If I try to mount the volume group on the array the kernel panics, and
>>>> the system hangs. Is that related to the incomplete conversion?
>>>
>>> The array should be part way through the conversion.  If you
>>>   mdadm -E /dev/sda
>>> it should report something like "Reshape Position : XXXX" indicating
>>> how far along it is.
>>> The reshape will not restart while the array is read-only.  Once you make it
>>> writable, it will automatically resume the reshape from where it left off.
>>>
>>> The kernel panic is because the array is read-only and the filesystem tries
>>> to write to it.  I think that is fixed in more recent kernels (i.e. ext4
>>> refuses to mount rather than trying and crashing).
>>>
>>> So you should just be able to "mdadm --read-write /dev/md0" to make the array
>>> writable, and then continue using it ... until another device fails.
>>>
>>> Reverting the reshape is not currently possible.  Maybe it will be with Linux
>>> 3.5 and mdadm-3.3, but that is all months away.
>>>
>>> I would recommend an "fsck -n /dev/md0" first and if that seems mostly OK,
>>> and if "mdadm -E /dev/sda" reports the "Reshape Position" as expected, then
>>> make the array read-write, mount it, and backup any important data.
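>>>
>>> i.e. something along these lines (the mount point is just an example):
>>>
>>>   fsck -n /dev/md0                      # read-only check, changes nothing
>>>   mdadm -E /dev/sda | grep -i reshape   # confirm "Reshape Position"
>>>   mdadm --read-write /dev/md0           # reshape resumes automatically
>>>   mount /dev/md0 /mnt                   # then copy the important data off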
>>>
>>> NeilBrown
>>>
>>>
>>>>
>>>> Thanks,
>>>> --
>>>> Hákon G.
>>>>
>>>>
>>>>
>>>> On 8 May 2012 20:48, NeilBrown <neilb@xxxxxxx> wrote:
>>>> >
>>>> > On Mon, 30 Apr 2012 13:59:56 +0000 Hákon Gíslason
>>>> > <hakon.gislason@xxxxxxxxx> wrote:
>>>> >
>>>> > > Hello,
>>>> > > I've been having frequent drive "failures": drives are reported
>>>> > > failed/bad and mdadm sends me an email telling me things went wrong,
>>>> > > but after a reboot or two they are perfectly fine again. I'm not sure
>>>> > > what the cause is, but this server is quite new and I think there
>>>> > > might be more behind it, perhaps bad memory or the motherboard (I've
>>>> > > been having other issues as well). I've had four drive "failures"
>>>> > > this month, all different drives except one, which "failed" twice,
>>>> > > and all were fixed with a reboot or rebuild (every drive reported
>>>> > > bad by mdadm passed an extensive SMART test).
>>>> > > Due to this, I decided to convert my raid5 array to a raid6 array
>>>> > > while I find the root cause of the problem.
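>>>> > > (I don't remember the exact command, but the conversion would have
>>>> > > been started with something like:
>>>> > >   mdadm --grow /dev/md0 --level=6 --raid-devices=5 \
>>>> > >         --backup-file=/root/mdadm-backup-file
>>>> > > after adding the fifth disk with "mdadm --add".)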
>>>> > >
>>>> > > I started the conversion right after a drive failure & rebuild, but
>>>> > > when it had converted/reshaped approx. 4% (if I remember correctly;
>>>> > > it was going really slowly, ~7500 minutes to completion), it reported
>>>> > > another drive bad and the conversion to raid6 stopped (it said
>>>> > > "rebuilding", but the speed was 0K/sec and the time left was a few
>>>> > > million minutes).
>>>> > > After that happened, I tried to stop the array and reboot the server,
>>>> > > as I had done previously to get a reportedly "bad" drive working
>>>> > > again, but it wouldn't stop the array or reboot, nor could I unmount
>>>> > > it; it just hung whenever I tried to do anything with /dev/md0. After
>>>> > > trying to reboot a few times, I just killed the power and restarted
>>>> > > it. Admittedly this was probably not the best thing I could have done
>>>> > > at that point.
>>>> > >
>>>> > > I have a backup of about 80% of the data on there, but it's been a
>>>> > > month since the last complete backup (I ran out of backup disk
>>>> > > space).
>>>> > >
>>>> > > So, the big question, can the array be activated, and can it complete
>>>> > > the conversion to raid6? And will I get my data back?
>>>> > > I hope the data can be rescued, and any help I can get would be much
>>>> > > appreciated!
>>>> > >
>>>> > > I'm fairly new to RAID in general, and have been using mdadm for
>>>> > > about a month now.
>>>> > > Here's some data:
>>>> > >
>>>> > > root@axiom:~# mdadm --examine --scan
>>>> > > ARRAY /dev/md/0 metadata=1.2 UUID=cfedbfc1:feaee982:4e92ccf4:45e08ed1
>>>> > > name=axiom.is:0
>>>> > >
>>>> > >
>>>> > > root@axiom:~# cat /proc/mdstat
>>>> > > Personalities : [raid6] [raid5] [raid4]
>>>> > > md0 : inactive sdc[6] sde[7] sdb[5] sda[4]
>>>> > >       7814054240 blocks super 1.2
>>>> > >
>>>> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
>>>> > > mdadm: /dev/md0 is already in use.
>>>> > >
>>>> > > root@axiom:~# mdadm --stop /dev/md0
>>>> > > mdadm: stopped /dev/md0
>>>> > >
>>>> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
>>>> > > mdadm: Failed to restore critical section for reshape, sorry.
>>>> > >       Possibly you needed to specify the --backup-file
>>>> > >
>>>> > > root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
>>>> > > --backup-file=/root/mdadm-backup-file
>>>> > > mdadm: Failed to restore critical section for reshape, sorry.
>>>> >
>>>> > What version of mdadm are you using?
>>>> >
>>>> > I suggest getting a newer one (I'm about to release 3.2.4, but 3.2.3
>>>> > should be fine) and if just that doesn't help, add the
>>>> > "--invalid-backup" option.
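>>>> >
>>>> > That would be your earlier assemble command with the extra option,
>>>> > something like:
>>>> >
>>>> >   mdadm --stop /dev/md0
>>>> >   mdadm --assemble --scan --force --run /dev/md0 \
>>>> >         --backup-file=/root/mdadm-backup-file --invalid-backup
>>>> >
>>>> > (--invalid-backup tells mdadm it is OK to proceed even though the
>>>> > contents of the backup file cannot be trusted.)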
>>>> >
>>>> > However, I very strongly suggest you try to resolve the problem that
>>>> > is causing your drives to fail.  Until you resolve that it will keep
>>>> > happening, and having it happen repeatedly during the (slow) reshape
>>>> > process would not be good.
>>>> >
>>>> > Maybe plug the drives into another computer, or another controller,
>>>> > while the reshape runs?
>>>> >
>>>> > NeilBrown
>>>> >
>>>> >
>>>