Re: Fwd: Failed Raid6 Array.....want some guidance before attempting restart

I think at the moment I'm in leave it alone and let it run
mode.....it'll be done in about 6 hours anyway and I'm averse to
'tampering' with anything while I'm this exposed without any
resilience.

I meant to state in the earlier message that when the rebuild happens
next month I'll be installing Fedora 22 (and all the latest updates).
This is a high-demand server (not high load, but requiring high
availability), so once it's rebuilt and has been running stable for a
month it'll get 'locked down' with no changes for another couple of
years.





On 21 September 2015 at 08:57, Alexander Afonyashin
<a.afonyashin@xxxxxxxxxxxxxx> wrote:
> Hi,
>
> You may also try to increase the rebuild rate by echoing a higher
> minimum speed value:
>
> echo 100000 > /sys/block/mdX/md/sync_speed_min
>
> or via sysctl:
>
> sysctl -w dev.raid.speed_limit_min=100000
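>
> If it is still being throttled, the corresponding maximum may need
> raising too (same mdX placeholder as above), e.g.:
>
> echo 200000 > /sys/block/mdX/md/sync_speed_max
> sysctl -w dev.raid.speed_limit_max=200000
>
> The current rate can be watched in /proc/mdstat or in
> /sys/block/mdX/md/sync_speed.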
>
> Regards,
> Alexander
>
> On Mon, Sep 21, 2015 at 4:59 AM, Another Sillyname
> <anothersname@xxxxxxxxxxxxxx> wrote:
>> Ignore last...having thought about it for 10 minutes the obvious thing
>> to do is to add the drives back and allow the array to rebuild
>> offline......
>>
>> For the following reasons....
>>
>> 1.  e2fsck -f -n /dev/mdxx reports all the data appears intact and
>> that was what I believed anyway based on the information available to
>> me.
>>
>> 2.  To finish the backup will take 30+ hours, that's 30+ hours of risk
>> time where a single drive failure will compromise the data set.
>>
>> 3.  To 'add' the missing drives back into the array and allow the
>> rebuild will take about 10 hours (based on my previous experience
>> building this array), therefore the lower 'risk' course of action is
>> to rebuild the array, then and only then, to restart the backup.
>> There's over 20 hours less risk doing it this way.
>>
>> I realise I could do the two concurrently but I'd rather keep the
>> array 'destressed' as much as possible until I've got at least one
>> level of resilience restored.
>>
>> Having now added the drives back in as 'spares', mdstat is telling me
>> a little over 12 hours to do the rebuild, so it's finger-crossing time
>> then.
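>>
>> For the record the re-add itself was nothing exotic, just along the
>> lines of
>>
>> mdadm --add /dev/md127 /dev/sdX1
>>
>> for each of the two removed members (sdX1 standing in for the actual
>> partitions), and the 12-hour estimate is what cat /proc/mdstat reports
>> for the recovery.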
>>
>> Thanks for the help and advice....and most of all the confirmation my
>> approach was the correct one.
>>
>>
>>
>> On 21 September 2015 at 02:32, Another Sillyname
>> <anothersname@xxxxxxxxxxxxxx> wrote:
>>> OK
>>>
>>> The array has come back up...but showing two drives as missing.
>>>
>>> mdadm --query --detail /dev/md127
>>> /dev/md127:
>>>         Version : 1.2
>>>   Creation Time : Sun May 10 14:47:51 2015
>>>      Raid Level : raid6
>>>      Array Size : 29301952000 (27944.52 GiB 30005.20 GB)
>>>   Used Dev Size : 5860390400 (5588.90 GiB 6001.04 GB)
>>>    Raid Devices : 7
>>>   Total Devices : 5
>>>     Persistence : Superblock is persistent
>>>
>>>   Intent Bitmap : Internal
>>>
>>>     Update Time : Mon Sep 21 02:21:48 2015
>>>           State : active, degraded
>>>  Active Devices : 5
>>> Working Devices : 5
>>>  Failed Devices : 0
>>>   Spare Devices : 0
>>>
>>>          Layout : left-symmetric
>>>      Chunk Size : 512K
>>>
>>>            Name : arandomserver.arandomlan.com:1
>>>            UUID : da29a06f:f8cf1409:bc52afb2:6945ba08
>>>          Events : 285469
>>>
>>>     Number   Major   Minor   RaidDevice State
>>>        0       8       97        0      active sync   /dev/sdg1
>>>        1       8       49        1      active sync   /dev/sdd1
>>>        2       8       65        2      active sync   /dev/sde1
>>>        3       8       81        3      active sync   /dev/sdf1
>>>        8       0        0        8      removed
>>>       10       0        0       10      removed
>>>        6       8      129        6      active sync   /dev/sdi1
>>>
>>> Data appears to be intact (haven't done a full analysis yet).
>>>
>>> Does this mean I should add the 'missing' drives back into the array
>>> (one at a time, obviously)?
>>>
>>> Also, doesn't this mean I'm horribly exposed to any writes now, as
>>> they would move the five active members further out of 'sync' with
>>> the two removed ones, meaning any further short-term failure could
>>> smash the data set totally?
>>>
>>> I'm minded to stop any writes to the array in the short term and
>>> continue just doing the backup (this in itself will take about 30+
>>> hours).
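>>>
>>> If I do quiesce writes it'll likely just be a remount of the
>>> filesystem read-only, plus marking the array itself read-only if
>>> nothing still holds it open for writing (the mount point below is a
>>> placeholder):
>>>
>>> mount -o remount,ro /mnt/array
>>> mdadm --readonly /dev/md127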
>>>
>>> Ideas and observations?
>>>
>>>
>>>
>>> On 20 September 2015 at 10:54, Mikael Abrahamsson <swmike@xxxxxxxxx> wrote:
>>>> On Sun, 20 Sep 2015, Another Sillyname wrote:
>>>>
>>>>> Thanks
>>>>>
>>>>> Would you.....
>>>>>
>>>>> mdadm --assemble --force --scan
>>>>>
>>>>> or
>>>>>
>>>>> mdadm --assemble --force /dev/mdxx /dev/sd[c-i]1
>>>>
>>>>
>>>> This last one is what I use myself.
>>>>
>>>>
>>>> --
>>>> Mikael Abrahamsson    email: swmike@xxxxxxxxx


