Re: Fwd: Failed Raid6 Array.....want some guidance before attempting restart

Just to let you know the array seems to have rebuilt fully.

I'm just repopulating the live SQL databases and doing the backup,
then will start the full investigation into what happened and the
restore logs.
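
For anyone else who ends up in the same position, a rough sanity check
before putting the array back into live use might look like this
(/dev/md127 as in the --detail output further down the thread; the
'check' scrub is optional and can wait for a quiet period):

cat /proc/mdstat
mdadm --detail /dev/md127 | grep -E 'State|Devices'
echo check > /sys/block/md127/md/sync_action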

Thanks for all the help guys.

On 21 September 2015 at 09:09, Another Sillyname
<anothersname@xxxxxxxxxxxxxx> wrote:
> Current watch status
>
> Every 30.0s: cat /proc/mdstat                              Mon Sep 21 09:07:03 2015
>
> Personalities : [raid6] [raid5] [raid4]
> mdxxx : active raid6 sdh1[8] sdc1[7] sdg1[0] sdi1[6] sdf1[3] sde1[2] sdd1[1]
>       29301952000 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/5] [UUUU__U]
>       [==========>..........]  recovery = 53.4% (3133878924/5860390400) finish=358.2min speed=126856K/sec
>       bitmap: 6/44 pages [24KB], 65536KB chunk
>
> unused devices: <none>
>
> On 21 September 2015 at 09:05, Another Sillyname
> <anothersname@xxxxxxxxxxxxxx> wrote:
>> I think at the moment I'm in 'leave it alone and let it run'
>> mode.....it'll be done in about 6 hours anyway and I'm averse to
>> 'tampering' with anything while I'm this exposed without any
>> resilience.
>>
>> I meant to state in the earlier message when the rebuild happens next
>> month I'll be installing Fedora 22 (and all the latest updates).  This
>> is a high demand server (not high load but requiring high
>> availability) so once rebuilt and running stable for a month it'll get
>> 'locked down' without any changes for another couple of years.
>>
>>
>>
>>
>>
>> On 21 September 2015 at 08:57, Alexander Afonyashin
>> <a.afonyashin@xxxxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>> You may also try to increase the rebuild rate by echoing the min speed value:
>>>
>>> echo 100000 > /sys/block/mdX/md/sync_speed_min
>>>
>>> or via sysctl:
>>>
>>> sysctl -w dev.raid.speed_limit_min=100000
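>>>
>>> Note that the max limit caps the effective rate, so depending on the
>>> current settings it may need raising as well (the value here is just
>>> an example):
>>>
>>> sysctl -w dev.raid.speed_limit_max=200000
>>>
>>> or per-array:
>>>
>>> echo 200000 > /sys/block/mdX/md/sync_speed_max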
>>>
>>> Regards,
>>> Alexander
>>>
>>> On Mon, Sep 21, 2015 at 4:59 AM, Another Sillyname
>>> <anothersname@xxxxxxxxxxxxxx> wrote:
>>>> Ignore last...having thought about it for 10 minutes the obvious thing
>>>> to do is to add the drives back and allow the array to rebuild
>>>> offline......
>>>>
>>>> For the following reasons....
>>>>
>>>> 1.  e2fsck -f -n /dev/mdxx reports all the data appears intact and
>>>> that was what I believed anyway based on the information available to
>>>> me.
>>>>
>>>> 2.  To finish the backup will take 30+ hours, that's 30+ hours of risk
>>>> time where a single drive failure will compromise the data set.
>>>>
>>>> 3.  To 'add' the missing drives back into the array and allow the
>>>> rebuild will take about 10 hours (based on my previous experience
>>>> building this array), therefore the lower 'risk' course of action is
>>>> to rebuild the array, then and only then, to restart the backup.
>>>> There's over 20 hours less risk doing it this way.
>>>>
>>>> I realise I could do the two concurrently but I'd rather keep the
>>>> array 'destressed' as much as possible until I've got at least one
>>>> level of resilience restored.
>>>>
>>>> Having now added the drives back in as 'spares', mdstat is telling me
>>>> a little over 12 hours to do the rebuild, so it's finger-crossing
>>>> time now.
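>>>>
>>>> (I'm keeping an eye on the progress with something like:
>>>>
>>>> watch -n 30 cat /proc/mdstat
>>>>
>>>> from another terminal.)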
>>>>
>>>> Thanks for the help and advice....and most of all the confirmation my
>>>> approach was the correct one.
>>>>
>>>>
>>>>
>>>> On 21 September 2015 at 02:32, Another Sillyname
>>>> <anothersname@xxxxxxxxxxxxxx> wrote:
>>>>> OK
>>>>>
>>>>> The array has come back up...but showing two drives as missing.
>>>>>
>>>>> mdadm --query --detail /dev/md127
>>>>> /dev/md127:
>>>>>         Version : 1.2
>>>>>   Creation Time : Sun May 10 14:47:51 2015
>>>>>      Raid Level : raid6
>>>>>      Array Size : 29301952000 (27944.52 GiB 30005.20 GB)
>>>>>   Used Dev Size : 5860390400 (5588.90 GiB 6001.04 GB)
>>>>>    Raid Devices : 7
>>>>>   Total Devices : 5
>>>>>     Persistence : Superblock is persistent
>>>>>
>>>>>   Intent Bitmap : Internal
>>>>>
>>>>>     Update Time : Mon Sep 21 02:21:48 2015
>>>>>           State : active, degraded
>>>>>  Active Devices : 5
>>>>> Working Devices : 5
>>>>>  Failed Devices : 0
>>>>>   Spare Devices : 0
>>>>>
>>>>>          Layout : left-symmetric
>>>>>      Chunk Size : 512K
>>>>>
>>>>>            Name : arandomserver.arandomlan.com:1
>>>>>            UUID : da29a06f:f8cf1409:bc52afb2:6945ba08
>>>>>          Events : 285469
>>>>>
>>>>>     Number   Major   Minor   RaidDevice State
>>>>>        0       8       97        0      active sync   /dev/sdg1
>>>>>        1       8       49        1      active sync   /dev/sdd1
>>>>>        2       8       65        2      active sync   /dev/sde1
>>>>>        3       8       81        3      active sync   /dev/sdf1
>>>>>        8       0        0        8      removed
>>>>>       10       0        0       10      removed
>>>>>        6       8      129        6      active sync   /dev/sdi1
>>>>>
>>>>> Data appears to be intact (haven't done a full analysis yet).
>>>>>
>>>>> Does this mean I should add the 'missing' drives back into the array
>>>>> (one at a time, obviously)?
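>>>>>
>>>>> (I'm assuming that would look roughly like the following, with sdc1
>>>>> and sdh1 being the two that dropped out; with the internal bitmap a
>>>>> --re-add should only resync what changed since they dropped, whereas
>>>>> a plain --add means a full rebuild onto them as spares:
>>>>>
>>>>> mdadm /dev/md127 --re-add /dev/sdc1
>>>>> mdadm /dev/md127 --re-add /dev/sdh1
>>>>>
>>>>> falling back to --add if --re-add is refused.)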
>>>>>
>>>>> Also, doesn't this mean I'm horribly exposed to any writes now, as
>>>>> they would push the 5 remaining drives further out of 'sync' with the
>>>>> 2 removed ones, meaning any further short-term failure could smash
>>>>> the data set totally?
>>>>>
>>>>> I'm minded to stop any writes to the array in the short term and
>>>>> continue just doing the backup (this in itself will take about 30+
>>>>> hours).
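>>>>>
>>>>> (Most likely by just remounting the filesystem read-only for the
>>>>> duration, e.g. something along these lines, with a made-up mount
>>>>> point:
>>>>>
>>>>> mount -o remount,ro /srv/array
>>>>>
>>>>> and remounting rw once there's at least one level of resilience
>>>>> back.)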
>>>>>
>>>>> Ideas and observations?
>>>>>
>>>>>
>>>>>
>>>>> On 20 September 2015 at 10:54, Mikael Abrahamsson <swmike@xxxxxxxxx> wrote:
>>>>>> On Sun, 20 Sep 2015, Another Sillyname wrote:
>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Would you.....
>>>>>>>
>>>>>>> mdadm --assemble --force --scan
>>>>>>>
>>>>>>> or
>>>>>>>
>>>>>>> mdadm --assemble --force /dev/mdxx /dev/sd[c-i]1
>>>>>>
>>>>>>
>>>>>> This last one is what I use myself.
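>>>>>>
>>>>>> (Before forcing an assemble it can also be worth comparing the event
>>>>>> counts on the members to see how far apart they are, e.g.:
>>>>>>
>>>>>> mdadm --examine /dev/sd[c-i]1 | grep -E '/dev/sd|Events'
>>>>>>
>>>>>> --force then lets slightly-stale members be accepted despite the
>>>>>> lower event count.)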
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Mikael Abrahamsson    email: swmike@xxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


