Re: Fwd: Failed Raid6 Array.....want some guidance before attempting restart

Current watch status

Every 30.0s: cat /proc/mdstat                           Mon Sep 21 09:07:03 2015

Personalities : [raid6] [raid5] [raid4]
mdxxx : active raid6 sdh1[8] sdc1[7] sdg1[0] sdi1[6] sdf1[3] sde1[2] sdd1[1]
      29301952000 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/5] [UUUU__U]
      [==========>..........]  recovery = 53.4% (3133878924/5860390400) finish=358.2min speed=126856K/sec
      bitmap: 6/44 pages [24KB], 65536KB chunk

unused devices: <none>
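
The same numbers are also visible in sysfs if anyone prefers that to
watch; a rough sketch (md127 as per the --detail output further down,
not the anonymised mdxxx above):

cat /sys/block/md127/md/sync_action
cat /sys/block/md127/md/sync_completed
cat /sys/block/md127/md/sync_speed

sync_action should report 'recover' while the re-added spares are being
rebuilt, and sync_completed shows how far through the rebuild it is as
done/total.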

On 21 September 2015 at 09:05, Another Sillyname
<anothersname@xxxxxxxxxxxxxx> wrote:
> I think at the moment I'm in leave it alone and let it run
> mode.....it'll be done in about 6 hours anyway and I'm averse to
> 'tampering' with anything while I'm this exposed without any
> resilience.
>
> I meant to state in the earlier message when the rebuild happens next
> month I'll be installing Fedora 22 (and all the latest updates).  This
> is a high demand server (not high load but requiring high
> availability) so once rebuilt and running stable for a month it'll get
> 'locked down' without any changes for another couple of years.
>
>
>
>
>
> On 21 September 2015 at 08:57, Alexander Afonyashin
> <a.afonyashin@xxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> You may also try to increase rebuild rate by echo-ing min speed value:
>>
>> echo 100000 > /sys/block/mdX/md/sync_speed_min
>>
>> or via sysctl:
>>
>> sysctl -w dev.raid.speed_limit_min=100000
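>>
>> If the array is otherwise idle it is usually the max limit that caps
>> the resync rather than the min, so it may also be worth raising that
>> too - the 500000 below is only an example figure, in KB/sec:
>>
>> sysctl -w dev.raid.speed_limit_max=500000
>>
>> or per-array via /sys/block/mdX/md/sync_speed_max.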
>>
>> Regards,
>> Alexander
>>
>> On Mon, Sep 21, 2015 at 4:59 AM, Another Sillyname
>> <anothersname@xxxxxxxxxxxxxx> wrote:
>>> Ignore last...having thought about it for 10 minutes the obvious thing
>>> to do is to add the drives back and allow the array to rebuild
>>> offline......
>>>
>>> For the following reasons....
>>>
>>> 1.  e2fsck -f -n /dev/mdxx reports all the data appears intact and
>>> that was what I believed anyway based on the information available to
>>> me.
>>>
>>> 2.  To finish the backup will take 30+ hours, that's 30+ hours of risk
>>> time where a single drive failure will compromise the data set.
>>>
>>> 3.  To 'add' the missing drives back into the array and allow the
>>> rebuild will take about 10 hours (based on my previous experience
>>> building this array), therefore the lower 'risk' course of action is
>>> to rebuild the array, then and only then, to restart the backup.
>>> There's over 20 hours less risk doing it this way.
>>>
>>> I realise I could do the two concurrently but I'd rather keep the
>>> array 'destressed' as much as possible until I've got at least one
>>> level of resilience restored.
>>>
>>> Having now added the drives back in as 'spares' mdstat is telling me a
>>> little over 12 hours to do the rebuild so it's now finger-crossing
>>> time then.
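>>>
>>> For the record, the 'add' was just a plain mdadm --add of each of the
>>> two partitions that had dropped out, something along these lines
>>> (device names are whatever the two missing disks came back as, so
>>> treat them as placeholders):
>>>
>>> mdadm /dev/md127 --add /dev/sdh1
>>> mdadm /dev/md127 --add /dev/sdc1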
>>>
>>> Thanks for the help and advice....and most of all the confirmation my
>>> approach was the correct one.
>>>
>>>
>>>
>>> On 21 September 2015 at 02:32, Another Sillyname
>>> <anothersname@xxxxxxxxxxxxxx> wrote:
>>>> OK
>>>>
>>>> The array has come back up...but showing two drives as missing.
>>>>
>>>> mdadm --query --detail /dev/md127
>>>> /dev/md127:
>>>>         Version : 1.2
>>>>   Creation Time : Sun May 10 14:47:51 2015
>>>>      Raid Level : raid6
>>>>      Array Size : 29301952000 (27944.52 GiB 30005.20 GB)
>>>>   Used Dev Size : 5860390400 (5588.90 GiB 6001.04 GB)
>>>>    Raid Devices : 7
>>>>   Total Devices : 5
>>>>     Persistence : Superblock is persistent
>>>>
>>>>   Intent Bitmap : Internal
>>>>
>>>>     Update Time : Mon Sep 21 02:21:48 2015
>>>>           State : active, degraded
>>>>  Active Devices : 5
>>>> Working Devices : 5
>>>>  Failed Devices : 0
>>>>   Spare Devices : 0
>>>>
>>>>          Layout : left-symmetric
>>>>      Chunk Size : 512K
>>>>
>>>>            Name : arandomserver.arandomlan.com:1
>>>>            UUID : da29a06f:f8cf1409:bc52afb2:6945ba08
>>>>          Events : 285469
>>>>
>>>>     Number   Major   Minor   RaidDevice State
>>>>        0       8       97        0      active sync   /dev/sdg1
>>>>        1       8       49        1      active sync   /dev/sdd1
>>>>        2       8       65        2      active sync   /dev/sde1
>>>>        3       8       81        3      active sync   /dev/sdf1
>>>>        8       0        0        8      removed
>>>>       10       0        0       10      removed
>>>>        6       8      129        6      active sync   /dev/sdi1
>>>>
>>>> Data appears to be intact (haven't done a full analysis yet).
>>>>
>>>> Does this mean I should add the 'missing' drives back into the array
>>>> (one at a time, obviously)?
>>>>
>>>> Also, doesn't this mean I'm horribly exposed to any writes now? Writes
>>>> would push the current 5 active drives further out of 'sync' with the
>>>> two removed ones, so any further short-term failure could smash the
>>>> data set completely.
>>>>
>>>> I'm minded to stop any writes to the array in the short term and
>>>> continue just doing the backup (this in itself will take about 30+
>>>> hours).
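>>>>
>>>> Stopping writes would presumably just be a read-only remount of the
>>>> filesystem sitting on the array, something like the below (the mount
>>>> point is a placeholder, and it only works if nothing has files open
>>>> for writing):
>>>>
>>>> mount -o remount,ro /path/to/array/mountpoint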
>>>>
>>>> Ideas and observations?
>>>>
>>>>
>>>>
>>>> On 20 September 2015 at 10:54, Mikael Abrahamsson <swmike@xxxxxxxxx> wrote:
>>>>> On Sun, 20 Sep 2015, Another Sillyname wrote:
>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Would you.....
>>>>>>
>>>>>> mdadm --assemble --force --scan
>>>>>>
>>>>>> or
>>>>>>
>>>>>> mdadm --assemble --force /dev/mdxx /dev/sd[c-i]1
>>>>>
>>>>>
>>>>> This last one is what I use myself.
>>>>>
>>>>>
>>>>> --
>>>>> Mikael Abrahamsson    email: swmike@xxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


