RE: can i recover an all spare raid10 array ?

----------------------------------------
> Date: Tue, 28 Oct 2014 20:02:39 +0000
> From: robin@xxxxxxxxxxxxxxx
> To: r_o_l_a_n_d@xxxxxxxxxxx
> CC: linux-raid@xxxxxxxxxxxxxxx
> Subject: Re: can i recover an all spare raid10 array ?
>
> On Tue Oct 28, 2014 at 09:11:21PM +0200, Roland RoLaNd wrote:
>
>>
>>
>> ----------------------------------------
>>> Date: Tue, 28 Oct 2014 18:34:22 +0000
>>> From: robin@xxxxxxxxxxxxxxx
>>> To: r_o_l_a_n_d@xxxxxxxxxxx
>>> CC: robin@xxxxxxxxxxxxxxx; linux-raid@xxxxxxxxxxxxxxx
>>> Subject: Re: can i recover an all spare raid10 array ?
>>>
>>> Please don't top post, it makes conversations very difficult to follow.
>>> Responses should go at the bottom, or interleaved with the previous post
>>> if responding to particular points. I've moved your previous responses
>>> to keep the conversation flow straight.
>>>
>>> On Tue Oct 28, 2014 at 07:30:50PM +0200, Roland RoLaNd wrote:
>>>>
>>>>> From: r_o_l_a_n_d@xxxxxxxxxxx
>>>>> To: robin@xxxxxxxxxxxxxxx
>>>>> CC: linux-raid@xxxxxxxxxxxxxxx
>>>>> Subject: Re: can i recover an all spare raid10 array ?
>>>>> Date: Tue, 28 Oct 2014 19:29:25 +0200
>>>>>
>>>>>> Date: Tue, 28 Oct 2014 17:01:11 +0000
>>>>>> From: robin@xxxxxxxxxxxxxxx
>>>>>> To: r_o_l_a_n_d@xxxxxxxxxxx
>>>>>> CC: linux-raid@xxxxxxxxxxxxxxx
>>>>>> Subject: Re: can i recover an all spare raid10 array ?
>>>>>>
>>>>>> On Tue Oct 28, 2014 at 06:22:11PM +0200, Roland RoLaNd wrote:
>>>>>>
>>>>>>> I have two raid arrays on my system:
>>>>>>> raid1: /dev/sdd1 /dev/sdh1
>>>>>>> raid10: /dev/sde1 /dev/sda1 /dev/sdf1 /dev/sdb1 /dev/sdc1 /dev/sdg1
>>>>>>>
>>>>>>>
>>>>>>> Two disks had bad sectors: sdd and sdf <<-- they both got hot swapped.
>>>>>>> I added sdf back to the raid10 and recovery took place, but adding sdd1 to
>>>>>>> the raid1 proved to be troublesome.
>>>>>>> As I didn't have anything important on '/', I formatted and installed
>>>>>>> Ubuntu 14 on the raid1.
>>>>>>>
>>>>>>> Now the system is up on the raid1, but the raid10 (md127) is inactive:
>>>>>>>
>>>>>>> cat /proc/mdstat
>>>>>>>
>>>>>>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>>>>>>> md127 : inactive sde1[2](S) sdg1[8](S) sdc1[6](S) sdb1[5](S) sdf1[4](S) sda1[3](S)
>>>>>>> 17580804096 blocks super 1.2
>>>>>>>
>>>>>>> md2 : active raid1 sdh4[0] sdd4[1]
>>>>>>> 2921839424 blocks super 1.2 [2/2] [UU]
>>>>>>> [==>..................] resync = 10.4% (304322368/2921839424) finish=672.5min speed=64861K/sec
>>>>>>>
>>>>>>> md1 : active raid1 sdh3[0] sdd3[1]
>>>>>>> 7996352 blocks super 1.2 [2/2] [UU]
>>>>>>>
>>>>>>> md0 : active raid1 sdh2[0] sdd2[1]
>>>>>>> 292544 blocks super 1.2 [2/2] [UU]
>>>>>>>
>>>>>>> unused devices: <none>
>>>>>>> If I try to assemble md127:
>>>>>>>
>>>>>>>
>>>>>>> mdadm --assemble /dev/md127 /dev/sde1 /dev/sda1 /dev/sdf1 /dev/sdb1 /dev/sdc1 /dev/sdg1
>>>>>>> mdadm: /dev/sde1 is busy - skipping
>>>>>>> mdadm: /dev/sda1 is busy - skipping
>>>>>>> mdadm: /dev/sdf1 is busy - skipping
>>>>>>> mdadm: /dev/sdb1 is busy - skipping
>>>>>>> mdadm: /dev/sdc1 is busy - skipping
>>>>>>> mdadm: /dev/sdg1 is busy - skipping
>>>>>>>
>>>>>>>
>>>>>>> If I try to add one of the disks: mdadm --add /dev/md127 /dev/sdj1
>>>>>>> mdadm: cannot get array info for /dev/md127
>>>>>>>
>>>>>>> If I try:
>>>>>>>
>>>>>>> mdadm --stop /dev/md127
>>>>>>> mdadm: stopped /dev/md127
>>>>>>>
>>>>>>> then running: mdadm --assemble /dev/md127 /dev/sde1 /dev/sda1 /dev/sdf1 /dev/sdb1 /dev/sdc1 /dev/sdg1
>>>>>>>
>>>>>>> returns:
>>>>>>>
>>>>>>> assembled from 5 drives and 1 rebuilding - not enough to start the array
>>>>>>>
>>>>>>> What does it mean? Is my data lost?
>>>>>>>
>>>>>>> If I examine one of the md127 raid10 array disks, it shows this:
>>>>>>>
>>>>>>> mdadm --examine /dev/sde1
>>>>>>> /dev/sde1:
>>>>>>> Magic : a92b4efc
>>>>>>> Version : 1.2
>>>>>>> Feature Map : 0x0
>>>>>>> Array UUID : ab90d4c8:41a55e14:635025cc:28f0ee76
>>>>>>> Name : ubuntu:data (local to host ubuntu)
>>>>>>> Creation Time : Sat May 10 21:54:56 2014
>>>>>>> Raid Level : raid10
>>>>>>> Raid Devices : 8
>>>>>>>
>>>>>>> Avail Dev Size : 5860268032 (2794.39 GiB 3000.46 GB)
>>>>>>> Array Size : 11720534016 (11177.57 GiB 12001.83 GB)
>>>>>>> Used Dev Size : 5860267008 (2794.39 GiB 3000.46 GB)
>>>>>>> Data Offset : 262144 sectors
>>>>>>> Super Offset : 8 sectors
>>>>>>> State : clean
>>>>>>> Device UUID : a2a5db61:bd79f0ae:99d97f17:21c4a619
>>>>>>>
>>>>>>> Update Time : Tue Oct 28 10:07:18 2014
>>>>>>> Checksum : 409deeb4 - correct
>>>>>>> Events : 8655
>>>>>>>
>>>>>>> Layout : near=2
>>>>>>> Chunk Size : 512K
>>>>>>>
>>>>>>> Device Role : Active device 2
>>>>>>> Array State : AAAAAAAA ('A' == active, '.' == missing)
>>>>>>>
>>>>>>> Used Dev Size : 5860267008 (2794.39 GiB 3000.46 GB) <<--- does this mean I still have my data?
>>>>>>>
>>>>>>>
>>>>>>> the remaining two disks:
>>>>>>>
>>>>>>> mdadm --examine /dev/sdj1
>>>>>>> mdadm: No md superblock detected on /dev/sdj1.
>>>>>>> mdadm --examine /dev/sdi1
>>>>>>> mdadm: No md superblock detected on /dev/sdi1.
>>>>>>
>>>>>> The --examine output indicates the RAID10 array was 8 members, not 6.
>>>>>> As it stands, you are missing two array members (presumably a mirrored
>>>>>> pair as mdadm won't start the array). Without these you're missing 512K
>>>>>> of every 2M in the array, so your data is toast (well, with a lot of
>>>>>> effort you may recover some files under 1.5M in size).
>>>>>>
>>>>>> Were you expecting sdi1 and sdj1 to have been part of the original
>>>>>> RAID10 array? Have you removed the superblocks from them at any point?
>>>>>> For completeness, what mdadm and kernel versions are you running?
>>>>>>
>>>>>> Cheers,
>>>>>> Robin
>>>>>
>>>>> Thanks for pitching in. Here are the responses to your questions:
>>>>>
>>>>> - Yes, I expected both of them to be part of the array, though one of
>>>>> them was only just added and hadn't finished recovering when the
>>>>> raid1 "/" crashed.
>>>>>
>>> According to your --examine earlier, the RAID10 rebuild had completed
>>> (it shows the array clean and having all disks active). Are you certain
>>> that the new RAID1 array isn't using disks that used to be part of the
>>> RAID10 array? Regardless, I'd expect the disks to have a superblock if
>>> they were part of either array (unless they've been repartitioned?).
>>>
>>
>> The --examine earlier was of one of the 6 disks that belong to the currently inactive array; they're all clean.
>> As for the raid1 reusing raid10 disks, that's what I thought too, as it has happened to me before, but lsblk shows the following:
>>
>> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
>> sda 8:0 0 2.7T 0 disk
>> └─sda1 8:1 0 2.7T 0 part
>> sdb 8:16 0 2.7T 0 disk
>> └─sdb1 8:17 0 2.7T 0 part
>> sdc 8:32 0 2.7T 0 disk
>> └─sdc1 8:33 0 2.7T 0 part
>> sdd 8:48 0 2.7T 0 disk
>> ├─sdd1 8:49 0 1M 0 part
>> ├─sdd2 8:50 0 286M 0 part
>> │ └─md0 9:0 0 285.7M 0 raid1 /boot
>> ├─sdd3 8:51 0 7.6G 0 part
>> │ └─md1 9:1 0 7.6G 0 raid1 [SWAP]
>> └─sdd4 8:52 0 2.7T 0 part
>> └─md2 9:2 0 2.7T 0 raid1 /
>> sde 8:64 0 2.7T 0 disk
>> └─sde1 8:65 0 2.7T 0 part
>> sdf 8:80 0 2.7T 0 disk
>> └─sdf1 8:81 0 2.7T 0 part
>> sdg 8:96 0 2.7T 0 disk
>> └─sdg1 8:97 0 2.7T 0 part
>> sdh 8:112 0 2.7T 0 disk
>> ├─sdh1 8:113 0 1M 0 part
>> ├─sdh2 8:114 0 286M 0 part
>> │ └─md0 9:0 0 285.7M 0 raid1 /boot
>> ├─sdh3 8:115 0 7.6G 0 part
>> │ └─md1 9:1 0 7.6G 0 raid1 [SWAP]
>> └─sdh4 8:116 0 2.7T 0 part
>> └─md2 9:2 0 2.7T 0 raid1 /
>> sdi 8:128 0 2.7T 0 disk
>> └─sdi1 8:129 0 2.7T 0 part
>> sdj 8:144 0 2.7T 0 disk
>> └─sdj1 8:145 0 2.7T 0 part
>>
>>
>>>>> - I have not removed their superblocks, or at least not in a way that I
>>>>> am aware of.
>>>>>
>>>>> - mdadm: 3.2.5-5ubuntu4.1
>>>>> - uname -a: 3.13.0-24-generic
>>>>>
>>> That's a pretty old mdadm version, but I don't see anything in the
>>> change logs that looks relevant. Others may be more familiar with issues
>>> though.
>>
>> That's the latest in my current Ubuntu repository.
>>
>>>
>>>>>
>>>>> PS:
>>>>> I just followed this recovery page:
>>>>> https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID
>>>>> I managed to reach the last step, but whenever I tried to mount, it kept
>>>>> asking me for the right filesystem.
>>>>>
>>> That's good documentation anyway. As long as you stick to the overlay
>>> devices your original data is untouched. It's amazing how many people
>>> run --create on their original disks and lose any chance of getting the
>>> data back.
>>
>> Unfortunately I used to be / still am one of those people.
>> I've had bad experiences with this before, so I took it slow and went with
>> the overlay documentation.
>> All the ebooks I could find about RAID talk about the differences between
>> the various RAID levels, but none are thorough when it comes to setting
>> up or troubleshooting RAID.
>> And once I do fix my issue, I move on to the next firefighting
>> situation, so I lose interest due to lack of time.
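
For anyone hitting this thread later: the overlay trick on that wiki page boils down to a device-mapper snapshot backed by a sparse file, roughly like this per member (the file and mapper names here are just illustrative):

truncate -s 4G /tmp/overlay-sda1                 # sparse copy-on-write file
loop=$(losetup -f --show /tmp/overlay-sda1)      # attach it to a free loop device
size=$(blockdev --getsz /dev/sda1)               # partition size in 512-byte sectors
echo "0 $size snapshot /dev/sda1 $loop P 8" | dmsetup create overlay-sda1
# experiment on /dev/mapper/overlay-sda1; the real /dev/sda1 is never written to.
# Tear down with "dmsetup remove overlay-sda1" and "losetup -d $loop" when finished.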
>>
>>>
>>>> Correction: I couldn't force-assemble the devices, so I issued this instead:
>>>> mdadm --create /dev/md089 --assume-clean --level=10 --verbose --raid-devices=8 missing /dev/dm-1 /dev/dm-0 /dev/dm-5 /dev/dm-3 /dev/dm-2 missing /dev/dm-4
>>>> which got it into a degraded state.
>>>>
>>>
>>> What error did you get when you tried to force assemble (both from mdadm
>>> and anything reported via dmesg)? The device order you're using would
>>> suggest that the missing disks wouldn't be mirrors of each other, so the
>>> data should be okay.
>>
>> mdadm --assemble --force /dev/md100 $OVERLAYS
>> mdadm: /dev/md100 assembled from 5 drives and 1 rebuilding - not enough to start the array.
>>
> That's very odd - all the --examine results for the original disks show
> the array as clean. That would suggest an issue with the installed
> version of mdadm but it doesn't really matter in this case - see below.

When I got Ubuntu 14 installed, I issued apt-get update && apt-get upgrade -y.
Would that have affected anything?
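
For reference, in case the upgrade pulled in something different from what I reported earlier, this is how I'm double-checking the installed versions (nothing clever, just the stock tools):

dpkg -s mdadm | grep '^Version'   # Debian/Ubuntu package version
mdadm --version                   # version mdadm itself reports
uname -r                          # running kernel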

>
>> dmesg:
>> [ 6025.573964] md: md100 stopped.
>> [ 6025.595810] md: bind<dm-0>
>> [ 6025.596086] md: bind<dm-5>
>> [ 6025.596364] md: bind<dm-2>
>> [ 6025.596612] md: bind<dm-1>
>> [ 6025.596840] md: bind<dm-4>
>> [ 6025.597026] md: bind<dm-3>
>>
>>>
>>> Can you post the --examine results for all the RAID members? Both for
>>> the original partitions and for the overlay devices after you recreated
>>> the array. There may be differences in data offset, etc. which will
>>> break the filesystem.
>>
>> Original partitions:
>> http://pastebin.com/nHCxidvE
>>
>> overlay:
>> http://pastebin.com/eva4cnu6
>>
>
> Right - these show you have the wrong order. The original partition
> array device roles are:
> sde1: 2
> sda1: 3
> sdf1: 4
> sdb1: 5
> sdc1: 6
> sdg1: 7
>
> and your overlays are:
> dm-1: 1
> dm-0: 2
> dm-5: 3
> dm-3: 4
> dm-2: 5
> dm-4: 7
>
> So the bad news is that you're missing roles 0 & 1, which will be
> mirrors. That means your array is broken unless any other member disks
> can be found.

Am I mistaken to think that the order of disks in an array can be determined from the "Device Role : Active device Z" line in mdadm --examine /dev/sdXN ?
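
If it helps, I can dump the role of every member in one go with something like this (device list taken from the lsblk output above):

for d in /dev/sd[abcefgij]1; do
    echo "== $d =="
    mdadm --examine "$d" | grep 'Device Role'
done
# sdi1 and sdj1 will just report "No md superblock detected", as shown earlier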


>
> If you're certain that sdi1 and sdj1 should be in the array then you can
> try recreating the array (in the correct order) and using sdi1/sdj1 in
> the missing slots and see if one option works. I'll assume the overlay
> mapping is as follows (if not, remap as required):
> sda1 -> dm-0
> sdb1 -> dm-1
> sdc1 -> dm-2
> sde1 -> dm-3
> sdf1 -> dm-4
> sdg1 -> dm-5
> sdi1 -> dm-6
> sdj1 -> dm-7
>
> For each of the following orders, you're going to need to:
> - stop the existing array (mdadm -S /dev/md089)
> - create a new array using --assume-clean
> - check for a valid filesystem (fsck -n /dev/md089)
>
> If the fsck returns without errors then try mounting the filesystem and
> see if all looks okay, otherwise move on to the next order.
>
> The orders to try are:
> - /dev/dm-6 missing /dev/dm-3 /dev/dm-0 /dev/dm-4 /dev/dm-1 /dev/dm-2 /dev/dm-5
> - missing /dev/dm-6 /dev/dm-3 /dev/dm-0 /dev/dm-4 /dev/dm-1 /dev/dm-2 /dev/dm-5
> - /dev/dm-7 missing /dev/dm-3 /dev/dm-0 /dev/dm-4 /dev/dm-1 /dev/dm-2 /dev/dm-5
> - missing /dev/dm-7 /dev/dm-3 /dev/dm-0 /dev/dm-4 /dev/dm-1 /dev/dm-2 /dev/dm-5

Thank you for all the help, I appreciate it.
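
For my own notes, here's roughly what one attempt at that procedure looks like, reusing the original 512K chunk and near=2 layout from --examine (please correct me if I've misread anything):

mdadm -S /dev/md089
mdadm --create /dev/md089 --assume-clean --level=10 --layout=n2 --chunk=512 \
      --raid-devices=8 /dev/dm-6 missing /dev/dm-3 /dev/dm-0 /dev/dm-4 /dev/dm-1 /dev/dm-2 /dev/dm-5
# mdadm will warn that the devices appear to be part of an array; confirm to proceed
fsck -n /dev/md089    # read-only check; if it comes back clean, try a read-only mount
# otherwise stop the array again and repeat with the next order in the list above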
>
> Good luck,
> Robin
> --
> ___
> ( ' } | Robin Hill <robin@xxxxxxxxxxxxxxx> |
> / / ) | Little Jim says .... |
> // !! | "He fallen in de water !!" |
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html