Re: can i recover an all spare raid10 array ?

On Tue Oct 28, 2014 at 09:11:21PM +0200, Roland RoLaNd wrote:

> 
> 
> ----------------------------------------
> > Date: Tue, 28 Oct 2014 18:34:22 +0000
> > From: robin@xxxxxxxxxxxxxxx
> > To: r_o_l_a_n_d@xxxxxxxxxxx
> > CC: robin@xxxxxxxxxxxxxxx; linux-raid@xxxxxxxxxxxxxxx
> > Subject: Re: can i recover an all spare raid10 array ?
> >
> > Please don't top post, it makes conversations very difficult to follow.
> > Responses should go at the bottom, or interleaved with the previous post
> > if responding to particular points. I've moved your previous responses
> > to keep the conversation flow straight.
> >
> > On Tue Oct 28, 2014 at 07:30:50PM +0200, Roland RoLaNd wrote:
> >>
> >>> From: r_o_l_a_n_d@xxxxxxxxxxx
> >>> To: robin@xxxxxxxxxxxxxxx
> >>> CC: linux-raid@xxxxxxxxxxxxxxx
> >>> Subject: Re: can i recover an all spare raid10 array ?
> >>> Date: Tue, 28 Oct 2014 19:29:25 +0200
> >>>
> >>>> Date: Tue, 28 Oct 2014 17:01:11 +0000
> >>>> From: robin@xxxxxxxxxxxxxxx
> >>>> To: r_o_l_a_n_d@xxxxxxxxxxx
> >>>> CC: linux-raid@xxxxxxxxxxxxxxx
> >>>> Subject: Re: can i recover an all spare raid10 array ?
> >>>>
> >>>> On Tue Oct 28, 2014 at 06:22:11PM +0200, Roland RoLaNd wrote:
> >>>>
> >>>>> I have two raid arrays on my system:
> >>>>> raid1: /dev/sdd1 /dev/sdh1
> >>>>> raid10: /dev/sde1 /dev/sda1 /dev/sdf1 /dev/sdb1 /dev/sdc1 /dev/sdg1
> >>>>>
> >>>>>
> >>>>> Two disks had bad sectors: sdd and sdf <<-- they both got hot swapped.
> >>>>> I added sdf back to the raid10 and recovery took place, but adding sdd1
> >>>>> back to the raid1 proved to be troublesome.
> >>>>> As I didn't have anything important on '/', I formatted and installed
> >>>>> Ubuntu 14 on the raid1.
> >>>>>
> >>>>> Now the system is up on the raid1, but the raid10 (md127) is inactive:
> >>>>>
> >>>>> cat /proc/mdstat
> >>>>>
> >>>>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
> >>>>> md127 : inactive sde1[2](S) sdg1[8](S) sdc1[6](S) sdb1[5](S) sdf1[4](S) sda1[3](S)
> >>>>> 17580804096 blocks super 1.2
> >>>>>
> >>>>> md2 : active raid1 sdh4[0] sdd4[1]
> >>>>> 2921839424 blocks super 1.2 [2/2] [UU]
> >>>>> [==>..................] resync = 10.4% (304322368/2921839424) finish=672.5min speed=64861K/sec
> >>>>>
> >>>>> md1 : active raid1 sdh3[0] sdd3[1]
> >>>>> 7996352 blocks super 1.2 [2/2] [UU]
> >>>>>
> >>>>> md0 : active raid1 sdh2[0] sdd2[1]
> >>>>> 292544 blocks super 1.2 [2/2] [UU]
> >>>>>
> >>>>> unused devices: <none>
> >>>>> If I try to assemble md127:
> >>>>>
> >>>>>
> >>>>> mdadm --assemble /dev/md127 /dev/sde1 /dev/sda1 /dev/sdf1 /dev/sdb1 /dev/sdc1 /dev/sdg1
> >>>>> mdadm: /dev/sde1 is busy - skipping
> >>>>> mdadm: /dev/sda1 is busy - skipping
> >>>>> mdadm: /dev/sdf1 is busy - skipping
> >>>>> mdadm: /dev/sdb1 is busy - skipping
> >>>>> mdadm: /dev/sdc1 is busy - skipping
> >>>>> mdadm: /dev/sdg1 is busy - skipping
> >>>>>
> >>>>>
> >>>>> if i try to add one of the disks: mdadm --add /dev/md127 /dev/sdj1
> >>>>> mdadm: cannot get array info for /dev/md127
> >>>>>
> >>>>> if i try:
> >>>>>
> >>>>> mdadm --stop /dev/md127
> >>>>> mdadm: stopped /dev/md127
> >>>>>
> >>>>> then running: mdadm --assemble /dev/md127 /dev/sde1 /dev/sda1 /dev/sdf1 /dev/sdb1 /dev/sdc1 /dev/sdg1
> >>>>>
> >>>>> returns:
> >>>>>
> >>>>> assembled from 5 drives and 1 rebuilding - not enough to start the array
> >>>>>
> >>>>> What does that mean? Is my data lost?
> >>>>>
> >>>>> If I examine one of the md127 raid10 array disks, it shows this:
> >>>>>
> >>>>> mdadm --examine /dev/sde1
> >>>>> /dev/sde1:
> >>>>> Magic : a92b4efc
> >>>>> Version : 1.2
> >>>>> Feature Map : 0x0
> >>>>> Array UUID : ab90d4c8:41a55e14:635025cc:28f0ee76
> >>>>> Name : ubuntu:data (local to host ubuntu)
> >>>>> Creation Time : Sat May 10 21:54:56 2014
> >>>>> Raid Level : raid10
> >>>>> Raid Devices : 8
> >>>>>
> >>>>> Avail Dev Size : 5860268032 (2794.39 GiB 3000.46 GB)
> >>>>> Array Size : 11720534016 (11177.57 GiB 12001.83 GB)
> >>>>> Used Dev Size : 5860267008 (2794.39 GiB 3000.46 GB)
> >>>>> Data Offset : 262144 sectors
> >>>>> Super Offset : 8 sectors
> >>>>> State : clean
> >>>>> Device UUID : a2a5db61:bd79f0ae:99d97f17:21c4a619
> >>>>>
> >>>>> Update Time : Tue Oct 28 10:07:18 2014
> >>>>> Checksum : 409deeb4 - correct
> >>>>> Events : 8655
> >>>>>
> >>>>> Layout : near=2
> >>>>> Chunk Size : 512K
> >>>>>
> >>>>> Device Role : Active device 2
> >>>>> Array State : AAAAAAAA ('A' == active, '.' == missing)
> >>>>>
> >>>>> Used Dev Size : 5860267008 (2794.39 GiB 3000.46 GB) <<--- does this mean I still have my data?
> >>>>>
> >>>>>
> >>>>> the remaining two disks:
> >>>>>
> >>>>> mdadm --examine /dev/sdj1
> >>>>> mdadm: No md superblock detected on /dev/sdj1.
> >>>>> mdadm --examine /dev/sdi1
> >>>>> mdadm: No md superblock detected on /dev/sdi1.
> >>>>
> >>>> The --examine output indicates the RAID10 array was 8 members, not 6.
> >>>> As it stands, you are missing two array members (presumably a mirrored
> >>>> pair as mdadm won't start the array). Without these you're missing 512K
> >>>> of every 2M in the array, so your data is toast (well, with a lot of
> >>>> effort you may recover some files under 1.5M in size).
> >>>>
> >>>> Were you expecting sdi1 and sdj1 to have been part of the original
> >>>> RAID10 array? Have you removed the superblocks from them at any point?
> >>>> For completeness, what mdadm and kernel versions are you running?
> >>>>
> >>>> Cheers,
> >>>> Robin
> >>>
> >>> Thanks for pitching in. Here are the responses to your questions:
> >>>
> >>> - Yes, I expected both of them to be part of the array, though one of
> >>> them had only just been added to the array and hadn't finished recovering
> >>> when the raid1 "/" crashed.
> >>>
> > According to your --examine earlier, the RAID10 rebuild had completed
> > (it shows the array clean and having all disks active). Are you certain
> > that the new RAID1 array isn't using disks that used to be part of the
> > RAID10 array? Regardless, I'd expect the disks to have a superblock if
> > they were part of either array (unless they've been repartitioned?).
> >
> 
> The examine earlier was of one of the 6 disks that belong to the current inactive array; they're all clean.
> As for the raid1/raid10 arrays, that's what I thought too, as it has happened to me before, but lsblk shows the following:
> 
> NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
> sda       8:0    0   2.7T  0 disk  
> └─sda1    8:1    0   2.7T  0 part  
> sdb       8:16   0   2.7T  0 disk  
> └─sdb1    8:17   0   2.7T  0 part  
> sdc       8:32   0   2.7T  0 disk  
> └─sdc1    8:33   0   2.7T  0 part  
> sdd       8:48   0   2.7T  0 disk  
> ├─sdd1    8:49   0     1M  0 part  
> ├─sdd2    8:50   0   286M  0 part  
> │ └─md0   9:0    0 285.7M  0 raid1 /boot
> ├─sdd3    8:51   0   7.6G  0 part  
> │ └─md1   9:1    0   7.6G  0 raid1 [SWAP]
> └─sdd4    8:52   0   2.7T  0 part  
>   └─md2   9:2    0   2.7T  0 raid1 /
> sde       8:64   0   2.7T  0 disk  
> └─sde1    8:65   0   2.7T  0 part  
> sdf       8:80   0   2.7T  0 disk  
> └─sdf1    8:81   0   2.7T  0 part  
> sdg       8:96   0   2.7T  0 disk  
> └─sdg1    8:97   0   2.7T  0 part  
> sdh       8:112  0   2.7T  0 disk  
> ├─sdh1    8:113  0     1M  0 part  
> ├─sdh2    8:114  0   286M  0 part  
> │ └─md0   9:0    0 285.7M  0 raid1 /boot
> ├─sdh3    8:115  0   7.6G  0 part  
> │ └─md1   9:1    0   7.6G  0 raid1 [SWAP]
> └─sdh4    8:116  0   2.7T  0 part  
>   └─md2   9:2    0   2.7T  0 raid1 /
> sdi       8:128  0   2.7T  0 disk  
> └─sdi1    8:129  0   2.7T  0 part  
> sdj       8:144  0   2.7T  0 disk  
> └─sdj1    8:145  0   2.7T  0 part 
> 
> 
> >>> - I have not removed their superblocks, or at least not in a way that I
> >>> am aware of.
> >>>
> >>> - mdadm: 3.2.5-5ubuntu4.1
> >>> - uname -a: 3.13.0-24-generic
> >>>
> > That's a pretty old mdadm version, but I don't see anything in the
> > change logs that looks relevant. Others may be more familiar with issues
> > though.
> 
> That's the latest in my current Ubuntu repository.
> 
> >
> >>>
> >>> PS:
> >>> I just followed this recovery page:
> >>> https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID
> >>> I managed to reach the last step; whenever I tried to mount it, it kept
> >>> asking me for the right filesystem.
> >>>
> > That's good documentation anyway. As long as you stick to the overlay
> > devices your original data is untouched. It's amazing how many people
> > run --create on their original disks and lose any chance of getting the
> > data back.
> 
> Unfortunately I used to be/am one of those people.
> I had bad experiences with this before, so I took it slow and went with
> the overlay documentation.
> All the ebooks I could find about RAID discuss the differences between
> the various RAID levels, but none are thorough when it comes to setting
> up/troubleshooting RAID.
> And once I do fix my issue, I move on to the next firefighting
> situation, so I lose interest due to lack of time.
> 
> >
> >> Correction: I couldn't force assemble the RAID devices, so instead I issued:
> >> mdadm --create /dev/md089 --assume-clean --level=10 --verbose --raid-devices=8 missing /dev/dm-1 /dev/dm-0 /dev/dm-5 /dev/dm-3 /dev/dm-2 missing /dev/dm-4
> >> which got it into degraded state
> >>
> >
> > What error did you get when you tried to force assemble (both from mdadm
> > and anything reported via dmesg)? The device order you're using would
> > suggest that the missing disks wouldn't be mirrors of each other, so the
> > data should be okay.
> 
> mdadm --assemble --force /dev/md100 $OVERLAYS
> mdadm: /dev/md100 assembled from 5 drives and  1 rebuilding - not enough to start the array.
> 
That's very odd - all the --examine results for the original disks show
the array as clean. That would suggest an issue with the installed
version of mdadm but it doesn't really matter in this case - see below.

> dmesg:
> [ 6025.573964] md: md100 stopped.
> [ 6025.595810] md: bind<dm-0>
> [ 6025.596086] md: bind<dm-5>
> [ 6025.596364] md: bind<dm-2>
> [ 6025.596612] md: bind<dm-1>
> [ 6025.596840] md: bind<dm-4>
> [ 6025.597026] md: bind<dm-3>
> 
> >
> > Can you post the --examine results for all the RAID members? Both for
> > the original partitions and for the overlay devices after you recreated
> > the array. There may be differences in data offset, etc. which will
> > break the filesystem.
> 
> Original partitions:
> http://pastebin.com/nHCxidvE
> 
> overlay:
> http://pastebin.com/eva4cnu6
> 

Right - these show you have the wrong order. The original partition
array device roles are:
  sde1: 2
  sda1: 3
  sdf1: 4
  sdb1: 5
  sdc1: 6
  sdg1: 7

and your overlays are:
  dm-1: 1
  dm-0: 2
  dm-5: 3
  dm-3: 4
  dm-2: 5
  dm-4: 7
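
(Those roles are read straight from the metadata, so if you want to
re-check them yourself, something along these lines should show them for
both the real partitions and the overlays - the device list here is just
an assumption based on your lsblk output:)

    for d in /dev/sd[abcefg]1 /dev/dm-*; do
        echo "== $d"
        mdadm --examine "$d" | grep -E 'Device Role|Events'
    done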

So the bad news is that you're missing roles 0 & 1, which will be
mirrors. That means your array is broken unless any other member disks
can be found.
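
To illustrate why, here's a rough sketch of an 8-device near=2 layout
(each data chunk is written to two adjacent device roles):

    role:    0   1  |  2   3  |  4   5  |  6   7
    stripe0: c0  c0 |  c1  c1 |  c2  c2 |  c3  c3
    stripe1: c4  c4 |  c5  c5 |  c6  c6 |  c7  c7

Roles 0 and 1 hold the only two copies of chunks c0, c4, c8, ..., so with
both of them gone you lose 512K out of every 2M of data - the same point
as earlier in the thread.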

If you're certain that sdi1 and sdj1 should be in the array, then you can
try recreating it (in the correct order) with sdi1/sdj1 in the missing
slots and see if one option works. I'll assume the overlay mapping is as
follows (if not, remap as required):
         sda1 -> dm-0
         sdb1 -> dm-1
         sdc1 -> dm-2
         sde1 -> dm-3
         sdf1 -> dm-4
         sdg1 -> dm-5
         sdi1 -> dm-6
         sdj1 -> dm-7
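
It's worth double-checking that mapping before recreating anything. One
way (a sketch, assuming the overlays are plain device-mapper snapshots as
per the wiki page) is to look at the slaves entries in sysfs - each dm
device should list its backing partition plus the loop device holding the
overlay file:

    for d in /sys/block/dm-*; do
        echo "$d -> $(ls $d/slaves | tr '\n' ' ')"
    done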

For each of the following orders, you're going to need to:
    - stop the existing array (mdadm -S /dev/md089)
    - create a new array using --assume-clean
    - check for a valid filesystem (fsck -n /dev/md089)

If the fsck returns without errors then try mounting the filesystem and
see if all looks okay, otherwise move on to the next order.

The orders to try are:
- /dev/dm-6 missing /dev/dm-3 /dev/dm-0 /dev/dm-4 /dev/dm-1 /dev/dm-2 /dev/dm-5
- missing /dev/dm-6 /dev/dm-3 /dev/dm-0 /dev/dm-4 /dev/dm-1 /dev/dm-2 /dev/dm-5
- /dev/dm-7 missing /dev/dm-3 /dev/dm-0 /dev/dm-4 /dev/dm-1 /dev/dm-2 /dev/dm-5
- missing /dev/dm-7 /dev/dm-3 /dev/dm-0 /dev/dm-4 /dev/dm-1 /dev/dm-2 /dev/dm-5
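
As a concrete example, the first order would be tried with something like
this (a sketch only - the level, layout, chunk size and metadata version
are copied from your --examine output above, so double-check them against
your own output, and keep everything on the overlay devices):

    mdadm --stop /dev/md089
    mdadm --create /dev/md089 --assume-clean --verbose \
          --level=10 --layout=n2 --chunk=512 --metadata=1.2 --raid-devices=8 \
          /dev/dm-6 missing /dev/dm-3 /dev/dm-0 /dev/dm-4 /dev/dm-1 /dev/dm-2 /dev/dm-5
    fsck -n /dev/md089

It's also worth comparing the Data Offset in a fresh --examine of the
overlays against the original 262144 sectors - if your mdadm picks a
different default, the filesystem won't line up. If the fsck comes back
clean, mount read-only (mount -o ro /dev/md089 /mnt) to have a look at
the data.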

Good luck,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@xxxxxxxxxxxxxxx> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |
