rescue an alien md raid5

Harry Mangalam <harry.mangalam@xxxxxxx> · Mon, 23 Feb 2009 10:13:45 -0800

Here's an unusual (long) tale of woe.

We had a USRobotics 8700 NAS appliance with 4 SATA disks in RAID5:
 <http://www.usr.com/support/product-template.asp?prod=8700>
which was a fine (if crude) ARM-based Linux NAS until it stroked out 
at some point, leaving us with a degraded RAID5 and comatose NAS 
device.

We'd like to get the files back, of course, so I've moved the disks 
to a Linux PC, hooked them up to a cheap Silicon Image 4x SATA 
controller, and brought up the whole frankenmess with mdadm.  It 
reported a clean but degraded array:

===============================================================

root@pnh-rcs:/# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Wed Feb 14 16:30:17 2007
     Raid Level : raid5
     Array Size : 1464370176 (1396.53 GiB 1499.52 GB)
  Used Dev Size : 488123392 (465.51 GiB 499.84 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Fri Dec 12 20:26:27 2008
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 7a60cd58:ad85ebdc:3b55d79a:a33c7fe6
         Events : 0.264294

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       35        1      active sync   /dev/sdc3
       2       8       51        2      active sync   /dev/sdd3
       3       8       67        3      active sync   /dev/sde3
===============================================================
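
As a sanity check on the numbers above, here is a minimal sketch of the 
RAID5 capacity arithmetic: one member's worth of space goes to parity, 
so the array should hold (Raid Devices - 1) x Used Dev Size.  The 
function name is made up for illustration; the inputs are the values 
mdadm printed.

```shell
#!/bin/sh
# Usable RAID5 capacity in KiB: (members - 1) * per-member size.
# raid5_array_kib is a hypothetical helper name, not an mdadm feature.
raid5_array_kib() {
    members=$1        # "Raid Devices" from mdadm --detail
    used_dev_kib=$2   # "Used Dev Size" from mdadm --detail, in KiB
    echo $(( (members - 1) * used_dev_kib ))
}

raid5_array_kib 4 488123392   # prints 1464370176
```

That matches the reported Array Size, so the array geometry itself 
looks self-consistent.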

The original 500G Maxtor disks (/dev/sd[bcde]) were each formatted 
with 3 partitions, as shown below for sdc.  Disk sdb was bad, so I 
had to replace it.

===============================================================
Disk /dev/sdc: 500.1 GB, 500107862016 bytes
16 heads, 63 sectors/track, 969021 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1         261      131543+  83  Linux
/dev/sdc2             262         522      131544   82  Linux swap / Solaris
/dev/sdc3             523      969022   488123496+  89  Unknown
===============================================================
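
A sketch of how the sdc3 partition size above relates to the "Used Dev 
Size" that mdadm reports.  This assumes the usual version 0.90 layout, 
where the md superblock sits in the last 64 KiB-aligned 64 KiB block of 
each member and everything before it is data.  The function name is 
hypothetical; the fdisk "Blocks" column is taken to be in 1 KiB units 
(the trailing "+" is an extra half block, which the rounding drops 
anyway).

```shell
#!/bin/sh
# Data area of a version-0.90 md member, in KiB: round the partition
# size down to a 64 KiB boundary, then subtract the 64 KiB reserved
# for the superblock at the end.
# md090_usable_kib is a hypothetical helper name.
md090_usable_kib() {
    part_kib=$1
    echo $(( part_kib / 64 * 64 - 64 ))
}

md090_usable_kib 488123496   # sdc3: prints 488123392
```

The result equals the Used Dev Size of 488123392, which suggests the 
array really does live entirely inside the third partition.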

I formatted the replacement (different make/layout - Seagate) as a 
single partition:
/dev/sdb1:
===============================================================
Disk /dev/sdb: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x21d01216

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       60801   488384001   83  Linux
===============================================================

and tried to rebuild the array by stopping it, removing the bad 
disk, and adding the new disk.  It came up and reported that it was 
rebuilding.  After several hours, it finished and reported itself 
clean (although during a reboot, it became /dev/md1 instead of md0):
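
The exact commands used for the rebuild are not recorded here, so the 
following is only a sketch of a typical mdadm sequence for the steps 
just described, not a transcript.  The wrapper names (run, 
rebuild_degraded) and the DO_IT switch are made up for illustration; it 
prints the commands by default rather than touching any device.

```shell
#!/bin/sh
# Dry run by default: print each command instead of executing it.
# Set DO_IT=1 only after checking the device names.
run() {
    if [ "${DO_IT:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# rebuild_degraded ARRAY NEW_MEMBER SURVIVING_MEMBER...
rebuild_degraded() {
    md=$1
    new=$2
    shift 2
    run mdadm --stop "$md"
    # Slot 0 already showed as "removed", so no explicit --remove here;
    # --run starts the array even though it is degraded.
    run mdadm --assemble --run "$md" "$@"
    # Adding the replacement triggers the resync onto it.
    run mdadm "$md" --add "$new"
}

rebuild_degraded /dev/md0 /dev/sdb1 /dev/sdc3 /dev/sdd3 /dev/sde3
```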

===============================================================
$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sdb1[0] sde3[3] sdd3[2] sdc3[1]
      1464370176 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
===============================================================

===============================================================
$ mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Wed Feb 14 16:30:17 2007
     Raid Level : raid5
     Array Size : 1464370176 (1396.53 GiB 1499.52 GB)
  Used Dev Size : 488123392 (465.51 GiB 499.84 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Mon Feb 23 09:06:27 2009
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 7a60cd58:ad85ebdc:3b55d79a:a33c7fe6
         Events : 0.265494

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       35        1      active sync   /dev/sdc3
       2       8       51        2      active sync   /dev/sdd3
       3       8       67        3      active sync   /dev/sde3
===============================================================

The docs and files on the USR web site imply that the native 
filesystem was originally XFS, but when I try to mount it as such, I 
can't:

mount -vvv -t xfs /dev/md1 /mnt
mount: fstab path: "/etc/fstab"
mount: lock path:  "/etc/mtab~"
mount: temp path:  "/etc/mtab.tmp"
mount: no LABEL=, no UUID=, going to mount /dev/md1 by path
mount: spec:  "/dev/md1"
mount: node:  "/mnt"
mount: types: "xfs"
mount: opts:  "(null)"
mount: mount(2) syscall: source: "/dev/md1", target: "/mnt", filesystemtype: "xfs", mountflags: -1058209792, data: (null)
mount: wrong fs type, bad option, bad superblock on /dev/md1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

and when I check dmesg:
[  245.008000] SGI XFS with ACLs, security attributes, realtime, large block numbers, no debug enabled
[  245.020000] SGI XFS Quota Management subsystem
[  245.020000] XFS: SB read failed
[  327.696000] md: md0 stopped.
[  327.696000] md: unbind<sdc1>
[  327.696000] md: export_rdev(sdc1)
[  327.696000] md: unbind<sde1>
[  327.696000] md: export_rdev(sde1)
[  327.696000] md: unbind<sdd1>
[  327.696000] md: export_rdev(sdd1)
[  439.660000] XFS: bad magic number
[  439.660000] XFS: SB validate failed

Repeated mount attempts just repeat the last 2 lines above.  This 
implies that the superblock is bad, and xfs_repair reports the same:
xfs_repair /dev/md1
        - creating 2 worker thread(s)
Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!

attempting to find secondary superblock...
...... <lots of ...>  ... 
..found candidate secondary superblock...
unable to verify superblock, continuing...
<lots of ...>  ... 
...found candidate secondary superblock...
unable to verify superblock, continuing...
<lots of ...>  ... 
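
For reference, "bad magic number" means the first four bytes of the 
device are not the XFS signature "XFSB" (hex 58 46 53 42).  A minimal 
sketch of checking that by hand, using only dd, od and tr; the function 
name is made up, and the demo runs against a scratch file rather than 
the real array:

```shell
#!/bin/sh
# Succeeds if the first 4 bytes of the given file or device are "XFSB".
# has_xfs_magic is a hypothetical helper name.
has_xfs_magic() {
    magic=$(dd if="$1" bs=4 count=1 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "58465342" ]
}

# Demo on a scratch file; on the real system the argument would be
# /dev/md1 (read-only access is enough).
scratch=$(mktemp)
printf 'XFSB' > "$scratch"
if has_xfs_magic "$scratch"; then
    echo "XFS magic found at offset 0"
else
    echo "no XFS magic at offset 0"
fi
rm -f "$scratch"
```

This only looks at offset 0 of whatever device it is given, so it says 
nothing about where a valid superblock might otherwise be.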

So my question is: what should I do now?  Were those first 2 
partitions (which I didn't create on the replacement disk) important?  
Should I remove the replacement disk, create 3 partitions on it, and 
try again, or am I just well and truly hosed?

-- 
Harry Mangalam - Research Computing, NACS, E2148, Engineering Gateway, 
UC Irvine 92697  949 824-0084(o), 949 285-4487(c)
---
Good judgment comes from experience; 
Experience comes from bad judgment. [F. Brooks.]