Hi Julian,

Very good report.  I think we can help.

On 01/11/2014 01:42 AM, Großkreutz, Julian wrote:
> Dear all, dear Neil (thanks for pointing me to this list),
>
> I am in desperate need of help. mdadm is fantastic work, and I have
> relied on mdadm for years to run very stable server systems, never had
> major problems I could not solve.
>
> This time it's different:
>
> On a CentOS 6.x (can't remember which), initially in 2012:
>
> parted to create GPT partitions on 5 Seagate drives, 3 TB each
>
> Model: ATA ST3000DM001-9YN1 (scsi)
> Disk /dev/sda: 5860533168s    # sd[bcde] identical
> Sector size (logical/physical): 512B/4096B
> Partition Table: gpt
>
> Number  Start      End          Size         File system  Name     Flags
>  1      2048s      1953791s     1951744s     ext4         boot
>  2      1955840s   5860532223s  5858576384s               primary  raid

Ok.  Please also show the partition tables for /dev/sd[fgh].

> I used an unknown mdadm version including unknown offset parameters for
> 4k alignment to create
>
> /dev/sd[abcde]1 as /dev/md0 raid 1 for booting (1 GB)
> /dev/sd[abcde]2 as /dev/md1 raid 6 for data (9 TB) lvm physical drive
>
> Later added 3 more 3T identical Seagate drives with identical partition
> layout, but later firmware.
>
> Using likely a different, newer version of mdadm I expanded the RAID 6
> by 2 drives and added 1 spare.
>
> /dev/md1 was at 15 TB gross, 13 TB usable, expanded pv
>
> Ran fine

Ok.  The evidence below suggests you created the larger array from
scratch instead of using --grow.  Do you remember?

> Then I moved the 8 disks to a new server with an hba and backplane; the
> array did not start because mdadm did not find the superblocks on the
> original 5 devices /dev/sd[abcde]2. Moving the disks back to the old
> server, the error did not vanish. Using a CentOS 6.3 livecd, I got the
> following:
>
> [root@livecd ~]# mdadm -Evvvvs /dev/sd[abcdefgh]2
> mdadm: No md superblock detected on /dev/sda2.
> mdadm: No md superblock detected on /dev/sdb2.
> mdadm: No md superblock detected on /dev/sdc2.
> mdadm: No md superblock detected on /dev/sdd2.
> mdadm: No md superblock detected on /dev/sde2.
>
> /dev/sdf2:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
>            Name : 1
>   Creation Time : Wed Jul 31 18:24:38 2013

Note this creation time...  it would have been 2012 if you had used --grow.

>      Raid Level : raid6
>    Raid Devices : 7
>
>  Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
>      Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
>   Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)

This used dev size is very odd.  The unused space after the data area is
5858314240 - 5857158656 = 1155584 sectors (roughly 564 MiB).

>     Data Offset : 262144 sectors
>    Super Offset : 8 sectors
>           State : active
>     Device UUID : d5a16cb2:ff41b9a5:cbbf12b7:3750026d
>
>     Update Time : Mon Dec 16 01:16:26 2013
>        Checksum : ee921c43 - correct
>          Events : 327
>
>          Layout : left-symmetric
>      Chunk Size : 256K
>
>     Device Role : Active device 5
>     Array State : A.AAAAA ('A' == active, '.' == missing)
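By the way, if it is easier than pasting full dumps, a quick sketch like
this (just a suggestion, using the same device names as above) lines up
the interesting superblock fields of the three members that still have
metadata so they can be compared side by side:

# Compare key superblock fields across the members that still report metadata.
for x in /dev/sd[f-h]2 ; do
  echo "=== $x ==="
  mdadm -E $x | grep -E 'Creation Time|Data Offset|Avail Dev Size|Used Dev Size|Events|Device Role'
done

If all three agree on those values, we at least know exactly what the
recreated array looked like before we go hunting for the old superblocks
on /dev/sd[a-e]2.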
> /dev/sdg2:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
>            Name : 1
>   Creation Time : Wed Jul 31 18:24:38 2013
>      Raid Level : raid6
>    Raid Devices : 7
>
>  Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
>      Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
>   Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
>     Data Offset : 262144 sectors
>    Super Offset : 8 sectors
>           State : active
>     Device UUID : a1e1e51b:d8912985:e51207a9:1d718292
>
>     Update Time : Mon Dec 16 01:16:26 2013
>        Checksum : 4ef01fe9 - correct
>          Events : 327
>
>          Layout : left-symmetric
>      Chunk Size : 256K
>
>     Device Role : Active device 6
>     Array State : A.AAAAA ('A' == active, '.' == missing)
>
> /dev/sdh2:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
>            Name : 1
>   Creation Time : Wed Jul 31 18:24:38 2013
>      Raid Level : raid6
>    Raid Devices : 7
>
>  Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
>      Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
>   Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
>     Data Offset : 262144 sectors
>    Super Offset : 8 sectors
>           State : active
>     Device UUID : 030cb9a7:76a48b3c:b3448369:fcf013e1
>
>     Update Time : Mon Dec 16 01:16:26 2013
>        Checksum : a1330e97 - correct
>          Events : 327
>
>          Layout : left-symmetric
>      Chunk Size : 256K
>
>     Device Role : spare
>     Array State : A.AAAAA ('A' == active, '.' == missing)
>
> I suspect that the superblock of the original 5 devices is at a
> different location, possibly because they were created with a different
> mdadm version, i.e. at the end of the partitions. Booting the drives
> with the hba in IT (non-raid) mode on the new server may have introduced
> an initialization on the first five drives at the end of the partitions,
> because I can hexdump something with "EFI PART" in the last 64 kb in all
> 8 partitions used for the raid 6, which may not have affected the 3
> added drives which show metadata 1.2.

The "EFI PART" is part of the backup copy of the GPT.  All the drives in
a working array will have the same metadata version (superblock
location) even if the data offsets are different.

I would suggest hexdumping the entire devices looking for the MD
superblock magic value, which will always be at the start of a
4k-aligned block.  (The magic a92b4efc is stored little-endian, hence
the reversed byte order below.)  Show (will take a long time, even with
the big block size):

for x in /dev/sd[a-e]2 ; do
  echo -e "\nDevice $x"
  dd if=$x bs=1M | hexdump -C | grep "000  fc 4e 2b a9"
done

For any candidates found, hexdump the whole 4k block for us.
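If that grep does hit anything, the offset hexdump prints at the start
of the matching line is the byte offset of that 4k block within the
partition (in hex).  Just as a sketch, with a made-up offset and device
name you would substitute from your own output, pulling out the whole
block would look something like:

# Hypothetical example: the grep above printed a match at hex offset 2cf24000
# on /dev/sda2.  Dump just that 4k block (byte offset / 4096 = block number):
OFFSET=0x2cf24000     # placeholder -- use the offset from your own grep output
dd if=/dev/sda2 bs=4096 skip=$(( OFFSET / 4096 )) count=1 2>/dev/null | hexdump -C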
> If any of you can help me sort this out I would greatly appreciate it.
> I guess I need the mdadm version where I can set the data offset
> differently for each device, but it doesn't compile with an error in
> sha1.c:
>
> sha1.h:29:22: Fehler: ansidecl.h: Datei oder Verzeichnis nicht gefunden
> (error in German: ansidecl.h: no such file or directory)

You probably need some *-dev packages.  I don't use the RHEL platform,
so I'm not sure what you'd need.  In the Ubuntu world, it'd be the
"build-essential" meta-package.

> What would be the best way to proceed? There is critical data on this
> raid, not fully backed up.
>
> (UPD'T)
>
> Thanks for getting back.
>
> Yes, it's bad, I know, also tweaking without keeping exact records of
> versions and offsets.
>
> I am, however, rather sure that nothing was written to the disks when I
> plugged them into the NEW server, unless starting up a live cd causes
> an automatic assemble attempt with an update to the superblocks. That I
> cannot exclude.
>
> What I did so far w/o writing to the disks:
>
> get non-00 data at the beginning of sda2:
>
> dd if=/dev/sda skip=1955840 bs=512 count=10 | hexdump -C | grep [^00]

FWIW, you could have combined "if=/dev/sda skip=1955840" into
"if=/dev/sda2" . . .  :-)

> gives me
>
> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00001000  1e b5 54 51 20 4c 56 4d  32 20 78 5b 35 41 25 72  |..TQ LVM2 x[5A%r|
> 00001010  30 4e 2a 3e 01 00 00 00  00 10 00 00 00 00 00 00  |0N*>............|
> 00001020  00 00 02 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> 00001030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00001200  76 67 5f 6e 65 64 69 67  73 30 32 20 7b 0a 69 64  |vg_nedigs02 {.id|
> 00001210  20 3d 20 22 32 4c 62 48  71 64 2d 72 67 42 74 2d  | = "2LbHqd-rgBt-|
> 00001220  45 4a 75 31 2d 32 52 36  31 2d 41 35 7a 74 2d 6e  |EJu1-2R61-A5zt-n|
> 00001230  49 58 53 2d 66 79 4f 36  33 73 22 0a 73 65 71 6e  |IXS-fyO63s".seqn|
> 00001240  6f 20 3d 20 37 0a 66 6f  72 6d 61 74 20 3d 20 22  |o = 7.format = "|
> 00001250  6c 76 6d 32 22 20 23 20  69 6e 66 6f 72 6d 61 74  |lvm2" # informat|
> (cont'd)

This implies that /dev/sda2 is the first device in a raid5/6 that uses
metadata 0.9 or 1.0.  You've found the LVM PV signature, which starts at
4k into a PV.  Theoretically, this could be a stray, abandoned signature
from the original array, with the real LVM signature at the
262144-sector offset.  Show:

dd if=/dev/sda2 skip=262144 count=16 | hexdump -C

> but on /dev/sdb
>
> 00000000  5f 80 00 00 5f 80 01 00  5f 80 02 00 5f 80 03 00  |_..._..._..._...|
> 00000010  5f 80 04 00 5f 80 0c 00  5f 80 0d 00 00 00 00 00  |_..._..._.......|
> 00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00001000  60 80 00 00 60 80 01 00  60 80 02 00 60 80 03 00  |`...`...`...`...|
> 00001010  60 80 04 00 60 80 0c 00  60 80 0d 00 00 00 00 00  |`...`...`.......|
> 00001020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00001400
>
> so my initial guess that the data may start at 00001000 did not pan out.

No, but with parity raid scattering data amongst the participating
devices, the report on /dev/sdb2 is expected.

> Does anybody have an idea of how to reliably identify an mdadm
> superblock in a hexdump of the drive ?

Above.

> And second, have I got my numbers right ? In parted I see the block
> count, and when I multiply 512 (not 4096!) with the total count I get
> 3 TB, so I think I have to use bs=512 in dd to get the partition
> boundaries correct.

dd uses bs=512 as the default.  And it can access the partitions
directly.

> As for the last state: one drive was set faulty, apparently, but the
> spare had not been integrated. I may have gotten caught in a bug
> described by Neil Brown, where on shutdown disks were wrongly reported,
> and subsequently superblock information was overwritten.

Possible.  If so, you may not find any superblocks with the grep above.

> I don't have NAS/SAN storage space to make identical copies of 5x3 TB,
> but maybe I should buy 5 more disks and do a dd mirror so I have a
> backup of the current state.

We can do some more non-destructive investigation first.

Regards,

Phil
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html