Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock

On 01/14/2014 05:31 AM, Großkreutz, Julian wrote:
> Hi Phil,
> 
> thanks again for bearing with me.

No problem.

>>>>> Model: ATA ST3000DM001-9YN1 (scsi)
>>
>> Aside: This model looks familiar.  I'm pretty sure these drives are
>> desktop models that lack scterc support.  Meaning they are *not*
>> generally suitable for raid duty.  Search the archives for combinations
>> of "timeout mismatch", "scterc", "URE", and "scrub" for a full
>> explanation.  If I've guessed correctly, you *must* use the driver
>> timeout work-around before proceeding.
>>
> 
> Yes I did, and smartctl showed no significant problems.

Hmm.  What did "smartctl -l scterc" say?  If it says unsupported, you have
a problem.  The workaround is to set the driver timeouts to ~180 seconds
for each such drive.

If scterc is supported, but disabled, you can set 7-second timeouts with
"smartctl -l scterc,70,70", but you must do so on every power cycle.
Either way, you need boot-time scripting or distro support.

Raid-rated drives power up with a reasonable setting here.
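
For reference, the checks look roughly like this (sdX is a placeholder;
run it against every member drive):

    # Does the drive support scterc, and is it enabled?
    smartctl -l scterc /dev/sdX

    # Supported but disabled: enable 7.0-second read/write timeouts
    # (does not survive a power cycle, so script it at boot)
    smartctl -l scterc,70,70 /dev/sdX

    # Not supported at all: raise the kernel driver timeout instead
    echo 180 > /sys/block/sdX/device/timeout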

> The 10 year old
> server (supermicro enterprise grade dual Xeon with 8 GB ECC RAM) had
> started to create problems early January which is why I wanted to move
> the drives to a new server in the first place, to then transfer the data
> to a new set of enterprise grade disks. I had checked the memory and the
> disks in a burn-in for several days, including timeout and power saving,
> before I set up the raid in 2012/2013, and did not have any issues then.

Ok.  This makes sense.

> One of the reasons I tend to use mdadm is that I am able to utilize
> existing hardware to create bridging solutions until money comes in for
> better hardware, and moving an mdadm raid has so far never created a
> serious problem.

Many people discover the timeout problem the first time they have an
otherwise correctable read error in their array, and the array falls
apart instead.  This list's archives are well-populated with such cases.

>>> So attached you will find hexdumps of 64k of /dev/sd[a-h]2 at sector 0
>>> and 262144 which shows the superblock 1.2 on sd[fgh]2, not on sd[a-e]2,
>>> but may help to identify data_offset; I suspect it is 2048 on sd[a-e]2
>>> and 262144 on sd[fgh]2.
>>>
>>
>> Jackpot!  LVM2 embedded backup data at the correct location for mdadm
>> data offset == 262144.  And on /dev/sda2, which is the only device that
>> should have it (first device in the raid).
>>
>> From /dev/sda2 @ 262144:
>>
>>> 00001200  76 67 5f 6e 65 64 69 67  73 30 32 20 5d 0a 69 64  |vg_nedigs02 ].id|
>>> 00001210  20 3d 20 22 32 4c 62 48  71 64 2d 72 67 42 9f 6e  | = "2LbHqd-rgB.n|
>>> 00001220  45 4a 75 31 2d 32 52 36  31 2d 41 35 f5 75 2d 6e  |EJu1-2R61-A5.u-n|
>>> 00001230  49 58 53 2d 66 79 4f 36  33 73 22 0a 73 65 3a 01  |IXS-fyO63s".se:.|
>>> 00001240  6f 20 3d 20 33 36 0a 66  6f 72 6d 61 ca 24 3d 20  |o = 36.forma.$= |
>>> 00001250  22 6c 76 6d 32 22 20 23  20 69 6e 66 6f 72 6b ac  |"lvm2" # infork.|
>> ...
>>> 00001a70  20 31 33 37 35 32 38 37  39 37 39 09 23 20 d2 32  | 1375287979.# .2|
>>> 00001a80  64 20 4a 75 6c 20 33 31  20 31 38 3a af 37 3a 31  |d Jul 31 18:.7:1|
>>> 00001a90  39 20 32 30 31 33 0a 0a  00 00 00 00 00 00 ee 12  |9 2013..........|
>>
>> Note the creation date/time at the end (with a corrupted byte):
>>
>> Jul 31 18:?7:19 2013
>>
>> There are other corrupted bytes scattered around.  I'd be worried about
>> the RAM in this machine.  Since you are using non-enterprise drives, I'm
>> going to go out on a limb here and guess that the server doesn't have
>> ECC ram...
> see above

Understood.  With really old memory, double-faults in the ECC could have
panic'd the server, leaving scattered data unwritten.

>> Consider performing an extended memcheck run to see what's going on.
>> Maybe move the entire stack of disks to another server.
>>
> That's what I did initially, moved it back because it failed; now I will
> move again into the new server before proceeding.

Ok.

>> Based on the signature discovered above, we should be able to --create
>> --assume-clean with the modern default data offset.  We know the
>> following device roles:
>>
>> /dev/sda2 == 0
>> /dev/sdf2 == 5
>> /dev/sdg2 == 6
>> /dev/sdh2 == spare
>>
>> So /dev/sdh2 should be left out until the array is working.
>>
>> Please re-execute the "mdadm -E" reports for /dev/sd[fgh]2 and show them
>> uncut.  (Use the lasted mdadm.)  That should fill in the likely device
>> order of the remaining drives.

Hmmm.  Typo on my part: s/lasted/latest/  Newer mdadm will give more
information.  In particular, I wanted the tail of each report where each
device lists what it last knew about all of the other devices' roles.

> [root@livecd mnt]# mdadm -E /dev/sd[fgh]2
> 
> /dev/sdf2:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
>            Name : 1
>   Creation Time : Wed Jul 31 18:24:38 2013
>      Raid Level : raid6
>    Raid Devices : 7
> 
>  Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
>      Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
>   Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
>     Data Offset : 262144 sectors
>    Super Offset : 8 sectors
>           State : active
>     Device UUID : d5a16cb2:ff41b9a5:cbbf12b7:3750026d
> 
>     Update Time : Mon Dec 16 01:16:26 2013
>        Checksum : ee921c43 - correct
>          Events : 327
> 
>          Layout : left-symmetric
>      Chunk Size : 256K
> 
>    Device Role : Active device 5
>    Array State : A.AAAAA ('A' == active, '.' == missing)

I was expecting more info after this.

> /dev/sdg2:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
>            Name : 1
>   Creation Time : Wed Jul 31 18:24:38 2013
>      Raid Level : raid6
>    Raid Devices : 7
> 
>  Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
>      Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
>   Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
>     Data Offset : 262144 sectors
>    Super Offset : 8 sectors
>           State : active
>     Device UUID : a1e1e51b:d8912985:e51207a9:1d718292
> 
>     Update Time : Mon Dec 16 01:16:26 2013
>        Checksum : 4ef01fe9 - correct
>          Events : 327
> 
>          Layout : left-symmetric
>      Chunk Size : 256K
> 
>    Device Role : Active device 6
>    Array State : A.AAAAA ('A' == active, '.' == missing)

And here.

> /dev/sdh2:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
>            Name : 1
>   Creation Time : Wed Jul 31 18:24:38 2013
>      Raid Level : raid6
>    Raid Devices : 7
> 
>  Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
>      Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
>   Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
>     Data Offset : 262144 sectors
>    Super Offset : 8 sectors
>           State : active
>     Device UUID : 030cb9a7:76a48b3c:b3448369:fcf013e1
> 
>     Update Time : Mon Dec 16 01:16:26 2013
>        Checksum : a1330e97 - correct
>          Events : 327
> 
>          Layout : left-symmetric
>      Chunk Size : 256K
> 
>    Device Role : spare
>    Array State : A.AAAAA ('A' == active, '.' == missing)

And here.

>> Also, it is important that you document which drive serial numbers are
>> currently occupying the different device names.  An excerpt from "ls -l
>> /dev/disk/by-id/" would do.
> 
> scsi-SATA_ST3000DM001-9YN_S1F026VJ -> ../../sda
> scsi-SATA_ST3000DM001-9YN_W1F0TB3C -> ../../sdb
> scsi-SATA_ST3000DM001-9YN_S1F04KAK -> ../../sdc
> scsi-SATA_ST3000DM001-9YN_W1F0RWJY -> ../../sdd
> scsi-SATA_ST3000DM001-9YN_S1F08N7Q -> ../../sde
> scsi-SATA_ST3000DM001-9YN_Z1F1F3TC -> ../../sdf
> scsi-SATA_ST3000DM001-9YN_W1F1ZZ9T -> ../../sdg
> scsi-SATA_ST3000DM001-9YN_Z1F1X0AC -> ../../sdh

Ok.  Be sure to recheck this list any time you boot, since the device
order matters.

> I am a bit more relaxed now because I found that a scheduled transfer of
> the data to the university tape robot had completed before Christmas. So
> this local archive mirror is (luckily) not critical. I still want to
> understand whether all this is just a result of shaky hardware, or an
> mdadm (misuse) issue. Losing (all superblocks on) five drives in a large
> software raid 6, instead of just a few bytes, is not something I would
> like to repeat any time soon by, e.g., mishandling mdadm.

I think you skated over the edge due to a flaky motherboard.  mdadm
can't fix that.  In fact, since you have a backup, I personally wouldn't
bother with further reconstruction efforts.  If you have a recent
vgcfgbackup, it's doable, but I have little confidence in the device
order: [a????fg], probably [abcdefg].  There are 4! == 24 permutations
there, each of which will require a vgcfgrestore before you can check
the reconstruction with "fsck -n".
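
If you do try it anyway (on the copies, not the originals), one attempt
would look roughly like this.  This is only a sketch: /dev/md1, the
b/c/d/e order shown, the backup file path, and the LV name "lv_archive"
are placeholders, and it assumes the modern mdadm default data offset
of 262144 sectors applies.

    # one candidate order for slots 1-4
    mdadm --create /dev/md1 --assume-clean --metadata=1.2 --level=6 \
        --raid-devices=7 --chunk=256 --layout=left-symmetric \
        /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2 /dev/sde2 /dev/sdf2 /dev/sdg2

    # sanity check: "mdadm -E /dev/sda2" should now show
    # Data Offset : 262144 sectors before you go any further

    vgcfgrestore -f /path/to/vg_nedigs02.backup vg_nedigs02
    vgchange -ay vg_nedigs02
    fsck -n /dev/vg_nedigs02/lv_archive   # read-only check, no repairs

    # wrong order?  tear down and repeat with the next permutation
    vgchange -an vg_nedigs02
    mdadm --stop /dev/md1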

> So we have
> 
> Wed Jul 31 18:24:38 2013 on sd[f-h]2 for creation of the raid6 and
> Wed Jul 31 18:?7:19 2013 for creation of the lvm group
> 
> could well be.

I don't see how such a timestamp could end up there by accident, so I'd
say "certainly was" rather than "could well be".

> So I will move the disks to the new server, make 1:1 copies to new
> drives, and then attempt an assembly using --assume-clean. In which
> order?

All permutations of [a????fg] with b, c, d, and e.

Try likely combinations gleaned from "mdadm -E" reports first to
shortcut the process.
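
If it helps, a throwaway loop like this (just a sketch) prints the 24
candidate orders so you can tick them off as you go:

    for w in b c d e; do
     for x in b c d e; do
      for y in b c d e; do
       for z in b c d e; do
        # skip arrangements that reuse a drive
        [ "$(printf '%s\n' $w $x $y $z | sort -u | wc -l)" -eq 4 ] || continue
        echo "a $w $x $y $z f g"
       done
      done
     done
    done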

> Thanks so much, I have learned a lot already.

You are welcome, and good luck.

Regards,

Phil




