Re: Linux software RAID assistance

Hi Neil,

Since Simon has responded, let me summarize the assistance I provided per his off-list request:

On 02/14/2011 11:53 PM, NeilBrown wrote:
> On Thu, 10 Feb 2011 16:16:44 +0000 Simon McNair <simonmcnair@xxxxxxxxx> wrote:
> 
>>
>> Hi all
>>
>> I use a 3ware 9500-12 port SATA card (JBOD) which will not work without a
>> 128 MB SODIMM.  The SODIMM socket is flaky and the result is that the
>> machine occasionally crashes.  Yesterday I finally gave in and put
>> together another machine so that I can rsync between them.  When I turned
>> the machine on today to set up rsync, the RAID array was not gone, but
>> corrupted.  Typical...
> 
> Presumably the old machine was called 'ubuntu' and the new machine 'proxmox'
> 
> 
>>
>> I built the array in Aug 2010 using the following command:
>>
>> mdadm --create --verbose /dev/md0 --metadata=1.1 --level=5
>> --raid-devices=10 /dev/sd{b,c,d,e,f,g,h,i,j,k}1 --chunk=64
>>
>> Using LVM, I did the following:
>> pvscan
>> pvcreate -M2 /dev/md0
>> vgcreate lvm-raid /dev/md0
>> vgdisplay lvm-raid
>> vgscan
>> lvscan
>> lvcreate -v -l 100%VG -n RAID lvm-raid
>> lvdisplay /dev/lvm-raid/lvm0
>>
>> I then formatted using:
>> mkfs -t ext4 -v -m .1 -b 4096 -E stride=16,stripe-width=144 
>> /dev/lvm-raid/RAID
>>
>> This worked perfectly since I created the array.  Now mdadm is coming up 
>> with
>>
>> proxmox:/dev/md# mdadm --assemble --scan --verbose
>> mdadm: looking for devices for further assembly
>> mdadm: no recogniseable superblock on /dev/md/ubuntu:0
> 
> And it seems that ubuntu:0 has been successfully assembled.
> It is missing one device for some reason (sdd1) but RAID can cope with that.

The 3ware card is compromised, with a loose buffer-memory DIMM.  Some of its ECC errors were caught and reported in dmesg.  It's likely, given the loose memory socket, that many multi-bit errors got through undetected.

[trim /]

>> mdadm: no uptodate device for slot 8 of /dev/md/proxmox:0
>> mdadm: no uptodate device for slot 9 of /dev/md/proxmox:0
>> mdadm: failed to add /dev/sdd1 to /dev/md/proxmox:0: Invalid argument
>> mdadm: /dev/md/proxmox:0 assembled from 0 drives - not enough to start
>> the array.
> 
> This looks like it is *after* trying the --create command you give
> below.  It is best to report things in the order they happen, else you can
> confuse people (or get caught out!).

Yes, this was after.

>> mdadm: looking for devices for further assembly
>> mdadm: no recogniseable superblock on /dev/sdd
>> mdadm: No arrays found in config file or automatically
>>
>> pvscan and vgscan show nothing.
>>
>> So I tried running mdadm --create --verbose /dev/md0 --metadata=1.1
>> --level=5 --raid-devices=10 missing /dev/sde1 /dev/sdf1 /dev/sdg1
>> /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 --chunk=64
>>
>> as it seemed that /dev/sdd1 failed to be added to the array.  This did 
>> nothing.
> 
> It did not do nothing.  It wrote a superblock to /dev/sdd1 and complained
> that it couldn't write to all the others --- didn't it?

There were multiple --create attempts.  One wrote a superblock to just sdd1; another succeeded on all members except sdd1.

>> dmesg contains:
>>
>> md: invalid superblock checksum on sdd1
> 
> I guess that is why sdd1 was missing from 'ubuntu:0'.  Though as I cannot
> tell if this happened before or after any of the various things reported
> above, it is hard to be sure.
> 
> 
> The  real mystery is why 'pvscan' reports nothing.

The original array was created with mdadm v2.6.7 and had a data offset of 264 sectors.  After Simon's various attempts to --create with mdadm v3.1.4, he ended up with a data offset of 2048 sectors.  The mdadm -E reports he posted to the list showed the 264 offset.  We didn't realize the offset had changed until somewhat later in our troubleshooting.
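The practical effect of that change can be shown with a little shell arithmetic (a sketch; the sector counts are just the two defaults discussed above):

```shell
# md metadata 1.1 sits at the start of each member; the array's data
# begins at the superblock's "data offset", counted in 512-byte sectors.
old_offset=264    # default under mdadm v2.6.7 (original array)
new_offset=2048   # default under mdadm v3.1.4 (recreated array)
echo "old data start: $(( old_offset * 512 )) bytes"
echo "new data start: $(( new_offset * 512 )) bytes"
# Everything written relative to the old offset (the LVM label, the
# filesystem) appears shifted by this much on the recreated array:
echo "shift: $(( (new_offset - old_offset) * 512 )) bytes"
```

So the recreated array presented its contents starting 913408 bytes past where the LVM label and filesystem actually began.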

In any case, pvscan couldn't see the LVM signature because, with the data offset now at 2048 sectors, it was no longer at the start of the assembled array.

> What about
>   pvscan --verbose
> 
> or
> 
>   blkid -p /dev/md/ubuntu:0
> 
> or even
> 
>   dd if=/dev/md/ubuntu:0 count=8 | od -c 

Fortunately, Simon had a copy of his LVM configuration.  With the help of dd, strings, and grep, we located his LVM signature at the correct location on sdd1 (for a data offset of 264).  After a number of attempts to bypass LVM and access his single LV with dmsetup (based on his backed-up configuration, on the newly assembled array less sdd1), I realized that the data offset was wrong on the recreated array and went looking for the cause.  I found your git commit from last spring that changed that logic, and recommended that Simon revert to the default mdadm package for his Ubuntu install, which is v2.6.7.
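The hunt looked roughly like this (a sketch: the LVM2 label begins with the magic string LABELONE and sits in one of the first four 512-byte sectors of the PV, so with a 264-sector data offset it lands just past sector 264 of the member holding the start of the array; the demo below substitutes a scratch image for the real /dev/sdd1):

```shell
# Scratch image standing in for the member disk (hypothetical stand-in).
img=$(mktemp)
dd if=/dev/zero of="$img" bs=512 count=272 2>/dev/null
# Plant an LVM2-style label where the real one sat: PV label sector 1,
# i.e. sector 264 + 1 of the raw member.
printf 'LABELONE' | dd of="$img" bs=512 seek=265 conv=notrunc 2>/dev/null
# The actual hunt: skip past the md data offset and grep the strings.
dd if="$img" bs=512 skip=264 count=8 2>/dev/null | strings | grep LABELONE
rm -f "$img"
```

On the real disk the same skip/grep against /dev/sdd1 turned up the label, confirming the old 264-sector layout.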

Simon has now attempted to recreate the array with v2.6.7, but the controller is throwing too many errors to succeed, and I suggested it was too flaky to trust any further.  Based on the presence of the LVM signature on sdd1, I believe Simon's data is (mostly) intact and only needs a successful create operation with a properly functioning controller.  (He might also need to perform an LVM vgcfgrestore, but he has the necessary backup file.)
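For the record, the recreate-and-restore plan looks roughly like this (a sketch only, not to be run until the replacement controller is in place: the parameters and device order are copied from Simon's original create command, --assume-clean is added to avoid a parity resync, and the backup path is hypothetical; a wrong device order here destroys the data):

```shell
# Recreate with the *old* mdadm (v2.6.7, 264-sector data offset), same
# parameters and same device order as the August 2010 create:
mdadm --create /dev/md0 --metadata=1.1 --level=5 --chunk=64 \
      --raid-devices=10 --assume-clean /dev/sd{b,c,d,e,f,g,h,i,j,k}1
# Restore the volume group metadata from Simon's backup, then activate:
vgcfgrestore -f /path/to/lvm-raid-backup lvm-raid
vgchange -ay lvm-raid
# Check the filesystem read-only before trusting anything:
fsck.ext4 -n /dev/lvm-raid/RAID
```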

A new controller is on order.

Phil
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

