Re: Requesting help recovering my array

Looking further, GPT may simply clear all of the space it could use.

So you may need to look at the first 16k and clear that 16k on the
overlay before the test.
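
A minimal sketch of that, assuming the overlay device shows up as
/dev/mapper/sdX-overlay (both names here are placeholders):

  # keep a copy of the original first 16k of the real disk for reference
  dd if=/dev/sdX of=/root/sdX-first16k.bin bs=1k count=16
  # zero the first 16k on the overlay only; the real disk is untouched
  dd if=/dev/zero of=/dev/mapper/sdX-overlay bs=1k count=16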

On Thu, Jan 25, 2024 at 4:53 PM Roger Heflin <rogerheflin@xxxxxxxxx> wrote:
>
> And given that the partition table probably eliminated the disks'
> superblocks, step #1 would be to overlay sdX (not sdX1), remove the
> partition from the overlay, and begin testing.
>
> The first test may simply be dd if=/dev/zero of=overlaydisk bs=256
> count=8, as looking at the GPT partitions I have suggests that will
> eliminate that header.  Then try an --examine and see if that finds
> anything; if that works you won't need to go into the --assume-clean
> stuff, which simplifies everything.
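>
> As a concrete sketch of that first test (the overlay path is a
> placeholder for whatever your overlay device is called):
>
>   dd if=/dev/zero of=/dev/mapper/sdX-overlay bs=256 count=8
>   mdadm --examine /dev/mapper/sdX-overlay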
>
> My checking says the GPT partition table seems to start 256 bytes in
> and stop at about 2k, before hex 1000 (4k) where the md header seems
> to be located on mine.
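>
> A sketch of one way to eyeball that region (read-only, so it is safe
> on the real disk; sdX is a placeholder):
>
>   dd if=/dev/sdX bs=4k count=2 2>/dev/null | hexdump -C | less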
>
>
>
> On Thu, Jan 25, 2024 at 4:37 PM Roger Heflin <rogerheflin@xxxxxxxxx> wrote:
> >
> > If the one that is working right does not have a partition, then
> > somehow the partitioning got added to the broken disks.  The
> > partition table is GPT, which another poster indicated is about 16k,
> > which would mean it overwrote the md header at 4k, and that header
> > being overwritten would cause the disks to no longer be raid members.
> >
> > The create must have the disks in the correct order and with the
> > correct parameters.  Doing a random create with a random order is
> > unlikely to work, and may well make things unrecoverable.
> >
> > I believe there are instructions on a page about md repair that
> > talks about using overlays.  Using overlays lets you test a number
> > of different orders and parameters without writing to the underlying
> > devices, until you find the combination that works.
> >
> > I think this is the stuff about recovering and overlays that you want to follow.
> >
> > https://raid.wiki.kernel.org/index.php/Irreversible_mdadm_failure_recovery
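> >
> > A minimal overlay sketch along the lines of that page (the file and
> > device names are only examples), assuming /dev/sdX is one member disk:
> >
> >   # sparse file that absorbs any writes made through the overlay
> >   truncate -s 50G /tmp/overlay-sdX
> >   loop=$(losetup -f --show /tmp/overlay-sdX)
> >   size=$(blockdev --getsz /dev/sdX)
> >   # dm snapshot: reads come from sdX, writes land in the loop file
> >   dmsetup create sdX-overlay --table "0 $size snapshot /dev/sdX $loop P 8"
> >   # experiments then run against /dev/mapper/sdX-overlay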
> >
> >
> > On Thu, Jan 25, 2024 at 12:41 PM RJ Marquette <rjm1@xxxxxxxxx> wrote:
> > >
> > > No, this system does not have any other OSes installed on it, only Debian Linux, as it's my server.
> > >
> > > No, the three drives remained connected to the extra controller card and were never removed from that card - I just pulled the card out of the case with the connections intact, and swung it off to the side.  In fact they still haven't been removed.
> > >
> > > I don't understand the partitions comment, as 5 of the 6 drives do appear to have separate partitions for the data, and the one that doesn't is the only one that seems to be responding normally.  I guess the theory is that whatever damaged the partition tables wrote a single primary partition to each drive in the process?
> > >
> > > I do not know what caused this problem.  I've had no reason to run fdisk or any similar utility on that computer in years.  I know we want to figure out why this happened, but I'd also like to recover my RAID, if possible.
> > >
> > > What are my options at this point?  Should I try something like this?  (This is from someone's RAID1 setup; obviously the level and drives would change for me.):
> > >
> > > mdadm --create --assume-clean --level=1 --raid-devices=2 /dev/md0 /dev/sda /dev/sdb
> > >
> > > That's from this page:  https://askubuntu.com/questions/1254561/md-raid-superblock-gets-deleted
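> > >
> > > Presumably the RAID5 equivalent here would be something along the
> > > lines of the command below, though the member count, order and other
> > > parameters are only guesses, and it should only ever be run against
> > > overlay devices, never the real disks:
> > >
> > > mdadm --create --assume-clean --level=5 --raid-devices=5 /dev/md0 \
> > >     /dev/mapper/overlay1 /dev/mapper/overlay2 /dev/mapper/overlay3 \
> > >     /dev/mapper/overlay4 /dev/mapper/overlay5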
> > >
> > > I'm currently running testdisk on one of the affected drives to see what that turns up.
> > >
> > > Thanks.
> > > --RJ
> > >
> > > On Thursday, January 25, 2024 at 12:43:58 PM EST, Roger Heflin <rogerheflin@xxxxxxxxx> wrote:
> > >
> > > You never booted Windows or any other non-Linux boot image that might
> > > have decided to "fix" the disks' missing partition tables?
> > >
> > > And when messing with the install, did you move the disks around so
> > > that some of the disks could have been on the Intel controller, with
> > > RAID enabled, at different times?
> > >
> > > That specific model of Marvell controller does not list RAID support,
> > > but other models in the same family do, so it may also have an option
> > > in the BIOS that "fixes" the partition table.
> > >
> > > Any number of id10ts writing tools may have wrongly decided that a
> > > disk without a partition table needs to be fixed.  I know Windows
> > > disk management used to (and may still) complain about no partitions
> > > and prompt to "fix" it.
> > >
> > > I always put partitions on everything.  Having a partition table has
> > > saved me when two different vendors' hardware raid controllers lost
> > > their config (one in a random crash, one that freaked out on a fw
> > > upgrade) and, when the config was recreated, seemed to "helpfully"
> > > clear a few kB at the front of the disk.  A rescue boot, repartition,
> > > mounting the OS LV, and reinstalling grub fixed those.
> > >
> > > On Thu, Jan 25, 2024 at 9:17 AM RJ Marquette <rjm1@xxxxxxxxx> wrote:
> > > >
> > > > It's an ext4 RAID5 array.  No LVM, LUKS, etc.
> > > >
> > > > You make a good point about the BIOS explanation - it seems to have affected only the 5 RAID drives that had data on them, not the spare, nor the other system drive (and the latter two are both connected to the motherboard).  How would it have decided to grab exactly those 5?
> > > >
> > > > Thanks.
> > > > --RJ
> > > >
> > > >
> > > > On Thursday, January 25, 2024 at 10:01:40 AM EST, Pascal Hambourg <pascal@xxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On 25/01/2024 at 12:49, RJ Marquette wrote:
> > > > > root@jackie:/home/rj# /sbin/fdisk -l /dev/sdb
> > > > > Disk /dev/sdb: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
> > > > > Disk model: Hitachi HUS72403
> > > > > Units: sectors of 1 * 512 = 512 bytes
> > > > > Sector size (logical/physical): 512 bytes / 4096 bytes
> > > > > I/O size (minimum/optimal): 4096 bytes / 4096 bytes
> > > > > Disklabel type: gpt
> > > > > Disk identifier: AF5DC5DE-1404-4F4F-85AF-B5574CD9C627
> > > > >
> > > > > Device    Start        End    Sectors  Size Type
> > > > > /dev/sdb1  2048 5860532223 5860530176  2.7T Microsoft basic data
> > > > >
> > > > > root@jackie:/home/rj# cat /sys/block/sdb/sdb1/start
> > > > > 2048
> > > > > root@jackie:/home/rj# cat /sys/block/sdb/sdb1/size
> > > > > 5860530176
> > > >
> > > > The partition geometry looks correct, with standard alignment,
> > > > and the kernel's view of the partition matches the partition table.
> > > > The partition type "Microsoft basic data" is neither "Linux RAID" nor
> > > > the default type "Linux filesystem" set by the usual GNU/Linux
> > > > partitioning tools such as fdisk, parted and gdisk, so it seems
> > > > unlikely that the partition was created with one of these tools.
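> > > >
> > > > (As a quick sketch, one way to check the exact type GUID, assuming
> > > > partition 1 on /dev/sdb is the one in question:
> > > >    sgdisk -i 1 /dev/sdb
> > > > It prints the partition type GUID; "Microsoft basic data" is
> > > > EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, whereas a "Linux RAID"
> > > > partition would be A19D880F-05FC-4D3B-A006-743F0F84911E.)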
> > > >
> > > >
> > > > >>> It looks like this is what happened after all.  I searched for "MBR
> > > > >>> Magic aa55" and found someone else with the same issue long ago:
> > > > >>> https://serverfault.com/questions/580761/is-mdadm-raid-toast  Looks like
> > > > >>> his was caused by a RAID configuration option in BIOS.  I recall seeing
> > > > >>> that on mine; I must have activated it by accident when setting the boot
> > > > >>> drive or something.
> > > >
> > > >
> > > > I am a bit suspicious about this cause for two reasons:
> > > > - sde, sdf and sdg are affected even though they are connected to the
> > > > add-on Marvell SATA controller card which is supposed to be outside the
> > > > motherboard RAID scope;
> > > > - sdc is not affected even though it is connected to the onboard Intel
> > > > SATA controller.
> > > >
> > > > What was the content type of the RAID array?  LVM, LUKS, plain filesystem?
> > > >
> > >




