Re: Requesting help recovering my array

Quick follow-up: when I rebooted, the partition tables got munged again, so it's definitely a BIOS issue.  I have a 10TB drive on order; I'll copy everything off, then rebuild the array in the recommended format, with partitions (though one wonders whether I even need an array when a single drive can hold everything...), and see what happens then.
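Roughly what I have in mind for the rebuild once the copy finishes (a sketch only; the device names are placeholders and will likely enumerate differently):

# give each member disk a single "Linux RAID" partition first
parted -s /dev/sde mklabel gpt mkpart raid 1MiB 100% set 1 raid on   # repeat per disk

# then build the array on the partitions, not the bare disks
mdadm --create /dev/md0 --level=5 --chunk=512K --raid-devices=5 \
      /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sda1 /dev/sdb1
mkfs.ext4 /dev/md0

# record the new array line (and its UUID) so it assembles on boot
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u      # Debian: refresh the initramfs copy of mdadm.conf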

Thanks.
--RJ


On Friday, January 26, 2024 at 11:03:38 AM EST, RJ Marquette <rjm1@xxxxxxxxx> wrote: 





HOLY...  I GOT IT!  WOW.

I guess all I needed to do was abandon hope.

The magic spell that worked (sorry, I know it's not magic to the developers on this list, but it definitely feels that way to me right now):

mdadm --create /dev/md0 --level=5 --chunk=512K --data-offset=262144s --raid-devices=5 /dev/sde /dev/sdf /dev/sdg /dev/sda /dev/sdb --assume-clean --readonly

I'm copying the stuff I thought I lost now, while it's still accessible in RO mode!

The previous command I tried had sda and sdb swapped, and I got a different error when I tried to mount it.  The earlier attempts had all failed with the "bad superblock" error on mount, but that one complained about filesystem errors instead.  So I made a note of it, tried the order above, and it mounted cleanly.  No concerns at all.
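Between attempts, a read-only probe along these lines can separate a promising ordering from a bad one without risking anything (just a sketch; /mnt/check is made up):

mdadm --detail /dev/md0            # confirm level, chunk size, and device order
fsck.ext4 -n /dev/md0              # -n = check only, never write
mkdir -p /mnt/check
mount -o ro /dev/md0 /mnt/check    # read-only mount to eyeball the files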

(Sorry, I'm sure this is normally a very stoic list, but, I'm sure you understand my excitement here.)

I assume that, once I'm comfortable with my backups, I can unmount it, stop it, and then use mdadm --assemble to bring it up again with read/write access.  I've already updated mdadm.conf with the new UUID, so I think it should work automatically then.
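Roughly the sequence I have in mind (a sketch; the mount point is made up):

umount /mnt/raid
mdadm --stop /dev/md0
mdadm --assemble --scan            # picks up the ARRAY/UUID line from mdadm.conf
# or name the members explicitly:
# mdadm --assemble /dev/md0 /dev/sde /dev/sdf /dev/sdg /dev/sda /dev/sdb
mount /dev/md0 /mnt/raid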

Yes, I will review my backup plan and make improvements.  The plan was good, in that I wasn't going to lose most of the critical stuff, but not great, in that I was going to lose some.

(And, yes, I still wonder why it happened in the first place, so I will keep that in mind.  My guess is that if the array survives the next reboot, it'll probably be fine for a long time.  If it doesn't, then it's clearly something in the motherboard trashing it.  Not great, but at least now I'll know how to resurrect it when I swap in the previous motherboard.)

THANK YOU EVERYONE!

--RJ




On Friday, January 26, 2024 at 10:15:16 AM EST, RJ Marquette <rjm1@xxxxxxxxx> wrote: 





Thanks.

I'm not following all of what you're saying, and I suspect I'm beyond the point where it's going to help.  I tried the instructions on the page you linked, but I haven't hit the right combination of parameters and drive order to get a valid ext4 filesystem.  Sigh.  I'll admit I couldn't get the overlay working, and since I figured I was already unlikely to be able to recreate the array, I went at the drives directly, still using --readonly and --assume-clean.  Of course that overwrites what was left of the partition tables, which I figured was no big loss at this point; it didn't sound like my chances of recovery were very high anyway.

Basically I'm trying variations on this command: removing the offset, using sd[x] instead of sd[x]1, changing the order of a and b, etc.  Would it be critical for the spare drive to be included?  I can't see why it would be, so I've been leaving it out during these tests.  (The chunk size and offset came from the info on the spare drive.)

mdadm --create /dev/md0 --level=5 --chunk=512K --data-offset=262144s --raid-devices=5 /dev/sda1 /dev/sdb1 /dev/sde1 /dev/sdf1 /dev/sdg1 --assume-clean --readonly
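For reference, the overlay setup from the wiki that I couldn't get working is roughly along these lines (I'm not certain this is exactly right, and I'm guessing at the spare's device name):

# read chunk size and data offset from the spare's intact superblock
mdadm --examine /dev/sdc           # look for the "Chunk Size" and "Data Offset" lines

# copy-on-write overlays so trial --create runs never touch the real disks
for d in sda sdb sde sdf sdg; do
    truncate -s 10G overlay-$d                        # sparse file; size to taste
    loop=$(losetup -f --show overlay-$d)
    size=$(blockdev --getsz /dev/$d)
    dmsetup create $d-ovl --table "0 $size snapshot /dev/$d $loop P 8"
done

# then aim the trial creates at /dev/mapper/sdX-ovl instead of /dev/sdX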

I really do wish I knew what happened to these drives (and I'm sure this group does, too).  I'm pretty sure something in the BIOS caused it; I've seen a few reports of motherboards from this era (~2015), other brands, with UEFI bugs that caused issues like this, but I couldn't find anything specific to ASUS.  Someone mentioned the possibility of something happening while the system was running that would only have shown up when I rebooted (even on the old board), but it hadn't been that long since my previous reboot (~30 days IIRC), and as I mentioned, I've had no cause to modify anything related to the system aside from updating the software with apt.

After I recreate the array, I'll throw some data on it, and then reboot the computer to see what happens.  It'll probably be fine, but if it's munged again...well...that will certainly be an interesting outcome, to say the least.

Thanks for the suggestions, everyone.  I'm not planning to overwrite the array right this moment, so if you have other suggestions on things to try, I'm open to them.

Other thoughts as I process this:

I dunno.  At one point a few weeks ago, my /var partition (which isn't on the array) filled up, so that caused some weird issues with various things until I figured out what was going on.  I doubt that was a factor here, and the array was working just fine after the cleanup.

I do have backups of most of the stuff that was on the array.  I will lose a bunch of our ripped DVDs and Blu-Rays, which will be a headache to re-rip but isn't truly lost.  I believe I have copies of all of our recent pictures on my laptop or desktop machine; older stuff is stored on Amazon Glacier if needed, but I think I have a local copy of most of it.

I see a few groups of pictures I may have lost completely.  They were too new for being uploaded to Glacier, but too old to still be on my desktop or laptop.  I don't seem to have posted them online, either.  (I'm checking my Glacier inventory now to see if I did upload them at some point, but it's unlikely.)

I did lose some data for a hobby project, which is not at all critical or important in the grand scheme of things, but it is a bummer.  I'd even been thinking about how I should back up that data.  Fortunately, I always entered the most important information into a database as new data came in, so I at least have that.

I used to have a script that backed up the array to a 3TB drive in my desktop machine, but with 7TB used on the array versus 3TB of backup space, there was an obvious problem developing, so I stopped it a few years back instead of paring down what got copied over.  Dang.

Thanks.
--RJ





On Thursday, January 25, 2024 at 06:01:18 PM EST, Roger Heflin <rogerheflin@xxxxxxxxx> wrote: 





Looking further, GPT may simply clear everything it could use.

So you may need to look at the first 16k, and clear that 16k on the
overlay before the test.
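Something like this, against the overlay devices only, never the real disks (the overlay name below is just an example):

dd if=/dev/mapper/sda-ovl bs=4096 count=4 | hexdump -C | head -40   # peek at what's actually there
dd if=/dev/zero of=/dev/mapper/sda-ovl bs=4096 count=4              # wipe the first 16k of the overlay copy
mdadm --examine /dev/mapper/sda-ovl                                 # see if the md superblock shows up again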

On Thu, Jan 25, 2024 at 4:53 PM Roger Heflin <rogerheflin@xxxxxxxxx> wrote:
>
> And given that the partition table probably eliminated the on-disk
> superblocks, step #1 would be to overlay sdX (not sdX1), remove the
> partition from the overlay, and begin testing.
>
> The first test may simply be the dd if=/dev/zero of=overlaydisk bs=256
> count=8, as looking at the gpt partitions I have says that will
> eliminate that header.  Then try an --examine and see if it finds
> anything; if that works you won't need to go into the assume-clean
> stuff, which simplifies everything.
>
> My checking says the gpt partition table seems to start 256 bytes in
> and stops at about 2k, before hex 1000 (4k) where the md header
> seems to be located on mine.
>
>
>
> On Thu, Jan 25, 2024 at 4:37 PM Roger Heflin <rogerheflin@xxxxxxxxx> wrote:
> >
> > If the one that is working right does not have a partition, then
> > somehow the partitioning got added to the broken disks.  The
> > partition is a gpt partition, which another poster indicated is about
> > 16k; that would mean it overwrote the md header at 4k, and that header
> > being overwritten would cause the disks to no longer be raid members.
> >
> > The create must have the disks in the correct order and with the
> > correct parameters.  Doing a random create with a random order is
> > unlikely to work, and may well make things unrecoverable.
> >
> > I believe there are instructions on some page about md repair that
> > talk about using overlays.  The overlays mean the underlying devices
> > aren't written to, so you can test a number of different orders and
> > parameters to find the combination that works.
> >
> > I think this is the stuff about recovering and overlays that you want to follow.
> >
> > https://raid.wiki.kernel.org/index.php/Irreversible_mdadm_failure_recovery
> >
> >
> > On Thu, Jan 25, 2024 at 12:41 PM RJ Marquette <rjm1@xxxxxxxxx> wrote:
> > >
> > > No, this system does not have any other OS's installed on it, Debian Linux only, as it's my server.
> > >
> > > No, the three drives remained connected to the extra controller card and were never removed from that card - I just pulled the card out of the case with the connections intact, and swung it off to the side.  In fact they still haven't been removed.
> > >
> > > I don't understand the partitions comment, as 5 of the 6 drives do appear to have separate partitions for the data, and the one that doesn't is the only one that seems to be responding normally.  I guess the theory is that whatever damaged the partition tables wrote a single primary partition to the drive in the process?
> > >
> > > I do not know what caused this problem.  I've had no reason to run fdisk or any similar utility on that computer in years.  I know we want to figure out why this happened, but I'd also like to recover my RAID, if possible.
> > >
> > > What are my options at this point?  Should I try something like this?  (This is from someone's RAID1 setup; obviously the level and drives would change for me):
> > >
> > > mdadm --create --assume-clean --level=1 --raid-devices=2 /dev/md0 /dev/sda /dev/sdb
> > >
> > > That's from this page:  https://askubuntu.com/questions/1254561/md-raid-superblock-gets-deleted
> > >
> > > I'm currently running testdisk on one of the affected drives to see what that turns up.
> > >
> > > Thanks.
> > > --RJ
> > >
> > > On Thursday, January 25, 2024 at 12:43:58 PM EST, Roger Heflin <rogerheflin@xxxxxxxxx> wrote:
> > >
> > >
> > >
> > >
> > >
> > > You never booted windows or any other non-linux boot image that might
> > > have decided to "fix" the disk's missing partition tables?
> > >
> > > And when messing with the install did you move the disks around so
> > > that some of the disk could have been on the intel controller with
> > > raid set at different times?
> > >
> > > That specific model of marvell controller does not list raid support,
> > > but other models in the same family do, so it may also have an option
> > > in the bios that "fixes" the partition table.
> > >
> > > Any number of id10t-written tools may have wrongly decided that a
> > > disk without a partition table needs to be fixed.  I know the windows
> > > disk management used to (and may still) complain about there being no
> > > partitions and prompt to "fix" it.
> > >
> > > I always run partitions on everything.  I have had the partition save
> > > me when 2 different vendors' hardware raid controllers lost their
> > > config (random crash, freaked out on a fw upgrade) and, when the config
> > > was recreated, seemed to "helpfully" clear a few kb at the front of the
> > > disk.  Rescue boot, repartition, mount the os lv, and reinstall grub
> > > fixed those.
> > >
> > > On Thu, Jan 25, 2024 at 9:17 AM RJ Marquette <rjm1@xxxxxxxxx> wrote:
> > > >
> > > > It's an ext4 RAID5 array.  No LVM, LUKS, etc.
> > > >
> > > > You make a good point about the BIOS explanation - it seems to have affected only the 5 RAID drives that had data on them, not the spare, nor the other system drive (and the latter two are both connected to the motherboard).  How would it have decided to grab exactly those 5?
> > > >
> > > > Thanks.
> > > > --RJ
> > > >
> > > >
> > > > On Thursday, January 25, 2024 at 10:01:40 AM EST, Pascal Hambourg <pascal@xxxxxxxxxxxxxxx> wrote:
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On 25/01/2024 at 12:49, RJ Marquette wrote:
> > > > > root@jackie:/home/rj# /sbin/fdisk -l /dev/sdb
> > > > > Disk /dev/sdb: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
> > > > > Disk model: Hitachi HUS72403
> > > > > Units: sectors of 1 * 512 = 512 bytes
> > > > > Sector size (logical/physical): 512 bytes / 4096 bytes
> > > > > I/O size (minimum/optimal): 4096 bytes / 4096 bytes
> > > > > Disklabel type: gpt
> > > > > Disk identifier: AF5DC5DE-1404-4F4F-85AF-B5574CD9C627
> > > > >
> > > > > Device    Start        End    Sectors  Size Type
> > > > > /dev/sdb1  2048 5860532223 5860530176  2.7T Microsoft basic data
> > > > >
> > > > > root@jackie:/home/rj# cat /sys/block/sdb/sdb1/start
> > > > > 2048
> > > > > root@jackie:/home/rj# cat /sys/block/sdb/sdb1/size
> > > > > 5860530176
> > > >
> > > > The partition geometry looks correct, with standard alignment.
> > > > And the kernel view of the partition matches the partition table.
> > > > The partition type "Microsoft basic data" is neither "Linux RAID" nor
> > > > the default type "Linux filesystem" set by the usual GNU/Linux
> > > > partitioning tools such as fdisk, parted and gdisk, so it seems
> > > > unlikely that the partition was created with one of these tools.
> > > >
> > > >
> > > > >>> It looks like this is what happened after all.  I searched for "MBR
> > > > >>> Magic aa55" and found someone else with the same issue long ago:
> > > > >>> https://serverfault.com/questions/580761/is-mdadm-raid-toast ; Looks like
> > > > >>> his was caused by a RAID configuration option in BIOS.  I recall seeing
> > > > >>> that on mine; I must have activated it by accident when setting the boot
> > > > >>> drive or something.
> > > >
> > > >
> > > > I am a bit suspicious about this cause for two reasons:
> > > > - sde, sdf and sdg are affected even though they are connected to the
> > > > add-on Marvell SATA controller card which is supposed to be outside the
> > > > motherboard RAID scope;
> > > > - sdc is not affected even though it is connected to the onboard Intel
> > > > SATA controller.
> > > >
> > > > What was the content type of the RAID array? LVM, LUKS, plain filesystem?
> > > >
> > >




