Re: RAID5 failure and consequent ext4 problems

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Thanks for reaching out, first of all. Apologies for the late reply,
the brilliant (...) spam filter strikes again...

On Thu, Sep 8, 2022 at 1:23 PM Phil Turmel <philip@xxxxxxxxxx> wrote:
> No, the moment of stupid was that you re-created the array.
> Simultaneous multi-drive failures that stop an array are easily fixed
> with --assemble --force.  Too late for that now.
Noted for the future, thanks.

> It is absurdly easy to screw up device order when re-creating, and if
> you didn't specify every allocation and layout detail, the changes in
> defaults over the years would also screw up your data.  And finally,
> omitting --assume-clean would cause all of your parity to be
> recalculated immediately, with catastrophic results if any order or
> allocation attributes are wrong.
Of course. Which is why I specified everything and why I checked the
details with --examine and --detail and they match exactly, minus the
metadata version because, well, I wasn't actually the one typing (it's
a slightly complicated story.. I was reassembling by proxy on the
phone) and I made an incorrect assumption about the person typing.
There aren't, in the end, THAT many things to specify: RAID level,
number of drives, order thereof, chunk size, 'layout' and metadata
version. 0.90 doesn't allow before/after gaps so that should be it, I
believe.
Am I missing anything?

> No, you just got lucky in the past.  Probably by using mdadm versions
> that hadn't been updated.
That's not quite it: I keep records of how arrays are built and match
them, though it is true that I tend to update things as little as
possible on production machines.
One of the differences, this time, is that this was NOT a production
machine. The other was that I was driving, dictating on the phone and
was under a lot of pressure to get the thing back up ASAP.
Nonetheless, I have an --examine of at least two drives from the
previous setup so there should be enough information there to rebuild
a matching array, I think?

> You'll need to show us every command you tried from your history, and
> full details of all drives/partitions involved.
>
> But I'll be brutally honest:  your data is likely toast.
Well, let's hope it isn't. All mdadm commands were -o and
--assume-clean, so in theory the only thing which HAS been written are
the md blocks, unless I am mistaken and/or I read the docs
incorrectly?

That does, of course, leave the problem of the blocks overwritten by
the 1.2 metadata, but as I read the docs that should be a very small
number - let's say one 4096byte block (a portion thereof, to be
pedantic, but ext4 doesn't really care?) per drive, correct?

Background:
Separate 2x SSD RAID 1 root (/dev/sda. /dev/sdb) on the MB (Supemicro
X10 series)'s chipset SATA ports.
All filesystems are ext4, data=journal, nodelalloc, the 'data' RAIDs
have journals on another SSD RAID1 (one per FS, obviously).
Data drives:
12 x 4'TB' Seagate drives, NC000n variety, on 2x LSI 2308 controllers,
each with two four-drive ports (and one of these went DELIGHTFULLY
missing)

This is the layout of each drive:
---
GPT fdisk (gdisk) version 1.0.6
...
Found valid GPT with protective MBR; using GPT.
Disk /dev/sdc: 7814037168 sectors, 3.6 TiB
Model: ST4000NC001-1FS1
Sector size (logical/physical): 512/4096 bytes
...
Total free space is 99949 sectors (48.8 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      7625195519   3.5 TiB     8300  Linux RAID volume
   2      7625195520      7813939199   90.0 GiB    8300  Linux RAID backup
---

So there were two RAID arrays. Both RAID5 - a main RAID called
'archive' which had the 12 x 3.5ish partitions sdx1 and a second array
called backup which had 12 x 90 GB.

A little further backstory: right before the event, one drive had been
pulled because it had started failing. What I did was shut down the
machine, put the failing drive on a MB port and put a new drive on the
LSI controllers. I then brought the machine back online, did the
--replace --with thing and this worked fine.
At that point the faulty drive (/dev/sdc, MB drives come before the
LSI drives in the count) got deleted via /sys/block.... and physically
disconnected from the system, which was then happily running with
/dev/sda and /dev/sdb as the root RAID SSDs and drives sdd -> sdo as
the 'archive' drives.
It went 96 hours or so like that under moderate load. Then the failure
happened, the machine was rebooted thus the previous sdd -> sdo drives
became sdc -> sdn drives.
However, the relative order was, to the best of my knowledge,
conserved - AND I still have the 'faulty' drive, so I could very
easily put it back in to have everything match.
Most importantly, this drive has on it, without a doubt, the details
of the array BEFORE everything happened - by definition untouched
because the drive was stopped and pulled before the event.
I also have a cat of the --examine of two of the faulty drives BEFORE
anything was written to them - thus, unless I am mistaken, these
contained the md block details from 'before the event'.

Here is one of them, taken after the reboot and therefore when the MB
/dev/sdc was no longer there:
---
/dev/sdc1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 2457b506:85728e9d:c44c77eb:7ee19756
  Creation Time : Sat Mar 30 18:18:00 2019
     Raid Level : raid5
  Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
     Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
   Raid Devices : 12
  Total Devices : 12
Preferred Minor : 123

    Update Time : Tue Sep  6 11:37:53 2022
          State : clean
 Active Devices : 12
Working Devices : 12
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 391e325d - correct
         Events : 52177

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     5       8       49        5      active sync   /dev/sdd1

   0     0       8      225        0      active sync
   1     1       8       81        1      active sync   /dev/sdf1
   2     2       8       97        2      active sync   /dev/sdg1
   3     3       8      161        3      active sync   /dev/sdk1
   4     4       8      113        4      active sync   /dev/sdh1
   5     5       8       49        5      active sync   /dev/sdd1
   6     6       8      177        6      active sync   /dev/sdl1
   7     7       8      145        7      active sync   /dev/sdj1
   8     8       8      129        8      active sync   /dev/sdi1
   9     9       8       65        9      active sync   /dev/sde1
  10    10       8      209       10      active sync   /dev/sdn1
  11    11       8      193       11      active sync   /dev/sdm1
---
Note that the drives are 'moved' because the old /dev/sdc isn't there
any more but the relative position should be the same, correct me if I
am wrong. If you prefer, what you need to do to get the 'new' drive
letter is to take 16 out of the minor of each of the drives.

This is the 'new' --create
---
/dev/sdc1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 79990944:0bb9420b:97d5a417:7d4e9ef8 (local to host beehive)
  Creation Time : Tue Sep  6 15:15:03 2022
     Raid Level : raid5
  Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
     Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
   Raid Devices : 12
  Total Devices : 12
Preferred Minor : 123

    Update Time : Tue Sep  6 15:15:03 2022
          State : clean
 Active Devices : 12
Working Devices : 12
 Failed Devices : 0
  Spare Devices : 0
       Checksum : ed12b96a - correct
         Events : 1

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     5       8       33        5      active sync   /dev/sdc1

   0     0       8      209        0      active sync   /dev/sdn1
   1     1       8       65        1      active sync   /dev/sde1
   2     2       8       81        2      active sync   /dev/sdf1
   3     3       8      145        3      active sync   /dev/sdj1
   4     4       8       97        4      active sync   /dev/sdg1
   5     5       8       33        5      active sync   /dev/sdc1
   6     6       8      161        6      active sync   /dev/sdk1
   7     7       8      129        7      active sync   /dev/sdi1
   8     8       8      113        8      active sync   /dev/sdh1
   9     9       8       49        9      active sync   /dev/sdd1
  10    10       8      193       10      active sync   /dev/sdm1
  11    11       8      177       11      active sync   /dev/sdl1
---

If you put the layout lines side by side, it would seem to me that
they match, modulo the '16' difference.

This is the list of --create and --assemble commands from the 6th
which involve the sdx1 partitions, those we care about right now -
there were others involving /dev/md124 and the /dev/sdx2 which however
are not relevant - the data there :
--
 9813  mdadm --assemble /dev/md123 missing
 9814  mdadm --assemble /dev/md123 missing /dev/sdf1 /dev/sdg1
/dev/sdk1 /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1
/dev/sdn1 /dev/sdm1
 9815  mdadm --assemble /dev/md123 /dev/sdf1 /dev/sdg1 /dev/sdk1
/dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 /dev/sdn1
/dev/sdm1
 9823  mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1
/dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdd1
/dev/sdm1 /dev/sdl1
 9824  mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1
/dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1
/dev/sdd1 /dev/sdm1 /dev/sdl1
^^^^ note that these were the WRONG ARRAY - this was an unfortunate
miscommunication which caused potential damage.
 9852  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
--chunk=128 /dev/md123 /dev/sdn1 /dev/sdd1 /dev/sdf1 /dev/sde1
/dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1
 9863  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
--chunk=128 /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 /dev/sdf1
/dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1
/dev/sdl1
 9879  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
--chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1
/dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1
/dev/sdk1 /dev/sdl1
 9889  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
--chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
/dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
/dev/sdm1 /dev/sdl1
 9892  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
--chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
/dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
/dev/sdm1 /dev/sdl1
 9895  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
--chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
/dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
/dev/sdm1 /dev/sdl1
 9901  mdadm --assemble /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
/dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
/dev/sdm1 /dev/sdl1
 9903  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
--chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
/dev/sdj1 /dev/sdg1 / dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
/dev/sdm1 /dev/sdl1
---

Note that they all were -o, therefore if I am not mistaken no parity
data was written anywhere. Note further the fact that the first two
were the 'mistake' ones, which did NOT have --assume-clean (but with
-o this shouldn't make a difference AFAIK) and most importantly the
metadata was the 1.2 default AND they were the wrong array in the
first place.
Note also that the 'final' --create commands also had --bitmap=none to
match the original array, though according to the docs the bitmap
space in 0.90 (and 1.2?) is in a space which does not affect the data
in the first place.

Now, first of all a question: if I get the 'old' sdc, the one that was
taken out prior to this whole mess, onto a different system in order
to examine it, the modern mdraid auto discovery shoud NOT overwrite
the md data, correct? Thus I should be able to double-check the drive
order on that as well?

Any other pointers, insults etc are of course welcome.



[Index of Archives]     [Linux RAID Wiki]     [ATA RAID]     [Linux SCSI Target Infrastructure]     [Linux Block]     [Linux IDE]     [Linux SCSI]     [Linux Hams]     [Device Mapper]     [Device Mapper Cryptographics]     [Kernel]     [Linux Admin]     [Linux Net]     [GFS]     [RPM]     [git]     [Yosemite Forum]


  Powered by Linux