Re: RAID5 failure and consequent ext4 problems

A further question: in THIS boot's log I found:
[ 9874.709903] md/raid:md123: raid level 5 active with 12 out of 12 devices, algorithm 2
[ 9874.710249] md123: bitmap file is out of date (0 < 1) -- forcing full recovery
[ 9874.714178] md123: bitmap file is out of date, doing full recovery
[ 9874.881106] md123: detected capacity change from 0 to 42945088192512
That is, I think, from the second --create of /dev/md123, before I
added --bitmap=none. It should, however, not have written anything
given -o and --assume-clean, correct?
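
(For my own peace of mind, this is roughly how I am verifying that no
resync is in flight - the md name is as above; the last line merely
freezes any sync the kernel might want to start:)
---
cat /proc/mdstat
mdadm --detail /dev/md123 | grep -iE 'state|resync|rebuild'
echo frozen > /sys/block/md123/md/sync_action
---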

On Fri, Sep 9, 2022 at 6:50 PM Luigi Fabio <luigi.fabio@xxxxxxxxx> wrote:
>
> By different kernels, maybe - but the kernel has been the same for
> quite a while (months).
>
> I did paste all of the command lines in the (very long) email, as
> David mentions (thanks!) - the first ones, the mistaken ones, did NOT
> have --assume-clean, but they did have -o, so according to the docs
> no parity activity should have started?
> A new thought came to mind: one of the HBAs lost a channel, right?
> What if, on the subsequent reboot, the devices that were on that
> channel got 'rediscovered' and shunted to the end of the letter order?
> That would, I believe, be ordinary operating procedure for the kernel.
> That would give us an almost-correct array, which would explain how
> fsck can get ... some pieces.
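>
> (One way to make the letter order irrelevant is to key everything on
> serial numbers - nothing below writes to the disks:)
> ---
> lsblk -o NAME,SERIAL,SIZE,MODEL
> ls -l /dev/disk/by-id/ | grep -v part    # stable names derived from the serials
> ---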
>
> Also, I am not quite brave enough (...) to use shortcuts when handling
> mdadm commands.
>
> I am reconstructing the port order (SCSI targets, if you prefer) from
> the 20220904 boot log. That should give me the exact order of the
> drives.
>
> Here it is:
>
> ---
> [    1.853329] sd 2:0:0:0: [sda] Write Protect is off
> [    1.853331] sd 7:0:0:0: [sdc] Write Protect is off
> [    1.853382] sd 3:0:0:0: [sdb] Write Protect is off
> [   12.531607] sd 10:0:3:0: [sdg] Write Protect is off
> [   12.533303] sd 10:0:2:0: [sdf] Write Protect is off
> [   12.534606] sd 10:0:0:0: [sdd] Write Protect is off
> [   12.570768] sd 10:0:1:0: [sde] Write Protect is off
> [   12.959925] sd 11:0:0:0: [sdh] Write Protect is off
> [   12.965230] sd 11:0:1:0: [sdi] Write Protect is off
> [   12.966145] sd 11:0:4:0: [sdl] Write Protect is off
> [   12.966800] sd 11:0:3:0: [sdk] Write Protect is off
> [   12.997253] sd 11:0:2:0: [sdj] Write Protect is off
> [   13.002395] sd 11:0:7:0: [sdo] Write Protect is off
> [   13.012693] sd 11:0:5:0: [sdm] Write Protect is off
> [   13.017630] sd 11:0:6:0: [sdn] Write Protect is off
> ---
> If we combine this with the previous:
> ---
> [   13.528395] md/raid:md123: device sdd1 operational as raid disk 5
> [   13.528396] md/raid:md123: device sde1 operational as raid disk 9
> [   13.528397] md/raid:md123: device sdg1 operational as raid disk 2
> [   13.528398] md/raid:md123: device sdf1 operational as raid disk 1
> [   13.528398] md/raid:md123: device sdh1 operational as raid disk 4
> [   13.528399] md/raid:md123: device sdk1 operational as raid disk 3
> [   13.528400] md/raid:md123: device sdj1 operational as raid disk 7
> [   13.528401] md/raid:md123: device sdn1 operational as raid disk 10
> [   13.528402] md/raid:md123: device sdi1 operational as raid disk 8
> [   13.528402] md/raid:md123: device sdl1 operational as raid disk 6
> [   13.528403] md/raid:md123: device sdm1 operational as raid disk 11
> [   13.528403] md/raid:md123: device sdc1 operational as raid disk 0
> [   13.531613] md/raid:md123: raid level 5 active with 12 out of 12
> devices, algorithm 2
> [   13.531644] md123: detected capacity change from 0 to 42945088192512
> ---
> That gives us a SCSI target -> raid disk number correspondence.
> As of this boot, the letter -> SCSI target correspondences match,
> shifted by one because, as discussed, 7:0:0:0 is no longer there (the
> old, 'faulty' sdc).
> Thus, having unambiguously determined the prior SCSI target -> raid
> position mapping, we can transpose it to the present drive letters,
> which are shifted by one.
> Therefore, we can generate - or rather, have generated - a --create
> with the same software versions, the same settings and the same drive
> order. Is there any reason why, apart from the 1.2 metadata overwrite,
> which should have affected only 12 blocks, the fs should 'not' be as
> it was before?
> Genuine question, mind.
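>
> (For reference, this is roughly how I am pulling the two mappings out
> of the saved logs - the log file name is made up; the patterns match
> the lines quoted above:)
> ---
> # SCSI target -> drive letter, from the sd probe lines
> grep 'Write Protect is off' boot-20220904.log | \
>   sed -E 's/.*sd ([0-9:]+): \[(sd[a-z]+)\].*/\1 \2/'
> # drive letter -> raid slot, from the md assembly lines
> grep 'operational as raid disk' boot-20220904.log | \
>   sed -E 's/.*device (sd[a-z]+1) operational as raid disk ([0-9]+).*/\1 \2/'
> ---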
>
> On Fri, Sep 9, 2022 at 5:48 PM Phil Turmel <philip@xxxxxxxxxx> wrote:
> >
> > Reasonably likely, but not certain.
> >
> > Devices can be re-ordered by different kernels.  That's why lsdrv prints
> > serial numbers in its tree.
> >
> > You haven't mentioned whether your --create operations specified
> > --assume-clean.
> >
> > Also, be aware that shell expansion of something like /dev/sd[dcbaefgh]
> > is sorted to /dev/sd[abcdefgh].  Use curly brace expansion with commas
> > if you are taking shortcuts.
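> >
> > (Concretely, with a hypothetical four-drive example:)
> >
> >    $ echo /dev/sd[dcba]1
> >    /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1      <- glob results come back sorted
> >    $ echo /dev/sd{d,c,b,a}1
> >    /dev/sdd1 /dev/sdc1 /dev/sdb1 /dev/sda1      <- brace expansion keeps your order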
> >
> > On 9/9/22 17:01, Luigi Fabio wrote:
> > > Another helpful datapoint: this is the boot *before* sdc got
> > > --replaced with sdo:
> > >
> > > [   13.528395] md/raid:md123: device sdd1 operational as raid disk 5
> > > [   13.528396] md/raid:md123: device sde1 operational as raid disk 9
> > > [   13.528397] md/raid:md123: device sdg1 operational as raid disk 2
> > > [   13.528398] md/raid:md123: device sdf1 operational as raid disk 1
> > > [   13.528398] md/raid:md123: device sdh1 operational as raid disk 4
> > > [   13.528399] md/raid:md123: device sdk1 operational as raid disk 3
> > > [   13.528400] md/raid:md123: device sdj1 operational as raid disk 7
> > > [   13.528401] md/raid:md123: device sdn1 operational as raid disk 10
> > > [   13.528402] md/raid:md123: device sdi1 operational as raid disk 8
> > > [   13.528402] md/raid:md123: device sdl1 operational as raid disk 6
> > > [   13.528403] md/raid:md123: device sdm1 operational as raid disk 11
> > > [   13.528403] md/raid:md123: device sdc1 operational as raid disk 0
> > > [   13.531613] md/raid:md123: raid level 5 active with 12 out of 12
> > > devices, algorithm 2
> > > [   13.531644] md123: detected capacity change from 0 to 42945088192512
> > >
> > > This gives us, correct me if I am wrong of course, an exact
> > > representation of what the array 'used to look like', with sdc1 then
> > > replaced by sdo1 (8/225).
> > >
> > > Just looking for confirmation that the order should (?) be the one above.
> > >
> > > LF
> > >
> > > On Fri, Sep 9, 2022 at 4:32 PM Luigi Fabio <luigi.fabio@xxxxxxxxx> wrote:
> > >>
> > >> Thanks for reaching out, first of all. Apologies for the late reply,
> > >> the brilliant (...) spam filter strikes again...
> > >>
> > >> On Thu, Sep 8, 2022 at 1:23 PM Phil Turmel <philip@xxxxxxxxxx> wrote:
> > >>> No, the moment of stupid was that you re-created the array.
> > >>> Simultaneous multi-drive failures that stop an array are easily fixed
> > >>> with --assemble --force.  Too late for that now.
> > >> Noted for the future, thanks.
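> > >>
> > >> (For the record, my understanding of what that forced assembly would
> > >> have looked like - member letters per the post-reboot naming, if I
> > >> have them right:)
> > >> ---
> > >> mdadm --stop /dev/md123
> > >> mdadm --assemble --force /dev/md123 /dev/sd{c,d,e,f,g,h,i,j,k,l,m,n}1
> > >> ---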
> > >>
> > >>> It is absurdly easy to screw up device order when re-creating, and if
> > >>> you didn't specify every allocation and layout detail, the changes in
> > >>> defaults over the years would also screw up your data.  And finally,
> > >>> omitting --assume-clean would cause all of your parity to be
> > >>> recalculated immediately, with catastrophic results if any order or
> > >>> allocation attributes are wrong.
> > >> Of course. Which is why I specified everything, and why I checked the
> > >> details with --examine and --detail: they match exactly, minus the
> > >> metadata version because, well, I wasn't actually the one typing (it's
> > >> a slightly complicated story... I was reassembling by proxy over the
> > >> phone) and I made an incorrect assumption about the person typing.
> > >> There aren't, in the end, THAT many things to specify: RAID level,
> > >> number of drives, order thereof, chunk size, 'layout' and metadata
> > >> version. 0.90 doesn't allow before/after gaps so that should be it, I
> > >> believe.
> > >> Am I missing anything?
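> > >>
> > >> (In other words, every field of the old --examine below maps onto one
> > >> --create flag. A sketch, with the member list left out - left-symmetric
> > >> and the bitmap handling are per the original array:)
> > >> ---
> > >> mdadm --create -o --assume-clean -l 5 -n 12 --metadata=0.90 \
> > >>       --chunk=128 --layout=left-symmetric --bitmap=none \
> > >>       /dev/md123 <the twelve members, in the original order>
> > >> ---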
> > >>
> > >>> No, you just got lucky in the past.  Probably by using mdadm versions
> > >>> that hadn't been updated.
> > >> That's not quite it: I keep records of how arrays are built and match
> > >> them, though it is true that I tend to update things as little as
> > >> possible on production machines.
> > >> One of the differences, this time, is that this was NOT a production
> > >> machine. The other was that I was driving, dictating on the phone and
> > >> was under a lot of pressure to get the thing back up ASAP.
> > >> Nonetheless, I have an --examine of at least two drives from the
> > >> previous setup so there should be enough information there to rebuild
> > >> a matching array, I think?
> > >>
> > >>> You'll need to show us every command you tried from your history, and
> > >>> full details of all drives/partitions involved.
> > >>>
> > >>> But I'll be brutally honest:  your data is likely toast.
> > >> Well, let's hope it isn't. All mdadm commands were -o and
> > >> --assume-clean, so in theory the only things which HAVE been written
> > >> are the md superblocks, unless I am mistaken and/or read the docs
> > >> incorrectly?
> > >>
> > >> That does, of course, leave the problem of the blocks overwritten by
> > >> the 1.2 metadata, but as I read the docs that should be a very small
> > >> number - let's say one 4096-byte block (a portion thereof, to be
> > >> pedantic, but ext4 doesn't really care?) per drive, correct?
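> > >>
> > >> (If it helps to see exactly what got clobbered: per my reading of the
> > >> format docs, the v1.2 superblock sits 4 KiB into the member device, so
> > >> something like this - read-only - shows the region in question; the
> > >> device name is just one of the members, as an example:)
> > >> ---
> > >> dd if=/dev/sdc1 bs=4096 skip=1 count=1 2>/dev/null | hexdump -C | head -n 20
> > >> ---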
> > >>
> > >> Background:
> > >> Separate 2x SSD RAID1 root (/dev/sda, /dev/sdb) on the MB's (Supermicro
> > >> X10 series) chipset SATA ports.
> > >> All filesystems are ext4, data=journal, nodelalloc; the 'data' RAIDs
> > >> have journals on another SSD RAID1 (one per FS, obviously).
> > >> Data drives:
> > >> 12 x 4'TB' Seagate drives, NC000n variety, on 2x LSI 2308 controllers,
> > >> each with two four-drive ports (and one of these went DELIGHTFULLY
> > >> missing)
> > >>
> > >> This is the layout of each drive:
> > >> ---
> > >> GPT fdisk (gdisk) version 1.0.6
> > >> ...
> > >> Found valid GPT with protective MBR; using GPT.
> > >> Disk /dev/sdc: 7814037168 sectors, 3.6 TiB
> > >> Model: ST4000NC001-1FS1
> > >> Sector size (logical/physical): 512/4096 bytes
> > >> ...
> > >> Total free space is 99949 sectors (48.8 MiB)
> > >>
> > >> Number  Start (sector)    End (sector)  Size       Code  Name
> > >>     1            2048      7625195519   3.5 TiB     8300  Linux RAID volume
> > >>     2      7625195520      7813939199   90.0 GiB    8300  Linux RAID backup
> > >> ---
> > >>
> > >> So there were two RAID arrays, both RAID5 - a main array called
> > >> 'archive' made of the 12 x ~3.5 TiB sdX1 partitions, and a second
> > >> array called 'backup' made of the 12 x 90 GiB sdX2 partitions.
> > >>
> > >> A little further backstory: right before the event, one drive had been
> > >> pulled because it had started failing. What I did was shut down the
> > >> machine, put the failing drive on a MB port and put a new drive on the
> > >> LSI controllers. I then brought the machine back online, did the
> > >> --replace --with thing and this worked fine.
> > >> At that point the faulty drive (/dev/sdc, MB drives come before the
> > >> LSI drives in the count) got deleted via /sys/block.... and physically
> > >> disconnected from the system, which was then happily running with
> > >> /dev/sda and /dev/sdb as the root RAID SSDs and drives sdd -> sdo as
> > >> the 'archive' drives.
> > >> It went 96 hours or so like that under moderate load. Then the failure
> > >> happened and the machine was rebooted, so the previous sdd -> sdo
> > >> drives became sdc -> sdn.
> > >> However, the relative order was, to the best of my knowledge,
> > >> conserved - AND I still have the 'faulty' drive, so I could very
> > >> easily put it back in to have everything match.
> > >> Most importantly, this drive has on it, without a doubt, the details
> > >> of the array BEFORE everything happened - by definition untouched
> > >> because the drive was stopped and pulled before the event.
> > >> I also have a cat of the --examine of two of the faulty drives BEFORE
> > >> anything was written to them - thus, unless I am mistaken, these
> > >> contained the md block details from 'before the event'.
> > >>
> > >> Here is one of them, taken after the reboot and therefore when the MB
> > >> /dev/sdc was no longer there:
> > >> ---
> > >> /dev/sdc1:
> > >>            Magic : a92b4efc
> > >>          Version : 0.90.00
> > >>             UUID : 2457b506:85728e9d:c44c77eb:7ee19756
> > >>    Creation Time : Sat Mar 30 18:18:00 2019
> > >>       Raid Level : raid5
> > >>    Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
> > >>       Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
> > >>     Raid Devices : 12
> > >>    Total Devices : 12
> > >> Preferred Minor : 123
> > >>
> > >>      Update Time : Tue Sep  6 11:37:53 2022
> > >>            State : clean
> > >>   Active Devices : 12
> > >> Working Devices : 12
> > >>   Failed Devices : 0
> > >>    Spare Devices : 0
> > >>         Checksum : 391e325d - correct
> > >>           Events : 52177
> > >>
> > >>           Layout : left-symmetric
> > >>       Chunk Size : 128K
> > >>
> > >>        Number   Major   Minor   RaidDevice State
> > >> this     5       8       49        5      active sync   /dev/sdd1
> > >>
> > >>     0     0       8      225        0      active sync
> > >>     1     1       8       81        1      active sync   /dev/sdf1
> > >>     2     2       8       97        2      active sync   /dev/sdg1
> > >>     3     3       8      161        3      active sync   /dev/sdk1
> > >>     4     4       8      113        4      active sync   /dev/sdh1
> > >>     5     5       8       49        5      active sync   /dev/sdd1
> > >>     6     6       8      177        6      active sync   /dev/sdl1
> > >>     7     7       8      145        7      active sync   /dev/sdj1
> > >>     8     8       8      129        8      active sync   /dev/sdi1
> > >>     9     9       8       65        9      active sync   /dev/sde1
> > >>    10    10       8      209       10      active sync   /dev/sdn1
> > >>    11    11       8      193       11      active sync   /dev/sdm1
> > >> ---
> > >> Note that the drives have 'moved' because the old /dev/sdc isn't there
> > >> any more, but the relative positions should be the same - correct me
> > >> if I am wrong. If you prefer: to get the 'new' drive letter, subtract
> > >> 16 from the minor number of each drive.
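> > >>
> > >> (Worked example of that arithmetic, to check myself: /dev/sdd1 is
> > >> major:minor 8:49, 49 - 16 = 33, and 8:33 is /dev/sdc1.)
> > >> ---
> > >> ls -l /dev/sd[cd]1   # shows the major, minor pairs for a quick cross-check
> > >> ---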
> > >>
> > >> And this is the --examine after the 'new' --create:
> > >> ---
> > >> /dev/sdc1:
> > >>            Magic : a92b4efc
> > >>          Version : 0.90.00
> > >>             UUID : 79990944:0bb9420b:97d5a417:7d4e9ef8 (local to host beehive)
> > >>    Creation Time : Tue Sep  6 15:15:03 2022
> > >>       Raid Level : raid5
> > >>    Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
> > >>       Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
> > >>     Raid Devices : 12
> > >>    Total Devices : 12
> > >> Preferred Minor : 123
> > >>
> > >>      Update Time : Tue Sep  6 15:15:03 2022
> > >>            State : clean
> > >>   Active Devices : 12
> > >> Working Devices : 12
> > >>   Failed Devices : 0
> > >>    Spare Devices : 0
> > >>         Checksum : ed12b96a - correct
> > >>           Events : 1
> > >>
> > >>           Layout : left-symmetric
> > >>       Chunk Size : 128K
> > >>
> > >>        Number   Major   Minor   RaidDevice State
> > >> this     5       8       33        5      active sync   /dev/sdc1
> > >>
> > >>     0     0       8      209        0      active sync   /dev/sdn1
> > >>     1     1       8       65        1      active sync   /dev/sde1
> > >>     2     2       8       81        2      active sync   /dev/sdf1
> > >>     3     3       8      145        3      active sync   /dev/sdj1
> > >>     4     4       8       97        4      active sync   /dev/sdg1
> > >>     5     5       8       33        5      active sync   /dev/sdc1
> > >>     6     6       8      161        6      active sync   /dev/sdk1
> > >>     7     7       8      129        7      active sync   /dev/sdi1
> > >>     8     8       8      113        8      active sync   /dev/sdh1
> > >>     9     9       8       49        9      active sync   /dev/sdd1
> > >>    10    10       8      193       10      active sync   /dev/sdm1
> > >>    11    11       8      177       11      active sync   /dev/sdl1
> > >> ---
> > >>
> > >> If you put the layout lines side by side, it would seem to me that
> > >> they match, modulo the '16' difference.
> > >>
> > >> This is the list of --create and --assemble commands from the 6th
> > >> which involve the sdX1 partitions, the ones we care about right now -
> > >> there were others involving /dev/md124 and the /dev/sdX2 partitions,
> > >> which however are not relevant - the data there:
> > >> ---
> > >>   9813  mdadm --assemble /dev/md123 missing
> > >>   9814  mdadm --assemble /dev/md123 missing /dev/sdf1 /dev/sdg1
> > >> /dev/sdk1 /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1
> > >> /dev/sdn1 /dev/sdm1
> > >>   9815  mdadm --assemble /dev/md123 /dev/sdf1 /dev/sdg1 /dev/sdk1
> > >> /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 /dev/sdn1
> > >> /dev/sdm1
> > >>   9823  mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1
> > >> /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >>   9824  mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1
> > >> /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1
> > >> /dev/sdd1 /dev/sdm1 /dev/sdl1
> > >> ^^^^ note that these were the WRONG ARRAY - this was an unfortunate
> > >> miscommunication which caused potential damage.
> > >>   9852  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 /dev/md123 /dev/sdn1 /dev/sdd1 /dev/sdf1 /dev/sde1
> > >> /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1
> > >>   9863  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 /dev/sdf1
> > >> /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1
> > >> /dev/sdl1
> > >>   9879  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1
> > >> /dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1
> > >> /dev/sdk1 /dev/sdl1
> > >>   9889  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >>   9892  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >>   9895  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > >> /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >>   9901  mdadm --assemble /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >>   9903  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > >> /dev/sdj1 /dev/sdg1 / dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >> ---
> > >>
> > >> Note that they were all -o, therefore, if I am not mistaken, no parity
> > >> data was written anywhere. Note further that the first two were the
> > >> 'mistake' ones, which did NOT have --assume-clean (but with -o this
> > >> shouldn't make a difference AFAIK) and, most importantly, used the
> > >> default 1.2 metadata AND targeted the wrong array in the first place.
> > >> Note also that the 'final' --create commands also had --bitmap=none to
> > >> match the original array, though according to the docs the bitmap in
> > >> 0.90 (and 1.2?) lives in an area which does not overlap the data in
> > >> the first place.
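> > >>
> > >> (Before trying any further --create permutations I intend to work on
> > >> copy-on-write overlays, per the RAID wiki, so the real members cannot
> > >> be touched again. Rough sketch for one member - the COW file size,
> > >> paths and mapper names are made up:)
> > >> ---
> > >> truncate -s 50G /overlays/sdc1.cow                  # sparse COW file
> > >> LOOP=$(losetup -f --show /overlays/sdc1.cow)
> > >> SECTORS=$(blockdev --getsz /dev/sdc1)
> > >> dmsetup create ov_sdc1 --table "0 $SECTORS snapshot /dev/sdc1 $LOOP N 8"
> > >> # repeat for the other eleven, then assemble/create against /dev/mapper/ov_*
> > >> ---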
> > >>
> > >> Now, first of all, a question: if I put the 'old' sdc, the one that
> > >> was taken out prior to this whole mess, into a different system in
> > >> order to examine it, the modern mdraid auto-discovery should NOT
> > >> overwrite the md data, correct? So I should be able to double-check
> > >> the drive order on that as well?
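> > >>
> > >> (The plan there, roughly - the conf path differs by distro, and
> > >> --examine itself writes nothing:)
> > >> ---
> > >> echo 'AUTO -all' >> /etc/mdadm/mdadm.conf   # keep udev/mdadm from auto-assembling
> > >> mdadm --examine /dev/sdX1                   # sdX = whatever letter the old drive gets
> > >> ---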
> > >>
> > >> Any other pointers, insults etc are of course welcome.
> >


