A further question: in THIS boot's log I found:

[ 9874.709903] md/raid:md123: raid level 5 active with 12 out of 12 devices, algorithm 2
[ 9874.710249] md123: bitmap file is out of date (0 < 1) -- forcing full recovery
[ 9874.714178] md123: bitmap file is out of date, doing full recovery
[ 9874.881106] md123: detected capacity change from 0 to 42945088192512

That is, I think, from the second --create of /dev/md123, before I added
--bitmap=none. This should, however, not have written anything with -o
and --assume-clean, correct?

On Fri, Sep 9, 2022 at 6:50 PM Luigi Fabio <luigi.fabio@xxxxxxxxx> wrote:
>
> By different kernels, maybe - but the kernel has been the same for
> quite a while (months).
>
> I did paste the full command lines in the (very long) email, as David
> mentions (thanks!) - the first ones, the mistaken ones, did NOT have
> --assume-clean, but they did have -o, so no parity activity should
> have started, according to the docs?
> A new thought came to mind: one of the HBAs lost a channel, right?
> What if, on the subsequent reboot, the devices that were on that
> channel got 'rediscovered' and shunted to the end of the letter order?
> That would, I believe, be ordinary operating procedure.
> That would give us an almost-correct array, which would explain how
> fsck can get ... some pieces.
>
> Also, I am not quite brave enough (...) to use shortcuts when handling
> mdadm commands.
>
> I am reconstructing the port order (SCSI targets, if you prefer) from
> the 20220904 boot log. I should at that point be able to have an exact
> order of the drives.
>
> Here it is:
>
> ---
> [ 1.853329] sd 2:0:0:0: [sda] Write Protect is off
> [ 1.853331] sd 7:0:0:0: [sdc] Write Protect is off
> [ 1.853382] sd 3:0:0:0: [sdb] Write Protect is off
> [ 12.531607] sd 10:0:3:0: [sdg] Write Protect is off
> [ 12.533303] sd 10:0:2:0: [sdf] Write Protect is off
> [ 12.534606] sd 10:0:0:0: [sdd] Write Protect is off
> [ 12.570768] sd 10:0:1:0: [sde] Write Protect is off
> [ 12.959925] sd 11:0:0:0: [sdh] Write Protect is off
> [ 12.965230] sd 11:0:1:0: [sdi] Write Protect is off
> [ 12.966145] sd 11:0:4:0: [sdl] Write Protect is off
> [ 12.966800] sd 11:0:3:0: [sdk] Write Protect is off
> [ 12.997253] sd 11:0:2:0: [sdj] Write Protect is off
> [ 13.002395] sd 11:0:7:0: [sdo] Write Protect is off
> [ 13.012693] sd 11:0:5:0: [sdm] Write Protect is off
> [ 13.017630] sd 11:0:6:0: [sdn] Write Protect is off
> ---
> If we combine this with the previous:
> ---
> [ 13.528395] md/raid:md123: device sdd1 operational as raid disk 5
> [ 13.528396] md/raid:md123: device sde1 operational as raid disk 9
> [ 13.528397] md/raid:md123: device sdg1 operational as raid disk 2
> [ 13.528398] md/raid:md123: device sdf1 operational as raid disk 1
> [ 13.528398] md/raid:md123: device sdh1 operational as raid disk 4
> [ 13.528399] md/raid:md123: device sdk1 operational as raid disk 3
> [ 13.528400] md/raid:md123: device sdj1 operational as raid disk 7
> [ 13.528401] md/raid:md123: device sdn1 operational as raid disk 10
> [ 13.528402] md/raid:md123: device sdi1 operational as raid disk 8
> [ 13.528402] md/raid:md123: device sdl1 operational as raid disk 6
> [ 13.528403] md/raid:md123: device sdm1 operational as raid disk 11
> [ 13.528403] md/raid:md123: device sdc1 operational as raid disk 0
> [ 13.531613] md/raid:md123: raid level 5 active with 12 out of 12 devices, algorithm 2
> [ 13.531644] md123: detected capacity change from 0 to 42945088192512
> ---
> We have a SCSI target -> raid disk number correspondence.
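>
> (For reference, a rough sketch of how the two logs can be
> cross-referenced by drive letter - untested as written, and it assumes
> both dmesg excerpts are saved in boot.log:
>
> # letter -> SCSI target, from the 'Write Protect' lines:
> grep 'Write Protect is off' boot.log \
>   | sed -E 's/.*sd ([0-9:]+): \[(sd[a-z]+)\].*/\2 \1/' | sort > targets
> # letter -> raid disk number, from the md lines:
> grep 'operational as raid disk' boot.log \
>   | sed -E 's/.*device (sd[a-z]+)1 operational as raid disk ([0-9]+).*/\1 \2/' \
>   | sort > positions
> # join on the letter; each output line is: letter, SCSI target, raid disk
> join targets positions
>
> Only letters present in both logs survive the join, which is exactly
> what we want - the array members.)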
> As of this boot, the letter -> SCSI target correspondences match,
> shifted by one because, as discussed, 7:0:0:0 is no longer there (the
> old, 'faulty' sdc).
> Thus, having unambiguously determined the prior SCSI target -> raid
> position, we can transpose it to the present drive letters, which are
> shifted by one.
> Therefore, we can generate - or rather, have already generated - a
> --create with the same software versions, the same settings and the
> same drive order. Is there any reason why, minus the 1.2 metadata
> overwriting, which should have only affected 12 blocks, the fs should
> 'not' be as before?
> Genuine question, mind.
>
> On Fri, Sep 9, 2022 at 5:48 PM Phil Turmel <philip@xxxxxxxxxx> wrote:
> >
> > Reasonably likely, but not certain.
> >
> > Devices can be re-ordered by different kernels. That's why lsdrv prints
> > serial numbers in its tree.
> >
> > You haven't mentioned whether your --create operations specified
> > --assume-clean.
> >
> > Also, be aware that shell expansion of something like /dev/sd[dcbaefgh]
> > is sorted to /dev/sd[abcdefgh]. Use curly brace expansion with commas
> > if you are taking shortcuts.
> >
> > On 9/9/22 17:01, Luigi Fabio wrote:
> > > Another helpful datapoint, this is the boot *before* sdc got
> > > --replaced with sdo:
> > >
> > > [ 13.528395] md/raid:md123: device sdd1 operational as raid disk 5
> > > [ 13.528396] md/raid:md123: device sde1 operational as raid disk 9
> > > [ 13.528397] md/raid:md123: device sdg1 operational as raid disk 2
> > > [ 13.528398] md/raid:md123: device sdf1 operational as raid disk 1
> > > [ 13.528398] md/raid:md123: device sdh1 operational as raid disk 4
> > > [ 13.528399] md/raid:md123: device sdk1 operational as raid disk 3
> > > [ 13.528400] md/raid:md123: device sdj1 operational as raid disk 7
> > > [ 13.528401] md/raid:md123: device sdn1 operational as raid disk 10
> > > [ 13.528402] md/raid:md123: device sdi1 operational as raid disk 8
> > > [ 13.528402] md/raid:md123: device sdl1 operational as raid disk 6
> > > [ 13.528403] md/raid:md123: device sdm1 operational as raid disk 11
> > > [ 13.528403] md/raid:md123: device sdc1 operational as raid disk 0
> > > [ 13.531613] md/raid:md123: raid level 5 active with 12 out of 12 devices, algorithm 2
> > > [ 13.531644] md123: detected capacity change from 0 to 42945088192512
> > >
> > > This gives us, correct me if I am wrong of course, an exact
> > > representation of what the array 'used to look like', with sdc1 then
> > > replaced by sdo1 (8/225).
> > >
> > > Just some confirmation that the order should (?) be the one above.
> > >
> > > LF
> > >
> > > On Fri, Sep 9, 2022 at 4:32 PM Luigi Fabio <luigi.fabio@xxxxxxxxx> wrote:
> > >>
> > >> Thanks for reaching out, first of all. Apologies for the late reply,
> > >> the brilliant (...) spam filter strikes again...
> > >>
> > >> On Thu, Sep 8, 2022 at 1:23 PM Phil Turmel <philip@xxxxxxxxxx> wrote:
> > >>> No, the moment of stupid was that you re-created the array.
> > >>> Simultaneous multi-drive failures that stop an array are easily fixed
> > >>> with --assemble --force. Too late for that now.
> > >> Noted for the future, thanks.
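> > >> For my own future reference, the shape of that recovery would have
> > >> been something like this (device names purely illustrative), instead
> > >> of any --create:
> > >>
> > >> mdadm --stop /dev/md123
> > >> mdadm --assemble --force /dev/md123 /dev/sd{c,d,e,f,g,h,i,j,k,l,m,n}1
> > >>
> > >> (brace expansion per your note below - a [ ] glob would get sorted by
> > >> the shell, which is harmless for --assemble, where the superblocks
> > >> carry the roles, but fatal for --create, where order is everything.)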
> > >>> It is absurdly easy to screw up device order when re-creating, and if
> > >>> you didn't specify every allocation and layout detail, the changes in
> > >>> defaults over the years would also screw up your data. And finally,
> > >>> omitting --assume-clean would cause all of your parity to be
> > >>> recalculated immediately, with catastrophic results if any order or
> > >>> allocation attributes are wrong.
> > >> Of course. Which is why I specified everything and why I checked the
> > >> details with --examine and --detail, and they match exactly, minus the
> > >> metadata version - because, well, I wasn't actually the one typing
> > >> (it's a slightly complicated story: I was reassembling by proxy on the
> > >> phone) and I made an incorrect assumption about the person typing.
> > >> There aren't, in the end, THAT many things to specify: RAID level,
> > >> number of drives, order thereof, chunk size, 'layout' and metadata
> > >> version. 0.90 doesn't allow before/after gaps, so that should be it, I
> > >> believe.
> > >> Am I missing anything?
> > >>
> > >>> No, you just got lucky in the past. Probably by using mdadm versions
> > >>> that hadn't been updated.
> > >> That's not quite it: I keep records of how arrays are built and match
> > >> them, though it is true that I tend to update things as little as
> > >> possible on production machines.
> > >> One of the differences, this time, is that this was NOT a production
> > >> machine. The other was that I was driving, dictating on the phone and
> > >> was under a lot of pressure to get the thing back up ASAP.
> > >> Nonetheless, I have an --examine of at least two drives from the
> > >> previous setup, so there should be enough information there to rebuild
> > >> a matching array, I think?
> > >>
> > >>> You'll need to show us every command you tried from your history, and
> > >>> full details of all drives/partitions involved.
> > >>>
> > >>> But I'll be brutally honest: your data is likely toast.
> > >> Well, let's hope it isn't. All mdadm commands were -o and
> > >> --assume-clean, so in theory the only things which HAVE been written
> > >> are the md blocks, unless I am mistaken and/or read the docs
> > >> incorrectly?
> > >>
> > >> That does, of course, leave the problem of the blocks overwritten by
> > >> the 1.2 metadata, but as I read the docs that should be a very small
> > >> number - let's say one 4096-byte block (a portion thereof, to be
> > >> pedantic, but ext4 doesn't really care?) per drive, correct?
> > >>
> > >> Background:
> > >> Separate 2x SSD RAID1 root (/dev/sda, /dev/sdb) on the MB (Supermicro
> > >> X10 series) chipset SATA ports.
> > >> All filesystems are ext4, data=journal, nodelalloc; the 'data' RAIDs
> > >> have journals on another SSD RAID1 (one per FS, obviously).
> > >> Data drives:
> > >> 12 x 4'TB' Seagate drives, NC000n variety, on 2x LSI 2308 controllers,
> > >> each with two four-drive ports (and one of these went DELIGHTFULLY
> > >> missing).
> > >>
> > >> This is the layout of each drive:
> > >> ---
> > >> GPT fdisk (gdisk) version 1.0.6
> > >> ...
> > >> Found valid GPT with protective MBR; using GPT.
> > >> Disk /dev/sdc: 7814037168 sectors, 3.6 TiB
> > >> Model: ST4000NC001-1FS1
> > >> Sector size (logical/physical): 512/4096 bytes
> > >> ...
> > >> Total free space is 99949 sectors (48.8 MiB)
> > >>
> > >> Number  Start (sector)    End (sector)  Size       Code  Name
> > >>    1            2048      7625195519    3.5 TiB    8300  Linux RAID volume
> > >>    2      7625195520      7813939199    90.0 GiB   8300  Linux RAID backup
> > >> ---
> > >>
> > >> So there were two RAID arrays, both RAID5: a main array called
> > >> 'archive', which had the 12 x 3.5ish TiB sdX1 partitions, and a second
> > >> array called 'backup', which had the 12 x 90 GB sdX2 partitions.
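> > >>
> > >> (All twelve data drives carry the same partition table; a quick
> > >> read-only loop to confirm that, assuming they currently sit at
> > >> sdc..sdn, would be something like:
> > >>
> > >> for d in /dev/sd{c..n}; do
> > >>     echo "== $d"
> > >>     gdisk -l "$d" | grep -A2 '^Number'   # header plus the two partition rows
> > >> done
> > >>
> > >> - gdisk -l only lists, nothing there writes to the disks.)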
> > >>
> > >> A little further backstory: right before the event, one drive had been
> > >> pulled because it had started failing. What I did was shut down the
> > >> machine, put the failing drive on a MB port and put a new drive on the
> > >> LSI controllers. I then brought the machine back online, did the
> > >> --replace --with thing, and this worked fine.
> > >> At that point the faulty drive (/dev/sdc - MB drives come before the
> > >> LSI drives in the count) got deleted via /sys/block.... and physically
> > >> disconnected from the system, which was then happily running with
> > >> /dev/sda and /dev/sdb as the root RAID SSDs and drives sdd -> sdo as
> > >> the 'archive' drives.
> > >> It went 96 hours or so like that under moderate load. Then the failure
> > >> happened and the machine was rebooted, so the previous sdd -> sdo
> > >> drives became sdc -> sdn drives.
> > >> However, the relative order was, to the best of my knowledge,
> > >> conserved - AND I still have the 'faulty' drive, so I could very
> > >> easily put it back in to have everything match.
> > >> Most importantly, this drive has on it, without a doubt, the details
> > >> of the array BEFORE everything happened - by definition untouched,
> > >> because the drive was stopped and pulled before the event.
> > >> I also have a cat of the --examine of two of the faulty drives BEFORE
> > >> anything was written to them - thus, unless I am mistaken, these
> > >> contained the md block details from 'before the event'.
> > >>
> > >> Here is one of them, taken after the reboot and therefore when the MB
> > >> /dev/sdc was no longer there:
> > >> ---
> > >> /dev/sdc1:
> > >>           Magic : a92b4efc
> > >>         Version : 0.90.00
> > >>            UUID : 2457b506:85728e9d:c44c77eb:7ee19756
> > >>   Creation Time : Sat Mar 30 18:18:00 2019
> > >>      Raid Level : raid5
> > >>   Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
> > >>      Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
> > >>    Raid Devices : 12
> > >>   Total Devices : 12
> > >> Preferred Minor : 123
> > >>
> > >>     Update Time : Tue Sep  6 11:37:53 2022
> > >>           State : clean
> > >>  Active Devices : 12
> > >> Working Devices : 12
> > >>  Failed Devices : 0
> > >>   Spare Devices : 0
> > >>        Checksum : 391e325d - correct
> > >>          Events : 52177
> > >>
> > >>          Layout : left-symmetric
> > >>      Chunk Size : 128K
> > >>
> > >>       Number   Major   Minor   RaidDevice State
> > >> this     5       8       49        5      active sync   /dev/sdd1
> > >>
> > >>    0     0       8      225        0      active sync
> > >>    1     1       8       81        1      active sync   /dev/sdf1
> > >>    2     2       8       97        2      active sync   /dev/sdg1
> > >>    3     3       8      161        3      active sync   /dev/sdk1
> > >>    4     4       8      113        4      active sync   /dev/sdh1
> > >>    5     5       8       49        5      active sync   /dev/sdd1
> > >>    6     6       8      177        6      active sync   /dev/sdl1
> > >>    7     7       8      145        7      active sync   /dev/sdj1
> > >>    8     8       8      129        8      active sync   /dev/sdi1
> > >>    9     9       8       65        9      active sync   /dev/sde1
> > >>   10    10       8      209       10      active sync   /dev/sdn1
> > >>   11    11       8      193       11      active sync   /dev/sdm1
> > >> ---
> > >> Note that the drives are 'moved' because the old /dev/sdc isn't there
> > >> any more, but the relative position should be the same - correct me if
> > >> I am wrong. If you prefer: to get the 'new' drive letter, subtract 16
> > >> from the minor of each of the drives.
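> > >>
> > >> (Spelled out, since 'sd' minors come in blocks of 16 and partition 1
> > >> adds one - a throwaway one-liner, old minor in, post-reboot name out:
> > >>
> > >> awk -v m=49 'BEGIN { printf "sd%c1\n", 97 + (m - 16 - 1) / 16 }'
> > >>
> > >> which for minor 49, the old sdd1, prints sdc1 - matching the 'new'
> > >> table below.)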
> > >>
> > >> This is the 'new' --create:
> > >> ---
> > >> /dev/sdc1:
> > >>           Magic : a92b4efc
> > >>         Version : 0.90.00
> > >>            UUID : 79990944:0bb9420b:97d5a417:7d4e9ef8 (local to host beehive)
> > >>   Creation Time : Tue Sep  6 15:15:03 2022
> > >>      Raid Level : raid5
> > >>   Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
> > >>      Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
> > >>    Raid Devices : 12
> > >>   Total Devices : 12
> > >> Preferred Minor : 123
> > >>
> > >>     Update Time : Tue Sep  6 15:15:03 2022
> > >>           State : clean
> > >>  Active Devices : 12
> > >> Working Devices : 12
> > >>  Failed Devices : 0
> > >>   Spare Devices : 0
> > >>        Checksum : ed12b96a - correct
> > >>          Events : 1
> > >>
> > >>          Layout : left-symmetric
> > >>      Chunk Size : 128K
> > >>
> > >>       Number   Major   Minor   RaidDevice State
> > >> this     5       8       33        5      active sync   /dev/sdc1
> > >>
> > >>    0     0       8      209        0      active sync   /dev/sdn1
> > >>    1     1       8       65        1      active sync   /dev/sde1
> > >>    2     2       8       81        2      active sync   /dev/sdf1
> > >>    3     3       8      145        3      active sync   /dev/sdj1
> > >>    4     4       8       97        4      active sync   /dev/sdg1
> > >>    5     5       8       33        5      active sync   /dev/sdc1
> > >>    6     6       8      161        6      active sync   /dev/sdk1
> > >>    7     7       8      129        7      active sync   /dev/sdi1
> > >>    8     8       8      113        8      active sync   /dev/sdh1
> > >>    9     9       8       49        9      active sync   /dev/sdd1
> > >>   10    10       8      193       10      active sync   /dev/sdm1
> > >>   11    11       8      177       11      active sync   /dev/sdl1
> > >> ---
> > >>
> > >> If you put the layout lines side by side, it would seem to me that
> > >> they match, modulo the '16' difference.
> > >>
> > >> This is the list of --create and --assemble commands from the 6th
> > >> which involve the sdX1 partitions, the ones we care about right now -
> > >> there were others involving /dev/md124 and the /dev/sdX2, which
> > >> however are not relevant here:
> > >> ---
> > >> 9813 mdadm --assemble /dev/md123 missing
> > >> 9814 mdadm --assemble /dev/md123 missing /dev/sdf1 /dev/sdg1
> > >> /dev/sdk1 /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1
> > >> /dev/sdn1 /dev/sdm1
> > >> 9815 mdadm --assemble /dev/md123 /dev/sdf1 /dev/sdg1 /dev/sdk1
> > >> /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 /dev/sdn1
> > >> /dev/sdm1
> > >> 9823 mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1
> > >> /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >> 9824 mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1
> > >> /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1
> > >> /dev/sdd1 /dev/sdm1 /dev/sdl1
> > >> ^^^^ note that these were the WRONG ARRAY - this was an unfortunate
> > >> miscommunication which caused potential damage.
> > >> 9852 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 /dev/md123 /dev/sdn1 /dev/sdd1 /dev/sdf1 /dev/sde1
> > >> /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1
> > >> 9863 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 /dev/sdf1
> > >> /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1
> > >> /dev/sdl1
> > >> 9879 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1
> > >> /dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1
> > >> /dev/sdk1 /dev/sdl1
> > >> 9889 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >> 9892 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >> 9895 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > >> /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >> 9901 mdadm --assemble /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >> 9903 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > >> /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >> ---
> > >>
> > >> Note that they were all -o, therefore, if I am not mistaken, no parity
> > >> data was written anywhere. Note further that the first two were the
> > >> 'mistake' ones, which did NOT have --assume-clean (but with -o this
> > >> shouldn't make a difference, AFAIK) and, most importantly, used the
> > >> default 1.2 metadata AND targeted the wrong array in the first place.
> > >> Note also that the 'final' --create commands had --bitmap=none, to
> > >> match the original array, though according to the docs the bitmap
> > >> space in 0.90 (and 1.2?) is in a region which does not affect the data
> > >> in the first place.
> > >>
> > >> Now, first of all, a question: if I put the 'old' sdc - the one that
> > >> was taken out prior to this whole mess - into a different system in
> > >> order to examine it, the modern mdraid auto-discovery should NOT
> > >> overwrite the md data, correct? Thus I should be able to double-check
> > >> the drive order on that as well?
> > >>
> > >> Any other pointers, insults etc. are of course welcome.
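> > >>
> > >> (My plan for that check, unless someone objects - a sketch, with the
> > >> device name obviously depending on how the drive enumerates there,
> > >> and the mdadm.conf path varying by distro:
> > >>
> > >> # belt and braces: tell mdadm's udev incremental assembly to leave
> > >> # everything alone on that system first
> > >> echo 'AUTO -all' >> /etc/mdadm/mdadm.conf
> > >> update-initramfs -u    # if the host is Debian-ish
> > >>
> > >> # then just read the superblock; --examine never writes
> > >> mdadm --examine /dev/sdX1
> > >>
> > >> and compare the device-role table against the 'old' one above.)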