Re: RAID5 failure and consequent ext4 problems

Well, I found SOMETHING of decided interest: when I run dumpe2fs with
any backup superblock, this happens:

---
Filesystem created:       Tue Nov  4 08:56:08 2008
Last mount time:          Thu Aug 18 21:04:22 2022
Last write time:          Thu Aug 18 21:04:22 2022
---

So the backups have not been updated since boot-before-last? That
would explain why, when fsck tries to use those backups, it comes up
with funny results.

Is this ...as intended, I wonder? Does it also imply that any file
that was written to after Aug 18th will be in an indeterminate state?
That would seem to be the implication.
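
For reference, this is how I'm pulling those timestamps out of a
backup - standard e2fsprogs syntax; /dev/md0 and the block/blocksize
numbers are just placeholders for the real device and layout here:

---
# where the backup superblocks live
dumpe2fs /dev/md0 | grep -i 'superblock at'

# read the timestamps out of one of the backups, e.g. the one at block 32768
dumpe2fs -o superblock=32768 -o blocksize=4096 /dev/md0 | grep -Ei 'created|time'
---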

On Sat, Sep 10, 2022 at 3:30 PM Luigi Fabio <luigi.fabio@xxxxxxxxx> wrote:
>
> Hello Phil,
> thank you BTW for your continued assistance. Here goes:
>
> On Sat, Sep 10, 2022 at 11:18 AM Phil Turmel <philip@xxxxxxxxxx> wrote:
> > Yes.  Same kernels are pretty repeatable for device order on bootup as
> > long as all are present.  Anything missing will shift the letter
> > assignments.
> We need to keep this in mind, though the boot-log scsi target ->
> letter assignments seem to indicate that we're clear, as discussed.
> This is relevant since I have re-created the array.
>
> > Okay, that should have saved you.  Except, I think it still writes all
> > the meta-data.  With v1.2, that would sparsely trash up to 1/4 gig at
> > the beginning of each device.
> I dug into the docs and the wiki and ran some experiments on another
> machine. Apparently, what 1.2 does with my kernel and my mdadm is use
> sectors 9 to 80 of each device. Thus, it borked 72 512-byte sectors ->
> 36 kB -> 9 ext3 blocks per device, sparsely as you say.
> This is 'fine' even with a 128 kB chunk: the first affected area doesn't
> really matter, because fsck does detect that it nuked the block group
> descriptors, but the superblock before them is fine (indeed, tune2fs
> and dumpe2fs work 'as expected'), so fsck falls back to a backup and is
> happy, even declaring the fs clean.
> Therefore out of the 12 'affected' areas, one doesn't matter for
> practical purposes and we have to wonder about the others.  Arguably,
> one of those should also be managed by parity but I have no idea how
> that will work out - it may be very important actually at the time of
> any future resync.
> Now, these all fall in the first chunk of each device, which would form
> the first 1408 kB of the filesystem (128 kB chunk, remember the
> original creation is *old*), since I believe mdraid preserves device
> order and therefore the chunks map in sequence.
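>
> If it helps, this is how I checked where the 1.2 metadata actually
> landed on the test machine - mdadm --examine reports the offsets
> directly (/dev/sdb below is just a stand-in for each member):
> ---
> # v1.2 superblock offset, data offset and unused gap, per member
> mdadm --examine /dev/sdb | grep -E 'Super Offset|Data Offset|Unused Space'
> ---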
> We know the following from dumpe2fs:
> ---
> Group 0: (Blocks 0-32767) csum 0x45ff [ITABLE_ZEROED]
>   Primary superblock at 0, Group descriptors at 1-2096
>   Block bitmap at 2260 (+2260), csum 0x824f8d47
>   Inode bitmap at 2261 (+2261), csum 0xdadef5ad
>   Inode table at 2262-2773 (+2262)
>   0 free blocks, 8179 free inodes, 2 directories, 8179 unused inodes
> ---
> So the first 2097 blocks are the superblock plus group descriptors,
> which have backups - this is *way* more than the 1408 kB, therefore
> with restored BGDs (fsck -b 32768, say) we should be... fine?
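>
> The dry-run form of that would be something along these lines (block
> size and device name are assumptions on my part; dropping -n is what
> would actually rewrite the primary sb/BGDs from the backup):
> ---
> # forced, read-only check driven by a far-out backup superblock
> e2fsck -fn -b 32768 -B 4096 /dev/md0
> ---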
>
> Now, if OTOH I do an fsck -nf, all sorts of weird stuff happens, but I
> have to wonder whether that's because the BGDs are not happy. I am tempted
> to run an overlay *for the fsck*, what do you think?
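>
> The overlay I have in mind is the usual dmsetup snapshot trick, roughly
> as below (names and the overlay size are placeholders; the sparse file
> just has to be big enough to absorb whatever fsck would write):
> ---
> # sparse file to catch the writes, attached to a loop device
> truncate -s 20G /var/tmp/md0-cow.img
> LOOP=$(losetup -f --show /var/tmp/md0-cow.img)
>
> # non-persistent snapshot of the array, 4k chunks
> SIZE=$(blockdev --getsz /dev/md0)
> dmsetup create md0-overlay --table "0 $SIZE snapshot /dev/md0 $LOOP N 8"
>
> # run the scary fsck against the overlay, not the real array
> e2fsck -fy /dev/mapper/md0-overlay
> ---
> That way a full repair pass can be inspected without committing a
> single byte to the real array.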
>
> > Well, yes.  But doesn't matter for assembly attempts, which always go by
> > the meta-data.  Device order only ever matters for --create when recreating.
> Sure, but keep in mind, my --create commands nuked the original 0.90
> metadata as well, so we need to be sure that the order is correct or
> we'll have a real jumble.
> Now, the cables have not been moved and the boot logs confirm that the
> scsi targets correspond, so we should have the order correct, and the
> parameters are correct from the previous logs. Therefore, we 'should'
> have the same data space as before.
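>
> As a cross-check on that, the by-path links give the controller/target
> -> current letter mapping without digging through dmesg:
> ---
> # host/channel/target/lun behind each current sdX name
> ls -l /dev/disk/by-path/ | grep -v part
> ---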
>
> > If you consistently used -o or --assume-clean, then everything beyond
> > ~3G should be untouched, if you can get the order right.  Have fsck try
> > backup superblocks way out.
> fsck grabs a backup 'magically' and seems to be happy - unless I run
> it with -nf, and then... all sorts of bad stuff happens.
>
> > Please use lsdrv to capture names versus serial numbers.  Re-run it
> > before any --create operation to ensure the current names really do
> > match the expected serial numbers.  Keep track of ordering information
> > by serial number.  Note that lsdrv will reliably line up PHYs on SAS
> > controllers, so that can be trusted, too.
> Thing is... I can't find lsdrv. As in: there is no lsdrv binary,
> apparently, in Debian stable or in Debian testing. Where do I look for
> it?
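>
> Until I find it, my stop-gap for pinning names to serial numbers is the
> by-id links plus lsblk - not as thorough as lsdrv, but it records the
> ordering:
> ---
> # persistent, serial-bearing names next to their current sdX assignment
> ls -l /dev/disk/by-id/ | grep -v part
>
> # or the same information straight from lsblk
> lsblk -d -o NAME,MODEL,SERIAL
> ---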
>
> > Superblocks other than 0.9x and 1.0 place a bad block log and a written
> > block bitmap between the superblock and the data area.  I'm not sure if
> > any of the remaining space is wiped.  These would be written regardless of
> > -o or --assume-clean.  Those flags "protect" the *data area* of the
> > array, not the array's own metadata.
> Yes - this is the damage I'm talking about above. From the logs, the
> 'area' is 4096 sectors, of which 4016 remain 'unused'. That leaves 80
> sectors, with the first 8 not being touched (and the proof is that the
> superblock is 'happy', though interestingly this should not be the
> case, because the gr0 superblock is offset by 1024 bytes -> the last
> 1024 bytes of the superblock should be borked too).
> From this, my math above.
>
>
> >  > From, I think, the second --create of /dev/123, before I added the
> >  > bitmap=none. This should, however, not have written anything with -o
> >  > and --assume-clean, correct?
> > False assumption.  As described above.
> Two different things: what I meant was that even with that bitmap
> message, the only thing that would have been written is the metadata.
> The Linux RAID documentation states repeatedly that with -o no resyncing
> or parity reconstruction is performed. Yes, agreed, the 1.2
> metadata got written, but it's the only thing that got written since
> the array was stopped by the error, if I am reading the docs
> correctly?
>
> > Okay.  To date, you've only done create with -o or --assume-clean?
> >
> > If so, it is likely your 0.90 superblocks are still present at the ends
> > of the disks.
> Problem is, as I mentioned in my previous email, I have ALSO done a
> --create with --metadata=0.90, which overwrote the original
> superblocks.
> HOWEVER, I do have the logs of the original parameters, and I have at
> least one drive - the old sdc - which was kicked out before this whole
> thing started; that becomes relevant for confirming that the parameter
> log is correct (multiple things seem to coincide, so I think we're OK there).
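>
> For that confirmation step, the plan is simply to read the old sdc's
> intact 0.90 superblock back (the drive name below is whatever it comes
> up as when I plug it in):
> ---
> # the ejected drive still carries the original 0.90 metadata at its end;
> # this prints chunk size, layout, device count and the role table
> mdadm --examine /dev/sdX
> ---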
>
> Given all the above, however, if we get the parameters to match we
> should get a filesystem that, beyond the first 1408 kB, corresponds to
> its state before the event - and that first 1408 kB doesn't matter,
> insofar as ext4 keeps redundant backups of at least the first 2060
> blocks, which is >> 1408 kB.
>
> The thing that I do NOT understand is that if this is the case, fsck
> with -b <high> should render an FS without any errors... so why
> am I getting inode metadata checksum errors? This is why I had
> originally posted to linux-ext4 ...
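>
> One thing I still mean to double-check on that front is which checksum
> features are actually switched on (same placeholder device name as
> above), i.e.:
> ---
> # metadata_csum / checksum seed, from the superblock header only
> dumpe2fs -h /dev/md0 | grep -Ei 'features|checksum'
> ---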
>
> Thanks,
> L


