Re: RAID5 failure and consequent ext4 problems

Hello Phil,
thank you BTW for your continued assistance. Here goes:

On Sat, Sep 10, 2022 at 11:18 AM Phil Turmel <philip@xxxxxxxxxx> wrote:
> Yes.  Same kernels are pretty repeatable for device order on bootup as
> long as all are present.  Anything missing will shift the letter
> assignments.
We need to keep this in mind, though the boot-log scsi target ->
letter assignments seem to indicate that we're clear, as discussed.
This is relevant since I have re-created the array.

> Okay, that should have saved you.  Except, I think it still writes all
> the meta-data.  With v1.2, that would sparsely trash up to 1/4 gig at
> the beginning of each device.
I dug into the docs and the wiki and ran some experiments on another
machine. Apparently, what 1.2 does with my kernel and my mdadm is use
sectors 9 to 80 of each device. Thus it borked 72 512-byte sectors ->
36 kB -> 9 filesystem blocks (4 kB each) per device, sparsely as you say.
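For reference, this is how I checked it (device name is just an example):
---
# the "Super Offset", "Data Offset", "Unused Space" and "Bad Block Log"
# lines show exactly which sectors the v1.2 metadata occupies
mdadm --examine /dev/sdX1
---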
Even with a 128 kB chunk, this is 'fine' for the first chunk: it
doesn't really matter, because yes, fsck detects that the block group
descriptors were nuked, but the superblock before them is fine (indeed,
tune2fs and dumpe2fs work 'as expected'), so fsck falls back to a
backup and is happy, even declaring the fs clean.

Therefore, out of the 12 'affected' areas, one doesn't matter for
practical purposes and we have to wonder about the others.  Arguably,
one of those should also be covered by parity, but I have no idea how
that will work out - it may actually be very important at the time of
any future resync.
Now, these damaged areas all fall in the first chunk of each device,
which together form the first 1408 kB of the filesystem (11 data
chunks x 128 kB chunk; remember the original creation is *old*), since
I believe mdraid preserves sequence, so the chunks are laid out in order.
We know the following from dumpe2fs:
---
Group 0: (Blocks 0-32767) csum 0x45ff [ITABLE_ZEROED]
  Primary superblock at 0, Group descriptors at 1-2096
  Block bitmap at 2260 (+2260), csum 0x824f8d47
  Inode bitmap at 2261 (+2261), csum 0xdadef5ad
  Inode table at 2262-2773 (+2262)
  0 free blocks, 8179 free inodes, 2 directories, 8179 unused inodes
---
So the first 2097 blocks are the primary superblock plus group
descriptors, all of which have backups elsewhere - this is *way* more
than the 1408 kB, therefore with restored BGDs (e2fsck -b 32768, say)
we should be... fine?
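Concretely, what I have in mind is something like this, read-only
first (-b 32768 assumes 4 kB blocks, hence -B 4096; /dev/mdX is a
placeholder):
---
# check-only pass against the first backup superblock
e2fsck -n -f -b 32768 -B 4096 /dev/mdX
---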

Now, if OTOH I do an e2fsck -nf, all sorts of weird stuff happens, but
I have to wonder whether that's because the BGDs are not happy. I am
tempted to run an overlay *for the fsck* - what do you think?
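By an overlay I mean the usual sparse-file + dm snapshot trick,
roughly this (names and sizes are placeholders):
---
# non-persistent snapshot: fsck's writes land in the COW file,
# never on the real array
SECTORS=$(blockdev --getsz /dev/mdX)
truncate -s 50G /tmp/mdX.ovl
LOOP=$(losetup -f --show /tmp/mdX.ovl)
dmsetup create mdX-ovl --table "0 $SECTORS snapshot /dev/mdX $LOOP N 8"
e2fsck -f /dev/mapper/mdX-ovl   # experiment here
dmsetup remove mdX-ovl          # tear down when done
---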

> Well, yes.  But doesn't matter for assembly attempts, which always go by
> the meta-data.  Device order only ever matters for --create when recreating.
Sure, but keep in mind that my --create commands nuked the original
0.90 metadata as well, so we need to be sure that the order is correct
or we'll have a real jumble.
Now, the cables have not been moved and the boot logs confirm that the
scsi targets correspond, so we should have the order right, and the
parameters are correct from the previous logs. Therefore, we 'should'
have the same data space as before.

> If you consistently used -o or --assume-clean, then everything beyond
> ~3G should be untouched, if you can get the order right.  Have fsck try
> backup superblocks way out.
fsck grabs a backup 'magically' and seems to be happy - unless I run
it with -nf, and then... all sorts of bad stuff happens.

> Please use lsdrv to capture names versus serial numbers.  Re-run it
> before any --create operation to ensure the current names really do
> match the expected serial numbers.  Keep track of ordering information
> by serial number.  Note that lsdrv will reliably line up PHYs on SAS
> controllers, so that can be trusted, too.
Thing is... I can't find lsdrv. As in: there is no lsdrv binary,
apparently, in Debian stable or in Debian testing. Where do I look for
it?
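As a stopgap, I assume something like the following covers the
name <-> serial mapping, but I'd rather use the real thing:
---
lsblk -d -o NAME,SERIAL,MODEL,SIZE        # whole disks with serials
ls -l /dev/disk/by-id/ | grep -v part     # serial/WWN symlinks -> sdX
---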

> Superblocks other than 0.9x and 1.0 place a bad block log and a written
> block bitmap between the superblock and the data area.  I'm not sure if
> any of the remaining space is wiped.  These would be written regardless of
> -o or --assume-clean.  Those flags "protect" the *data area* of the
> array, not the array's own metadata.
Yes - this is the damage I'm talking about above. From the logs, the
'area' is 4096 sectors, of which 4016 remain 'unused'. That leaves 80
sectors, with the first 8 untouched (and the proof is that the
superblock is 'happy' - though interestingly this should not be the
case, because the group 0 superblock is offset by 1024 bytes, so the
last 1024 bytes of the superblock should be borked too).
From this, my math above.
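(If it helps, a quick way to eyeball that the primary superblock
really is intact - assuming the usual 0xEF53 magic 56 bytes into the
superblock, which itself starts at byte 1024 of the fs; /dev/mdX is a
placeholder:)
---
dd if=/dev/mdX bs=1 skip=1080 count=2 2>/dev/null | xxd
# the magic is stored little-endian, so an intact superblock shows "53ef"
---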


>  > From, I think, the second --create of /dev/123, before I added the
>  > bitmap=none. This should, however, not have written anything with -o
>  > and --assume-clean, correct?
> False assumption.  As described above.
Two different things: what I meant was that even with that bitmap
message, the only thing that would have been written is the metadata.
The Linux RAID documentation states repeatedly that with -o no
resyncing or parity reconstruction is performed. Yes, agreed, the 1.2
metadata got written, but it's the only thing that got written since
the array was stopped by the error - if I am reading the docs
correctly?

> Okay.  To date, you've only done create with -o or --assume-clean?
>
> If so, it is likely your 0.90 superblocks are still present at the ends
> of the disks.
Problem is, as I mentioned above (see my previous email), I have ALSO
done --create with --metadata=0.90, which overwrote the original
superblocks.
HOWEVER, I do have the logs of the original parameters, and I have at
least one drive - the old sdc, which was spat out before this whole
thing - that can confirm the parameter log is correct (multiple things
seem to coincide, so I think we're OK there).
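(Reading the surviving 0.90 superblock off the old sdc is
straightforward, e.g.:)
---
# the old drive still carries the original 0.90 superblock; cross-check
# chunk size, layout and the device's slot against the logged parameters
mdadm --examine /dev/sdX    # whatever name the old sdc has now
---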

Given all the above, however, if we get the parameters to match, we
should get a filesystem that corresponds to the pre-event state beyond
the first 1408 kB - and those first 1408 kB don't matter, insofar as
ext4 keeps redundant backups for at least the first 2097 blocks, which
is >> 1408 kB.
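So the re-create I keep circling around would look roughly like this -
purely illustrative, the level/chunk/metadata values and the device
order have to come from the logs and the old sdc, and ideally it runs
against overlays first:
---
# NOT a literal command: names, order and parameters are placeholders
mdadm --create /dev/mdX --assume-clean --level=5 --chunk=128 \
      --metadata=0.90 --raid-devices=12 \
      /dev/sda1 /dev/sdb1 ... /dev/sdl1   # in the original slot order
---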

The thing that I do NOT understand is that, if this is the case,
e2fsck with -b <high backup superblock> should render an FS without
any errors... so why am I getting inode metadata checksum errors?
This is why I had originally posted in linux-ext4...
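(To pick sane values for -b, since the primary superblock is readable,
dumpe2fs can list the backups:)
---
dumpe2fs /dev/mdX | grep -i "superblock at"
---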

Thanks,
L


