Re: RAID5 failure and consequent ext4 problems

Hi Luigi,

Mixed in responses (and trimmed):

On 9/9/22 18:50, Luigi Fabio wrote:
> By different kernels, maybe - but the kernel has been the same for
> quite a while (months).

Yes. The same kernel is quite repeatable for device discovery order at boot, as long as all devices are present. Anything missing shifts the letter assignments of everything enumerated after it.
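
If you want names that can't shift, the /dev/disk/by-id/ symlinks key off model and serial number rather than discovery order. For example:

    # whole-disk entries only; partitions carry a -partN suffix
    ls -l /dev/disk/by-id/ | grep -v -- -part

Those paths are safe to use directly in mdadm commands, too.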

> I did paste the whole of the command lines in the (very long) email,
> as David mentions (thanks!) - the first ones, the mistaken ones, did
> NOT have --assume-clean but they did have -o, so no parity activity
> should have started according to the docs?

Okay, that should have saved you. Except I think it still writes all the meta-data. With v1.2, that would sparsely trash up to 1/4 gig at the beginning of each device.

> A new thought came to mind: one of the HBAs lost a channel, right?
> What if on the subsequent reboot the devices that were on that channel
> got 'rediscovered' and shunted to the end of the letter order? That
> would, I believe, be ordinary operating procedure.

Well, yes. But that doesn't matter for assembly attempts, which always go by the meta-data. Device order only ever matters for --create when recreating.
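
You can see the slot each member claims right in its meta-data (device name illustrative):

    mdadm --examine /dev/sdb1

With 0.90 superblocks the "this" row of the device table carries the member's raid position; with v1.2 it's the "Device Role" line. Assembly uses those, not the probe order.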

> That would give us an almost-correct array, which would explain how
> fsck can get ... some pieces.

If you consistently used -o or --assume-clean, then everything beyond roughly the first 3G of the array should be untouched, provided you can get the order right. Have fsck try backup superblocks located far into the filesystem.
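
A sketch of that, with device name and block number illustrative:

    # list where mke2fs *would* place superblocks; -n writes nothing
    # (pass the same options you used at mkfs time for accurate output)
    mke2fs -n /dev/md0

    # dry-run against a far-out backup; -n opens read-only
    e2fsck -n -b 2654208 -B 4096 /dev/md0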

> Also, I am not quite brave enough (...) to use shortcuts when handling
> mdadm commands.

That's good.  But curly braces are safe.
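
For example (names illustrative), the shell expands braces deterministically, left to right, before mdadm ever sees them:

    mdadm --examine /dev/sd{b..m}1
    # identical to spelling out /dev/sdb1 /dev/sdc1 ... /dev/sdm1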

> I am reconstructing the port order (scsi targets, if you prefer) from
> the 20220904 boot log. I should at that point be able to have an exact
> order of the drives.

Please use lsdrv to capture names versus serial numbers. Re-run it before any --create operation to ensure the current names really do match the expected serial numbers. Keep track of ordering information by serial number. Note that lsdrv will reliably line up PHYs on SAS controllers, so that can be trusted, too.
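
Snapshots make drift obvious (this assumes the lsdrv script is in the current directory):

    ./lsdrv > lsdrv.before.txt
    # ... reboot, cabling change, etc. ...
    ./lsdrv > lsdrv.after.txt
    diff lsdrv.before.txt lsdrv.after.txt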

> Here it is:

[trim /]

> We have a SCSI target -> raid disk number correspondence.
> As of this boot, the letter -> scsi target correspondences match,
> shifted by one because as discussed 7:0:0:0 is no longer there (the
> old, 'faulty' sdc).

OK.

> Thus, having unambiguously determined the prior scsi target -> raid
> position, we can transpose it to the present drive letters, which are
> shifted by one.
> Therefore, we can generate, or rather have generated, a --create with
> the same software versions, the same settings and the same drive
> order. Is there any reason why, minus the 1.2 metadata overwriting
> which should have only affected 12 blocks, the fs should 'not' be as
> before?
> Genuine question, mind.

Superblocks other than 0.9x and 1.0 place a bad block log and a write-intent bitmap between the superblock and the data area. I'm not sure if any of the remaining space is wiped. These would be written regardless of -o or --assume-clean. Those flags "protect" the *data area* of the array, not the array's own meta-data.
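
You can see exactly how much room the v1.2 meta-data reserved in front of the data (device name illustrative):

    mdadm --examine /dev/sdb1 | grep -E 'Version|Super Offset|Data Offset'

Everything between the super offset and the data offset belongs to md: superblock, bad block log, and bitmap.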

On 9/9/22 19:04, Luigi Fabio wrote:
> A further question, in THIS boot's log I found:
> [ 9874.709903] md/raid:md123: raid level 5 active with 12 out of 12
> devices, algorithm 2
> [ 9874.710249] md123: bitmap file is out of date (0 < 1) -- forcing
> full recovery
> [ 9874.714178] md123: bitmap file is out of date, doing full recovery
> [ 9874.881106] md123: detected capacity change from 0 to 42945088192512
> From, I think, the second --create of /dev/123, before I added the
> bitmap=none. This should, however, not have written anything with -o
> and --assume-clean, correct?

False assumption.  As described above.

On 9/9/22 21:29, Luigi Fabio wrote:
> For completeness' sake, though it should not be relevant, here is the
> error that caused the mishap:

[trim /]

Noted, and helpful for correlating device names to PHYs.

Okay.  To date, you've only run --create with -o or --assume-clean?

If so, it is likely your 0.90 superblocks are still present at the ends of the disks.
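
You can confirm without writing anything; mdadm will look for a specific superblock format if you ask (device name illustrative):

    mdadm --examine --metadata=0.90 /dev/sdb1

If that prints a sane 0.90 superblock with your old array UUID, the ends of the disks survived.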

You will need to zero the v1.2 superblocks that have been placed on your partitions. Then attempt an --assemble and see if mdadm delivers the same message as before: identifying all of the members, but refusing to proceed due to mismatched event counts.
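
A sketch of the zeroing step, assuming the v1.2 superblock sits at its usual 4 KiB offset; confirm with --examine first, and note the device names are illustrative. Avoid --zero-superblock here, since it could find and destroy the 0.90 superblock you're trying to keep:

    # v1.2 superblock lives 4096 bytes into the member;
    # zero just that 4 KiB block, leaving the end of the disk alone
    dd if=/dev/zero of=/dev/sdb1 bs=4096 seek=1 count=1 conv=fsync

    # once every member is clean of v1.2 meta-data:
    mdadm --assemble /dev/md0 /dev/sd{b..m}1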

If so, repeat with --force.
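
That is, the same command with --force added, which permits mdadm to bump the stale event counts:

    mdadm --assemble --force /dev/md0 /dev/sd{b..m}1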

This procedure is safe to do without overlays, and will likely yield a running array.

Then you will have to fsck to fix up the borked beginning of your filesystem.
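
Dry-run first so you can gauge the damage before anything is written (array name illustrative):

    # -n answers "no" to every repair; nothing is modified
    e2fsck -fn /dev/md0

    # then the real pass, with -b <backup superblock> if the primary is gone
    e2fsck -f /dev/md0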

Phil



