Re: 5.18: likely useless very preliminary bug report: mdadm raid-6 boot-time assembly failure

Try an fdisk -l /dev/sda4 (to see if there is a partition table on the
partition). That kind of breakage comes and goes.

So long as it does not show a partition entry with Start and End
values, you are OK.

It will look like this. If you did all of the partitioning work on
your disks yourself, then the mistake was probably not made.

In the example below you could have an LVM device (or an md-raid
device) on sdfe1 whose existence the nested partition table hides.

And if sdfe1p1 is found and configured, then it blocks/hides anything
on sdfe1 itself; whether that happens depends on both the kernel
scanning for partitions and userspace tools scanning for partitions.
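
One quick way to see whether the kernel side has picked up a nested
table is to look at /proc/partitions (device names here follow the
made-up example below):

grep sdfe /proc/partitions

If an sdfe1p1 line shows up alongside sdfe and sdfe1, the kernel has
scanned and configured the nested partition.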

fdisk -l /dev/sdfe

Disk /dev/sdfe: 128.8 GB, 128849018880 bytes, 251658240 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 16384 bytes / 16777216 bytes
Disk label type: dos
Disk identifier: 0xxxxxx

    Device Boot      Start         End      Blocks   Id  System
/dev/sdfe1           32768   251658239   125812736   83  Linux

# fdisk -l /dev/sdfe1

Disk /dev/sdfe1: 128.8 GB, 128832241664 bytes, 251625472 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 16384 bytes / 16777216 bytes
Disk label type: dos
Disk identifier: 0xxxxxxx

      Device Boot      Start         End      Blocks   Id  System
/dev/sdfe1p1           32768   251625471   125796352   8e  Linux LVM
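
If a stray nested table like the one above does turn up, wipefs is a
safe way to double-check what signatures are actually on the partition
before deciding what (if anything) to erase; run without options it
only lists what it finds and writes nothing (device name again from
the made-up example):

wipefs /dev/sdfe1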

My other thought was that maybe some change caused the partition type
to start getting used for something, and a partition with the wrong
type would then be ignored.
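
To see what type each member partition carries, something like this
should do it (assuming a reasonably recent util-linux with the
PARTTYPE column):

lsblk -o NAME,PARTTYPE,FSTYPE /dev/sda

On an MBR disk, fd is the old "Linux raid autodetect" type and 83
plain Linux; mdadm with 1.2 superblocks has never required fd, but a
tool that started honouring the type could behave differently.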

You might try a file -s /dev/sde1 against each partition that should
hold an mdadm member, and make sure it reports an md superblock and
that there is not some other header confusing the issue.
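
A minimal loop over the members, with device names taken from the
mdstat listing further down (adjust to taste):

for p in /dev/sd[abcdf]3 /dev/sd[abcdf]4; do file -s "$p"; done

Each one should report something like "Linux Software RAID version
1.2".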

I tried this on some of mine, and some of my working mdadm member
devices report weird things.
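
Since file(1) can be confused by stale headers, mdadm --examine is a
less ambiguous cross-check, because it parses the md superblock itself
(same assumed device names):

mdadm --examine /dev/sda4

That should print the array UUID, level and device role if a valid
superblock is there, and complain if not.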

On Wed, Jul 20, 2022 at 12:31 PM Nix <nix@xxxxxxxxxxxxx> wrote:
>
> On 19 Jul 2022, Guoqing Jiang spake thusly:
>
> > On 7/18/22 8:20 PM, Nix wrote:
> >> So I have a pair of RAID-6 mdraid arrays on this machine (one of which
> >> has a bcache layered on top of it, with an LVM VG stretched across
> >> both). Kernel 5.16 + mdadm 4.0 (I know, it's old) works fine, but I just
> >> rebooted into 5.18.12 and it failed to assemble. mdadm didn't display
> >> anything useful: an mdadm --assemble --scan --auto=md --freeze-reshape
> >> simply didn't find anything to assemble, and after that nothing else was
> >> going to work. But rebooting into 5.16 worked fine, so everything was
> >> (thank goodness) actually still there.
> >>
> >> Alas I can't say what the state of the blockdevs was (other than that
> >> they all seemed to be in /dev, and I'm using DEVICE partitions so they
> >> should all have been spotted).
> >
> > I suppose the array was built on top of partitions, then my wild guess is
> > the problem is caused by the change in block layer (1ebe2e5f9d68?),
> > maybe we need something similar in loop driver per b9684a71.
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index c7ecb0bffda0..e5f2e55cb86a 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -5700,6 +5700,7 @@ static int md_alloc(dev_t dev, char *name)
> >         mddev->queue = disk->queue;
> >         blk_set_stacking_limits(&mddev->queue->limits);
> >         blk_queue_write_cache(mddev->queue, true, true);
> > +       set_bit(GD_SUPPRESS_PART_SCAN, &disk->state);
> >         disk->events |= DISK_EVENT_MEDIA_CHANGE;
> >         mddev->gendisk = disk;
> >         error = add_disk(disk);
>
> I'll give it a try. But... the arrays, fully assembled:
>
> Personalities : [raid0] [raid6] [raid5] [raid4]
> md125 : active raid6 sda3[0] sdf3[5] sdd3[4] sdc3[2] sdb3[1]
>       15391689216 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
>
> md126 : active raid6 sda4[0] sdf4[5] sdd4[4] sdc4[2] sdb4[1]
>       7260020736 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
>       bitmap: 0/2 pages [0KB], 1048576KB chunk
>
> md127 : active raid0 sda2[0] sdf2[5] sdd2[3] sdc2[2] sdb2[1]
>       1310064640 blocks super 1.2 512k chunks
>
> unused devices: <none>
>
> so they are on top of partitions. I'm not sure suppressing a partition
> scan will help... but maybe I misunderstand.
>
> --
> NULL && (void)


