Re: regression: drive was detected as raid member due to metadata on partition

On Wed, 29 May 2024 00:57:17 +0200
Sven Köhler <sven.koehler@xxxxxxxxx> wrote:

> Hi Mariusz,
> 
> On 07.05.24 at 09:32, Mariusz Tkaczyk wrote:
> > On Tue, 9 Apr 2024 01:31:35 +0200
> > Sven Köhler <sven.koehler@xxxxxxxxx> wrote:
> >   
> >> I strongly believe that mdadm should ignore any metadata - regardless of
> >> the version - that is at a location owned by any of the partitions.  
> > 
> > That would require mdadm to understand the GPT partition table, not only
> > clone it. We have GPT support to clone the GPT metadata (see super-gpt.c).
> > It should save us from such issues, so you have my ack if you want to do
> > this.
> 
> I get your point, but that approach seems wrong to me. I wonder whether the 
> kernel has some interface to gather information on the partitions of a device. 
> After all, the kernel knows lots of partition table types (MBR, GPT, ...).

Hi Sven,
It might be too early to rely on the kernel. The kernel initializes partitions on
open (generally triggered by udev), and mdadm is invoked by udev at the same time,
so the partition may or may not be there (in sysfs) yet. I think there is a
possible race.

That is what I remember, but I last looked at this a few years ago. I hope it is
helpful.
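For illustration only (this is not something mdadm does today), the kernel's view
of a partition is exposed in sysfs once it has been initialized, e.g. for sda1:

cat /sys/block/sda/sda1/start   # first sector of sda1 on the disk
cat /sys/block/sda/sda1/size    # length of sda1 in 512-byte sectors

But as said above, during the udev "add" event for the disk these files may not
exist yet, so relying on them from within the mdadm udev rules would be racy.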

> 
> > But... GPT should have a secondary header located at the end of the device,
> > so your metadata should not be at the end. Are you using a GPT or MBR
> > parttable? Maybe a missing secondary GPT header is the reason?
> 
> I just checked. My disks don't have a GPT backup at the end. I might 
> have converted an MBR partition table to GPT at some point. That would not 
> create a backup GPT if the space is already occupied by a partition.
> 
> That said, for the sake of argument, I might just as well be using an 
> MBR partition table.

Yeah, makes sense.
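For what it is worth, one way to check for the backup GPT header is to dump the
last sector of the disk and look for the "EFI PART" signature (a sketch, assuming
512-byte logical sectors; /dev/sda is only an example device):

dd if=/dev/sda bs=512 count=1 skip=$(( $(blockdev --getsz /dev/sda) - 1 )) 2>/dev/null | hexdump -C | head -n 1

If a backup GPT header is present, the dump starts with the ASCII string
"EFI PART".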

> 
> >> While I'm not 100% sure how to implement that, the following might also
> >> work: first scan the partitions for metadata, then ignore the parent
> >> device if it has metadata with a UUID previously found on a partition.
> > 
> > No, it is not an option. In the udev world, you should only operate on the
> > device you are processing, so we should avoid referencing the rest of the
> > system.
> 
> Hmm, I think I know what you mean.
> 
> > BTW, to avoid this issue you can leave a few bytes empty at the end of the
> > disk: simply make your last partition end a few bytes before the end of the
> > drive. With that, the metadata will not be recognized directly on the drive.
> > That is at least what I expect, but I'm not experienced with native
> > metadata, so please be aware of that.
> 
> I verified that my last partition ends at the last sector of the disk. 
> Pretty sure that means it must have been an MBR partition table once upon a time.
> 
> This is not about me. I'm not asking you to support my case just for the 
> sake of having my system work. I already converted to metadata 1.2, and that 
> fixed the issue regardless of where the last partition ends.
> 
> It's a regression, in the sense that my system had worked for years and 
> after an upgrade suddenly didn't. I'd like to prevent the same from 
> happening to others. It was pretty scary, even though no data seems to 
> have been lost.

Great open source attitude!
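By the way, for anyone who wants to check whether a 0.90 superblock overlaps
their last partition: if I remember the 0.90 layout correctly, the superblock
sits in the last 64 KiB of the device, rounded down to a 64 KiB boundary. A
rough sketch (again, /dev/sda is only an example):

SECTORS=$(blockdev --getsz /dev/sda)    # whole-disk size in 512-byte sectors
SB=$(( (SECTORS & ~127) - 128 ))        # 0.90 superblock start (128 sectors = 64 KiB)
echo "0.90 superblock would start at sector $SB"

If the last partition ends at (or very near) the end of the disk, a superblock at
the end of the partition can also be found at the whole-disk location, which is
exactly the ambiguity hit here.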

> 
> >> I did the right thing and converted my RAID arrays to metadata 1.2, but
> >> I'd like to save others from the adrenaline shock.
> > 
> > There are reasons why we introduced v1.2, located at the beginning of the
> > device. You can try to fix it, but I think that you should just follow
> > upstream and choose 1.2 if you can.
> 
> Yes, I agree with you. That's why I migrated to 1.2 already.
> 
> > As more and more setups use 1.2, we naturally care less about 0.9,
> > especially about workarounds in other utilities. We cannot control
> > whether legacy workarounds are still there (the root cause of this change
> > may be outside md/mdadm, you never know :)).
> 
> Likely, the reason is outside of the mdadm binary but inside the mdadm 
> repo. Arch Linux uses the udev rules provided by the mdadm package 
> without modification. The diff on the udev rules between the mdadm 4.2 and 
> 4.3 releases is significant. Both invoke mdadm -If $name, but the order has 
> likely changed.
> 
> An investigation of that is still pending. I'm not an expert in udev 
> debugging, and the logs don't show anything useful.

Slowly you will figure it out. I have debugged udev a few times, but every time I
get something wrong and it does not work :)
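Two things that usually help me (a sketch; adjust the device path to your disk):
"udevadm test" dry-runs the rules for a device and prints which rules matched,
and the udev daemon's log level can be raised at runtime:

udevadm test --action=add /sys/block/sda 2>&1 | grep -i mdadm
udevadm control --log-priority=debug    # then watch journalctl -u systemd-udevd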

> 
> > So cases like this will always come up. It is right to use 1.2 now, as it
> > is better supported, if you don't have a strong need to stay with 0.9.
> 
> Would it be possible to have automated tests for incremental RAID 
> assembly via the udev rules? I'm not an expert in udev though.

Yes, it is possible. The simplest way is to synthesize an "add" event, for example:
echo add > /sys/block/nvme1n1/uevent

I don't know if it is a reliable way, but I use it from time to time.
mdadm does it this way too.
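To see what the rules actually do with such an event, you can watch from a second
terminal while triggering it (nvme1n1 as above is just an example):

udevadm monitor --udev --property    # in one terminal: show processed events
echo add > /sys/block/nvme1n1/uevent # in another: synthesize the event

udevadm trigger --action=add /sys/block/nvme1n1 should be equivalent and saves
the manual write to sysfs.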

Mariusz

> 
> 
> > Anyway, patches are always welcome!
> 
> Still working on my udev debugging skills. But afterwards, I may very 
> well prepare a patch.
> 
> 
> 
> Best,
>    Sven
