Re: REGRESSION: [PATCH 4/4] block: freeze the queue earlier in del_gendisk

Logan Gunthorpe <logang@xxxxxxxxxxxx> · Fri, 8 Jul 2022 09:55:55 -0600

On 2022-07-08 00:01, Christoph Hellwig wrote:
> On Thu, Jul 07, 2022 at 11:41:40PM -0600, Logan Gunthorpe wrote:
>> I'm not really sure why this is yet, but this patch in rc4 causes some
>> random failures with mdadm tests.
>>
>> It seems the 11spare-migration test starts failing roughly every other
>> run because the block device is not quite cleaned up after mdadm --stop
>> by the time the next mdadm --create command starts, or rather there
>> appears to be a race now between the newly created device and the one
>> being cleaned up. This results in an infrequent sysfs panic with a
>> duplicate filename error (see the end of this email).
>>
>> I managed to bisect this and found a09b314005f3a09 to be the problematic
>> commit.
> 
> Taking a look at the mddev code this commit just seems to increase the
> race window of hitting horrible lifetime problems in md, but I'll also
> try to reproduce and verify it myself.
> 
> Take a look at how md searches for a duplicate name in md_alloc,
> mddev_alloc_unit and mddev_find_locked based on the all_mddevs list,
> and how the mddev gets dropped from all_mddevs very early and long
> before the gendisk is gone in mddev_put.  I think what needs to be
> done is to implement a free_disk method and drop the mddev (and free it)
> from that.  But given how much intricate mess is based on all_mddevs
> we'll have to be very careful about that.

I agree it's a mess, probably buggy and could use a cleanup with a
free_disk method. But I'm not sure the all_mddevs lifetime issues are the
problem here. If the entry in all_mddevs outlasts the disk, then
md_alloc() will just fail earlier. Many test scripts rely on the fact
that you can stop an mddev and recreate it immediately after. We need
some way of ensuring any deleted disks are fully deleted before trying
to make a new mddev, in case the new one has the same name as one being
deleted.

The md code deletes the disk in mddev_delayed_delete(), a work queue item
on md_misc_wq. That queue is flushed first in md_alloc(), but somehow,
some of the disk is still not fully deleted by the time
flush_workqueue() returns. I'm not sure why that would be.
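
To make the ordering I'm describing concrete, here is a toy userspace
model in C. It is only a sketch and not the md code: struct model,
model_create(), model_stop() and model_flush() are hypothetical names
made up for illustration, the "sysfs" set just stands in for the name
the disk holds until it is deleted, and the pending set stands in for
deletion work queued on a work queue. It shows what the flush before
create is supposed to guarantee. It does not explain why the real
flush is not enough.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_NAMES 8
#define NAME_LEN  16

/* Toy name set; stands in for all_mddevs, sysfs names, or queued work. */
struct name_set {
	char names[MAX_NAMES][NAME_LEN];
	bool used[MAX_NAMES];
};

static int set_find(const struct name_set *s, const char *name)
{
	for (int i = 0; i < MAX_NAMES; i++)
		if (s->used[i] && strcmp(s->names[i], name) == 0)
			return i;
	return -1;
}

static void set_add(struct name_set *s, const char *name)
{
	assert(strlen(name) < NAME_LEN);
	for (int i = 0; i < MAX_NAMES; i++) {
		if (!s->used[i]) {
			strcpy(s->names[i], name);
			s->used[i] = true;
			return;
		}
	}
	assert(!"name_set full"); /* toy model: capacity is fixed */
}

static void set_remove(struct name_set *s, const char *name)
{
	int i = set_find(s, name);

	if (i >= 0)
		s->used[i] = false;
}

/* Hypothetical model state, not the kernel's data structures. */
struct model {
	struct name_set all_mddevs;     /* names md considers in use */
	struct name_set sysfs;          /* names still held by a disk */
	struct name_set pending_delete; /* queued disk deletions */
};

enum create_result { CREATE_OK, CREATE_EXISTS, CREATE_SYSFS_DUP };

/* Run all queued deletions: only now does the disk name go away. */
static void model_flush(struct model *m)
{
	for (int i = 0; i < MAX_NAMES; i++) {
		if (m->pending_delete.used[i]) {
			set_remove(&m->sysfs, m->pending_delete.names[i]);
			m->pending_delete.used[i] = false;
		}
	}
}

/* Stop: the name leaves all_mddevs at once, disk deletion is deferred. */
static void model_stop(struct model *m, const char *name)
{
	set_remove(&m->all_mddevs, name);
	set_add(&m->pending_delete, name);
}

/* Create: only all_mddevs is checked up front; sysfs can still collide. */
static enum create_result model_create(struct model *m, const char *name,
				       bool flush_first)
{
	if (flush_first)
		model_flush(m);
	if (set_find(&m->all_mddevs, name) >= 0)
		return CREATE_EXISTS;
	if (set_find(&m->sysfs, name) >= 0)
		return CREATE_SYSFS_DUP;
	set_add(&m->all_mddevs, name);
	set_add(&m->sysfs, name);
	return CREATE_OK;
}
```

In this model, create, stop, create without a flush hits the duplicate
name, while the same sequence with the flush succeeds. The failure I'm
seeing looks like the first case even though the flush does run, which
is the part I can't explain yet.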

Logan