Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export"

Li Xiao Keng <lixiaokeng@xxxxxxxxxx> · Wed, 15 Mar 2023 21:10:59 +0800

On 2023/3/15 18:10, Mariusz Tkaczyk wrote:
> On Tue, 14 Mar 2023 17:11:06 +0100
> Martin Wilck <mwilck@xxxxxxxx> wrote:
> 
>> On Tue, 2023-03-14 at 16:59 +0100, Mariusz Tkaczyk wrote:
>>> On Tue, 14 Mar 2023 16:04:23 +0100
>>> Martin Wilck <mwilck@xxxxxxxx> wrote:
>>>   
>>>> On Tue, 2023-03-14 at 22:58 +0800, Li Xiao Keng wrote:  
>>>>> Hi,
>>>>>    Here we meet a question. When we add a new disk to a raid, it
>>>>> may
>>>>> return
>>>>> -EBUSY.
>>>>>    The main process of --add（for example md0, sdf):
>>>>>        1.dev_open(sdf)
>>>>>        2.add_to_super
>>>>>        3.write_init_super
>>>>>        4.fsync(fd)
>>>>>        5.close(fd)
>>>>>        6.ioctl(ADD_NEW_DISK).
>>>>>    However, there will be some udev(change of sdf) event after
>>>>> step5.
>>>>> Then
>>>>> "/usr/sbin/mdadm --incremental --export $devnode --offroot
>>>>> $env{DEVLINKS}"
>>>>> will be run, and the sdf will be added to md0. After that, step6
>>>>> will
>>>>> return
>>>>> -EBUSY.
>>>>>    It is a problem to user. First time adding disk does not
>>>>> return
>>>>> success
>>>>> but disk is actually added. And I have no good idea to deal with
>>>>> it.
>>>>> Please
>>>>> give some great advice.    
>>>>
>>>> I haven't looked at the code in detail, but off the top of my head,
>>>> it
>>>> should help to execute step 5 after step 6. The close() in step 5
>>>> triggers the uevent via inotify; doing it after the ioctl should
>>>> avoid
>>>> the above problem.  
>>> Hi,
>>> That will result in EBUSY in everytime. mdadm will always handle
>>> descriptor and kernel will refuse to add the drive.  
>>
>> Why would it cause EBUSY? Please elaborate. My suggestion would avoid
>> the race described by Li Xiao Keng. I only suggested to postpone the
>> close(), nothing else. The fsync() would still be done before the
>> ioctl, so the metadata should be safely on disk when the ioctl is run.
> 
> Because device will be claimed in userspace by mdadm. MD may check if device
> is not claimed. I checked bind_rdev_to_array() and I don't find an obvious
> answer, so I could be wrong here.
> 
  I test move close() after ioctl(). The reason of EBUSY is that mdadm
open(sdf) with O_EXCL. So fd should be closed before ioctl. When I remove
O_EXCL, ioctl() will return success. I guess MD check if device is not
claimed in md_import_device()->lock_rdev()->blkdev_get_by_dev()->blkdev_get().

> Maybe someone more kernel experienced can speak here? Song, could you look?
> 
> I think that the descriptor opened by incremental block kernel from adding the
> device to the array but also the same incremental is able to add it later
> because metadata is on device. There is no retry in Manage_add() flow so it
> must be added by Incremental so the question is if it is already in
> array or disk is just claimed and that is the problem.
> 
> Eventually, you can test your proposal that should gives us an answer.
> 
>>
>> This is a recurring pattern. Tools that manipulate block devices must
>> be aware that close-after-write triggers an uevent, and should make
>> sure that they don't close() such files prematurely.
> 
> Agree. Mdadm has this problem, descriptors are opened and closed may times.

Yes, there is closing descriptors in write_init_super1, Kill(),
tst->ss->free_super(tst) between fsync() and ioctl().

>>
>>>>
>>>> Another obvious workaround in mdadm would be to check the state of
>>>> the
>>>> array in the EBUSY case and find out that the disk had already been
>>>> added.
>>>>
>>>> But again, this was just a high-level guess.
>>>>
>>>> Martin
>>>>   
>>>
>>> Hmm... I'm not a native expert but why we cannot write metadata after
>>> adding
>>> drive to array? Why kernel can't handle that?
>>>
> 
> Ok, there is an assumption in kernel that device MUST have metadata.
> 
>>> Ideally, we should lock device and block udev- I know that there is
>>> flock
>>> based API to do that but I'm not sure if flock() won't cause the same
>>> problem.  
>>
>> That doesn't work reliably. At least not in general. The mechanmism is
>> disabled for for dm devices (e.g. multipath), for example. See
>> https://github.com/systemd/systemd/blob/a5c0ad9a9a2964079a19a1db42f79570a3582bee/src/udev/udevd.c#L483
>>
> Yeah, I know but udev processes the disk, not MD device so the locking
> should work. But if we cannot trust it, we shouldn't follow this idea.
>>  
>>> There is also something like "udev-md-raid-creating.rules". Maybe we
>>> can reuse
>>> it?
>>>   
>>
>> Unless I am mistaken, these rules are exactly those that cause the
>> issue we are discussing. Since these rules are also part of the mdadm
>> package, it might be possible to set some flag under /run that would
>> indicate to the rules that auto-assembly should be skipped. But that
>> might be racy, too.
> 
> Yeah, bad idea. Agree.
> 
> New day, new ideas:
> - why we cannot let udev to add the device? just write metadata and finish,
>   udev should handle that because metadata is there.
  If we let udev to add the device, We cannot determine whether the disk is
added successfully from the command ("--add") return value. For example,
writing metadata succeeded, but udev failed.
> 
> - incremental uses map_lock() to prevent concurrent updates, I seems to b
>   missed for --add. That could be a key to prevent the behavior.Incremental is
>   checking if it can lock the map file. If not, it finishes gracefully. To lock
>   array we need to read metadata first, so it goes to the first question- is it
>   a problem that incremental has opened descriptor?
> 
> Thanks,
> Mariusz
> .
>