Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm --incremental --export"

On Tue, 14 Mar 2023 17:11:06 +0100
Martin Wilck <mwilck@xxxxxxxx> wrote:

> On Tue, 2023-03-14 at 16:59 +0100, Mariusz Tkaczyk wrote:
> > On Tue, 14 Mar 2023 16:04:23 +0100
> > Martin Wilck <mwilck@xxxxxxxx> wrote:
> >   
> > > On Tue, 2023-03-14 at 22:58 +0800, Li Xiao Keng wrote:  
> > > > Hi,
> > > >    Here we meet a question. When we add a new disk to a raid, it
> > > > may return -EBUSY.
> > > >    The main process of --add (for example md0, sdf):
> > > >        1. dev_open(sdf)
> > > >        2. add_to_super
> > > >        3. write_init_super
> > > >        4. fsync(fd)
> > > >        5. close(fd)
> > > >        6. ioctl(ADD_NEW_DISK)
> > > >    However, there will be a udev event (change of sdf) after
> > > > step 5. Then "/usr/sbin/mdadm --incremental --export $devnode
> > > > --offroot $env{DEVLINKS}" will be run, and sdf will be added to
> > > > md0. After that, step 6 will return -EBUSY.
> > > >    This is a problem for the user: the first attempt to add the
> > > > disk does not report success, but the disk is actually added. I
> > > > have no good idea how to deal with it. Please give some advice.
> > > 
> > > I haven't looked at the code in detail, but off the top of my head,
> > > it should help to execute step 5 after step 6. The close() in step 5
> > > triggers the uevent via inotify; doing it after the ioctl should
> > > avoid the above problem.  
> > Hi,
> > That will result in EBUSY every time: mdadm will still be holding the
> > descriptor, and the kernel will refuse to add the drive.  
> 
> Why would it cause EBUSY? Please elaborate. My suggestion would avoid
> the race described by Li Xiao Keng. I only suggested to postpone the
> close(), nothing else. The fsync() would still be done before the
> ioctl, so the metadata should be safely on disk when the ioctl is run.

Because the device will be claimed in userspace by mdadm, and MD may check
that the device is not claimed. I checked bind_rdev_to_array() and didn't
find an obvious answer, so I could be wrong here.
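
Just to illustrate what I mean by "claimed" (a minimal sketch, the device
path is only an example): opening a block device with O_EXCL takes an
exclusive claim, and the open fails with EBUSY while another holder, a
process or the kernel itself, already has one:

/* Minimal sketch: probe whether a block device is exclusively claimed.
 * On Linux, open(2) with O_EXCL on a block device takes an exclusive
 * claim and fails with EBUSY while another holder (a process, or the
 * kernel, e.g. an md array that already bound the rdev) exists.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        const char *dev = "/dev/sdf";           /* example device */
        int fd = open(dev, O_RDONLY | O_EXCL);

        if (fd < 0) {
                if (errno == EBUSY)
                        printf("%s is exclusively claimed\n", dev);
                else
                        perror("open");
                return 1;
        }
        printf("%s was not claimed; we hold the exclusive claim now\n", dev);
        close(fd);                              /* closing releases the claim */
        return 0;
}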

Maybe someone with more kernel experience can speak up here? Song, could
you take a look?

I think that the descriptor opened by --incremental blocks the kernel from
adding the device to the array, but that same --incremental run is able to
add it later because the metadata is already on the device. There is no
retry in the Manage_add() flow, so the device must have been added by
Incremental. So the question is whether it is already in the array at that
point, or the disk is merely claimed, and that is the problem.

Alternatively, you could test your proposal; that should give us an answer.
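
For reference, this is how I read the proposal, as a rough sketch only and
not mdadm's actual code; write_metadata() is just a placeholder for
add_to_super()/write_init_super(), and whether holding the descriptor
across the ioctl trips the kernel's own exclusive claim is exactly the
open question above:

/* Rough sketch of the proposed reordering: keep the component device
 * open across ADD_NEW_DISK so the close-after-write uevent (and the
 * incremental rule it triggers) fires only after the disk is already
 * in the array.  write_metadata() is a placeholder, not mdadm code.
 */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/major.h>
#include <linux/raid/md_u.h>

static void write_metadata(int fd)
{
        /* placeholder for add_to_super() + write_init_super() */
        (void)fd;
}

int add_disk_reordered(int md_fd, const char *devname, mdu_disk_info_t *dinfo)
{
        int disk_fd = open(devname, O_RDWR);            /* step 1 */
        int rc;

        if (disk_fd < 0)
                return -1;

        write_metadata(disk_fd);                        /* steps 2-3 */
        fsync(disk_fd);                                 /* step 4: metadata on disk */

        rc = ioctl(md_fd, ADD_NEW_DISK, dinfo);         /* old step 6 */

        close(disk_fd);                                 /* old step 5: the uevent fires
                                                         * only now, with sdf in md0 */
        return rc;
}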

> 
> This is a recurring pattern. Tools that manipulate block devices must
> be aware that close-after-write triggers an uevent, and should make
> sure that they don't close() such files prematurely.

Agreed. mdadm has this problem; descriptors are opened and closed many times.
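
For anyone following along, the trigger Martin means is udevd's inotify
watch on the device node; a small illustration of the same mechanism (not
udevd's code, and /dev/sdf is only an example):

/* Illustration of the close-after-write trigger: udevd keeps an inotify
 * watch on (most) block device nodes and synthesizes a "change" uevent
 * when it sees IN_CLOSE_WRITE.  This is a demo of the same watch, not
 * udevd's code; /dev/sdf is only an example.
 */
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
        char buf[4096];
        int fd = inotify_init1(0);

        if (fd < 0 || inotify_add_watch(fd, "/dev/sdf", IN_CLOSE_WRITE) < 0) {
                perror("inotify");
                return 1;
        }
        /* blocks until something that had /dev/sdf open for writing closes it */
        if (read(fd, buf, sizeof(buf)) > 0)
                printf("IN_CLOSE_WRITE: udev would emit a change event now\n");
        close(fd);
        return 0;
}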
> 
> > > 
> > > Another obvious workaround in mdadm would be to check the state of
> > > the array in the EBUSY case and find out that the disk had already
> > > been added.
> > > 
> > > But again, this was just a high-level guess.
> > > 
> > > Martin
> > >   
> > 
> > Hmm... I'm not a native-metadata expert, but why can't we write the
> > metadata after adding the drive to the array? Why can't the kernel
> > handle that?
> > 

OK, there is an assumption in the kernel that the device MUST already have
metadata.

> > Ideally, we should lock the device and block udev. I know that there
> > is a flock-based API to do that, but I'm not sure whether flock()
> > won't cause the same problem.  
> 
> That doesn't work reliably, at least not in general. The mechanism is
> disabled for dm devices (e.g. multipath), for example. See
> https://github.com/systemd/systemd/blob/a5c0ad9a9a2964079a19a1db42f79570a3582bee/src/udev/udevd.c#L483
> 
Yeah, I know, but udev processes the disk, not the MD device, so the
locking should work. Still, if we cannot trust it, we shouldn't follow
this idea.
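
To be concrete about the flock-based API, here is a minimal sketch of the
convention (systemd's "Locking Block Device Access"), assuming we lock the
whole-disk node; the device path is only an example. udevd's workers try
LOCK_SH|LOCK_NB while processing an event for the disk, so holding LOCK_EX
postpones the change event until we drop the lock:

/* Minimal sketch of the flock convention: udevd's worker tries
 * LOCK_SH | LOCK_NB on the whole-disk node while processing an event,
 * so holding LOCK_EX here postpones the change event (and the
 * incremental assembly rule) until we drop the lock.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/dev/sdf", O_RDONLY | O_CLOEXEC);    /* example device */

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (flock(fd, LOCK_EX) < 0) {           /* wait until we own the disk */
                perror("flock");
                return 1;
        }

        /* ... write metadata, issue ADD_NEW_DISK, etc. ... */

        flock(fd, LOCK_UN);                     /* releasing (or closing) lets udev
                                                 * process the queued change event */
        close(fd);
        return 0;
}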
>  
> > There is also something like "udev-md-raid-creating.rules". Maybe we
> > can reuse it?
> >   
> 
> Unless I am mistaken, these rules are exactly those that cause the
> issue we are discussing. Since these rules are also part of the mdadm
> package, it might be possible to set some flag under /run that would
> indicate to the rules that auto-assembly should be skipped. But that
> might be racy, too.

Yeah, bad idea. Agreed.

New day, new ideas:
- Why can't we let udev add the device? Just write the metadata and
  finish; udev should handle the rest because the metadata is there.

- Incremental uses map_lock() to prevent concurrent updates; it seems to
  be missing for --add. That could be the key to preventing this behavior
  (rough sketch below). Incremental checks whether it can lock the map
  file, and if it can't, it finishes gracefully. To lock the array we need
  to read the metadata first, so it comes back to the first question: is
  it a problem that incremental holds an open descriptor?
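
To show what I mean by the second point, a rough flock()-based sketch of
the kind of serialization map_lock() gives us; this is not mdadm's actual
map_lock()/map_unlock() code and the lock path is hypothetical:

/* Rough sketch (not mdadm's actual map_lock()/map_unlock()) of the kind
 * of serialization meant above: --add and --incremental take the same
 * advisory lock before touching the array, so the loser backs off
 * gracefully.  The lock path is hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

#define MAP_LOCK_PATH "/run/mdadm/map.lock"     /* hypothetical lock file */

static int take_map_lock(void)
{
        int fd = open(MAP_LOCK_PATH, O_RDWR | O_CREAT | O_CLOEXEC, 0600);

        if (fd < 0)
                return -1;
        if (flock(fd, LOCK_EX | LOCK_NB) < 0) { /* someone else is updating */
                close(fd);
                return -1;
        }
        return fd;                              /* hold it for the whole --add flow */
}

static void drop_map_lock(int fd)
{
        flock(fd, LOCK_UN);
        close(fd);
}

int main(void)
{
        int lock_fd = take_map_lock();

        if (lock_fd < 0) {
                fprintf(stderr, "map is locked, backing off gracefully\n");
                return 0;
        }
        /* ... metadata write + ADD_NEW_DISK would go here ... */
        drop_map_lock(lock_fd);
        return 0;
}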

Thanks,
Mariusz