Re: [PATCH v3] Fix race of "mdadm --add" and "mdadm --incremental"

Mariusz Tkaczyk <mariusz.tkaczyk@xxxxxxxxxxxxxxx> · Mon, 11 Sep 2023 18:00:32 +0200

On Thu, 7 Sep 2023 19:37:44 +0800
Li Xiao Keng <lixiaokeng@xxxxxxxxxx> wrote:

> There is a raid1 with sda and sdb. And we add sdc to this raid,
> it may return -EBUSY.
> 
> The main process of --add:
> 1. dev_open（sdc) in Manage_add
> 2. store_super1(st, di->fd) in write_init_super1
> 3. fsync(fd) in store_super1
> 4. close(di->fd) in write_init_super1
> 5. ioctl(ADD_NEW_DISK)
> 
> Step 2 and 3 will add sdc to metadata of raid1. There will be
> udev(change of sdc) event after step4. Then "/usr/sbin/mdadm
> --incremental --export $devnode --offroot $env{DEVLINKS}"
> will be run, and the sdc will be added to the raid1. Then
> step 5 will return -EBUSY because it checks if device isn't
> claimed in md_import_device()->lock_rdev()->blkdev_get_by_dev()
> ->blkdev_get().  
> 
> It will be confusing for users because sdc is added first time.
> The "incremental" will get map_lock before add sdc to raid1.
> So we add map_lock before write_init_super in "mdadm --add"
> to fix the race of "add" and "incremental".
> 
> Signed-off-by: Li Xiao Keng <lixiaokeng@xxxxxxxxxx>
> Signed-off-by: Guanqin Miao <miaoguanqin@xxxxxxxxxx>

Hello Li Xiao Keng,
Thanks for the patch, I'm trying to get my head around this.
As I understand, you added a mapfile locking and you are failing when the log
is already claimed, but you prefer to return -1 and print error instead of
returning EBUSY, right?

If yes, it has nothing to do with fixing. It is just a obscure hack to make it,
let say more reliable. We can either get EBUSY and know that is expected. I
see no difference, your solution is same confusing as current implementation.

Also, you used map_lock(). I remember that initially I proposed - my bad.
We also rejected udev locking, as a buggy. In fact- I don't see ready
synchronization method to avoid this race in mdadm right now.

The best way, I see is:
1. Write metadata.
2. Give udev chance to process the device.
3. Add it manually if udev failed (or D_NO_UDEV).

If it does not work for you, we need to add synchronization, I know that there
is pidfile used for mdmon, perhaps we can reuse it? We don't need something
fancy, just something to keep all synchronised and block incremental to led
--add finish the action.

We cannot use map_lock() and map_unlock() in this context. It claims lock on
mapfile which describes every array in your system (base details, there is no
info about members), see /var/run/mdadm. Add is just a tiny action to add disk
and we are not going to update the map. Taking this lock will make --add
more vulnerable, for example it will fail because another array is assembled or
created in the same time (I know that it will be hard to achieve but it is
another race - we don't want it).

And we cannot allow for partially done action, you added this lock after
writing metadata (according to description), we are not taking this lock to
complete the action, just to hide race. Lock should be taken before
writing metadata, but we know that it will not hide an issue.

In this for it cannot be merged I don't see a gain, we are returning -1 instead
-22, what is the difference? I don't see error message as more meaningful now.

Thanks,
Mariusz