Re: MD: Long delay for container drive removal

Hi,

On 2024/06/24 15:14, Mariusz Tkaczyk wrote:
On Thu, 20 Jun 2024 14:43:50 +0200
Mateusz Kusiak <mateusz.kusiak@xxxxxxxxxxxxxxx> wrote:

On 18.06.2024 16:24, Mateusz Kusiak wrote:
Hi all,
we have an issue submitted for SLES15SP6 that is caused by huge delays when
trying to remove a drive from a container.

The scenario is as follows:
1. Create a two-drive imsm container:
# mdadm --create --run /dev/md/imsm --metadata=imsm --raid-devices=2 /dev/nvme[0-1]n1
2. Remove a single drive from the container:
# mdadm /dev/md127 --remove /dev/nvme0n1

The problem is that drive removal may take up to 7 seconds, which causes
timeouts for other components that depend on mdadm.

We narrowed it down to being MD related. We tested this with inbox mdadm-4.3
and mdadm-4.2 on SP6 and the delay is pretty much the same. SP5 is free of
this issue.

I also tried RHEL 8.9 and drive removal is almost instant.

Is it default behavior now, or should we treat this as an issue?

Thanks,
Mateusz

I dug into this more. I retested this on:
- Ubuntu 24.04 with inbox kernel 6.6.0: no reproduction
- RHEL 9.4 with upstream kernel 6.9.5-1: reproduced
(Note that SLES15SP6 ships with 6.8.0-rc4 inbox.)

I attached to mdadm with gdb and found that the ioctl call in
hot_remove_disk() fails and is what causes the delay. The function looks as
follows:

int hot_remove_disk(int mdfd, unsigned long dev, int force)
{
	int cnt = force ? 500 : 5;
	int ret;

	/* HOT_REMOVE_DISK can fail with EBUSY if there are
	 * outstanding IO requests to the device.
	 * In this case, it can be helpful to wait a little while,
	 * up to 5 seconds if 'force' is set, or 50 msec if not.
	 */
	while ((ret = ioctl(mdfd, HOT_REMOVE_DISK, dev)) == -1 &&
	       errno == EBUSY &&
	       cnt-- > 0)
		sleep_for(0, MSEC_TO_NSEC(10), true);

	return ret;
}
... and if the ioctl fails, mdadm falls back to removing the drive via a sysfs write.
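
For context, that sysfs fallback boils down to writing "remove" to the member
device's state attribute under /sys. A minimal standalone sketch of the
mechanism (not mdadm's actual code; the md127/nvme0n1 names simply follow the
reproduction example above):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Write "remove" to /sys/block/<md>/md/dev-<member>/state to drop
	 * the member from the array/container. The write fails with EBUSY
	 * while the device still has outstanding IO. */
	const char *path = "/sys/block/md127/md/dev-nvme0n1/state";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "remove", strlen("remove")) < 0) {
		perror("write");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}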

Looks like a kernel ioctl issue...


Hello,
I investigated this. It looks like the HOT_REMOVE_DISK ioctl almost always fails for
an array with no raid personality. At some point it was allowed, but it was blocked
6 years ago in c42a0e2675 (this id points to a merge commit, so quoting the title "md:
fix NULL dereference of mddev->pers in remove_and_add_spares()").

And that explains why we have an outdated comment in mdadm:

		if (err && errno == ENODEV) {
			/* Old kernels rejected this if no personality
			 * is registered */

I'm working on fixing this in mdadm (for kernels with this hang); I will
remove the ioctl call for external containers:
https://github.com/md-raid-utilities/mdadm/pull/31
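
Roughly, the direction is along these lines (an illustrative sketch only, not the
actual PR; 'is_external' and 'sysfs_remove_member' are placeholders here):

	/* Illustrative only: for external-metadata containers, skip
	 * HOT_REMOVE_DISK entirely and go straight to the sysfs "remove"
	 * path, since the ioctl is known to stall on affected kernels. */
	err = -1;
	if (!is_external)
		err = hot_remove_disk(fd, dev, force);
	if (err)
		err = sysfs_remove_member(sysfd, force);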

On the HOT_REMOVE_DISK ioctl path, there is a wait for the MD_RECOVERY_NEEDED
flag to be cleared, with the timeout set to 5 seconds. When I disabled this for arrays
with no personality, it fixed the issue. However, I'm not sure it is the right fix. I
would expect MD_RECOVERY_NEEDED not to be set at all for arrays with no MD personality.
Kuai and Song, could you please advise?

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c0426a6d2fd1..bd1cedeb105b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7827,7 +7827,7 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
                return get_bitmap_file(mddev, argp);
         }

-       if (cmd == HOT_REMOVE_DISK)
+       if (cmd == HOT_REMOVE_DISK && mddev->pers)

This patch will work; however, I'm afraid it can't fix the problem
thoroughly, because this check is done without 'reconfig_mutex' held, and
mddev->pers can be set later, before hot_remove_disk().

After taking a look at commit 90f5f7ad4f38, which introduced the
waiting: it was trying to wait for a failed device to be removed by
md_check_recovery(). However, this doesn't make sense now, because
remove_and_add_spares() is called directly from hot_remove_disk(),
hence a failed device can be removed directly from the ioctl, without
md_check_recovery().
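
For reference, the current flow is roughly the following (a condensed
paraphrase of hot_remove_disk() in drivers/md/md.c; error handling and
clustered-md details are omitted):

static int hot_remove_disk(struct mddev *mddev, dev_t dev)
{
	struct md_rdev *rdev = find_rdev(mddev, dev);

	if (!rdev)
		return -ENXIO;

	if (rdev->raid_disk >= 0) {
		clear_bit(Blocked, &rdev->flags);
		/* the failed device is removed synchronously, right here in
		 * the ioctl path, without going through md_check_recovery() */
		remove_and_add_spares(mddev, rdev);
		if (rdev->raid_disk >= 0)
			return -EBUSY;	/* still active, can't remove */
	}

	md_kick_rdev_from_array(rdev);
	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
	return 0;
}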

The only thing that would prevent a failed device from being removed from the
array would be MD_RECOVERY_RUNNING; however, we can't wait for this flag to be
cleared, hence I'd suggest reverting that patch.

Thanks,
Kuai

                 /* need to ensure recovery thread has run */
                 wait_event_interruptible_timeout(mddev->sb_wait,
                                                  !test_bit(MD_RECOVERY_NEEDED,
                                                            &mddev->recovery),
                                                  msecs_to_jiffies(5000));


Thanks,
Mariusz
