On 2024/9/3 at 3:22 PM, Mariusz Tkaczyk wrote:
On Tue, 3 Sep 2024 08:53:42 +0800
Xiao Ni <xni@xxxxxxxxxx> wrote:
On Mon, Sep 2, 2024 at 6:14 PM Mariusz Tkaczyk
<mariusz.tkaczyk@xxxxxxxxxxxxxxx> wrote:
On Wed, 28 Aug 2024 10:11:44 +0800
Xiao Ni <xni@xxxxxxxxxx> wrote:
Disks need to be removed when reshaping from raid456 to raid0. In
kernel space, removing a disk sets MD_RECOVERY_RUNNING, and the level
change then fails. So wait for some time to let the md thread clear this flag.
This was found by test case 05r6tor0.
Signed-off-by: Xiao Ni <xni@xxxxxxxxxx>
---
Grow.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/Grow.c b/Grow.c
index 2a7587315817..aaf349e9722f 100644
--- a/Grow.c
+++ b/Grow.c
@@ -3028,6 +3028,12 @@ static int impose_level(int fd, int level, char *devname, int verbose)
 makedev(disk.major, disk.minor));
 hot_remove_disk(fd, makedev(disk.major, disk.minor), 1);
 }
+ /*
+ * hot_remove_disk lets kernel set MD_RECOVERY_RUNNING
+ * and it can't set level. It needs to wait sometime
+ * to let md thread to clear the flag.
+ */
+ sleep_for(5, 0, true);
Hi Mariusz
Shouldn't we check sysfs at shorter intervals? I know a sleep is the simplest
way, but long sleeps are generally not good.
I will merge it if you don't want to rework it, but you need to add a log
message saying that we are waiting 5 seconds, so the user does not panic and
think it is frozen.
Which sysfs entry do you mean? If there is a better way, I would prefer to use it.
If we are sending hot remove to the disk, we can check whether this path still
exists: /sys/block/<mddev>/md/dev-{devnm}
If not, then the device has finally been removed.
Alternatively, we can see the same in mdstat, but checking the path looks
simpler to me.
Thanks,
Mariusz
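
A minimal sketch of the polling idea above, in C. The helper name
wait_for_path_gone and its timeout/interval parameters are hypothetical and
not part of mdadm; the caller is assumed to build the
/sys/block/<mddev>/md/dev-{devnm} path itself. As the follow-up in this thread
points out, the directory disappearing does not by itself mean
MD_RECOVERY_RUNNING has been cleared, so this only illustrates the check, not
a complete fix.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Hypothetical helper (not an mdadm function): poll every interval_ms
 * until `path` no longer exists, giving up after roughly timeout_ms.
 * Returns true if the path is gone, false on timeout.
 */
static bool wait_for_path_gone(const char *path, int timeout_ms,
			       int interval_ms)
{
	struct timespec delay;
	struct stat st;
	int waited_ms = 0;

	assert(path != NULL);
	if (interval_ms <= 0)
		interval_ms = 1;	/* avoid spinning forever without progress */

	delay.tv_sec = interval_ms / 1000;
	delay.tv_nsec = (long)(interval_ms % 1000) * 1000000L;

	for (;;) {
		/* ENOENT means the sysfs directory has been removed. */
		if (stat(path, &st) != 0 && errno == ENOENT)
			return true;
		if (waited_ms >= timeout_ms)
			return false;
		nanosleep(&delay, NULL);
		waited_ms += interval_ms;
	}
}
```

For example, a caller could poll every 100 ms for up to 5 seconds instead of
sleeping unconditionally for the whole 5 seconds.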
Hi Mariusz
I checked your method and it doesn't work. There are two steps in kernel
space and they are asynchronous:
1. Remove the disk, including removing the sysfs directory, set
MD_RECOVERY_NEEDED and wake up the md thread.
2. Because MD_RECOVERY_NEEDED is set, kernel space sets
MD_RECOVERY_RUNNING and queues a sync work. The sync work doesn't do
anything and clears MD_RECOVERY_RUNNING.
So there is a time window between the sysfs directory disappearing and
MD_RECOVERY_RUNNING being cleared, and its length depends on the machine.
Sometimes setting the new level fails because MD_RECOVERY_RUNNING is still
set. Maybe we can add a check when removing a disk: if no sync/recovery is
needed, we don't need to set MD_RECOVERY_NEEDED. But for now, we can
add a sleep here as a solution. I'll add a log message here to inform the admin.
Best Regards
Xiao
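
One possible alternative to a fixed 5 second sleep, sketched here only as an
illustration and not as what the patch does: retry the level change a bounded
number of times with a short delay, and print a message so the admin can see
that the tool is waiting. This assumes the level change fails with EBUSY while
MD_RECOVERY_RUNNING is set. The names retry_while_busy, attempt_fn, fake_ctx
and fake_set_level are hypothetical; fake_set_level only simulates a busy
array so that the sketch is self-contained.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical callback type: returns 0 on success, -1 with errno set. */
typedef int (*attempt_fn)(void *ctx);

/*
 * Call attempt() up to max_tries times, sleeping interval_ms between
 * tries, but only while it keeps failing with EBUSY.
 * Returns 0 on success, -1 on any other error or when tries run out.
 */
static int retry_while_busy(attempt_fn attempt, void *ctx, int max_tries,
			    int interval_ms)
{
	struct timespec delay;
	int i;

	assert(attempt != NULL);
	delay.tv_sec = interval_ms / 1000;
	delay.tv_nsec = (long)(interval_ms % 1000) * 1000000L;

	for (i = 0; i < max_tries; i++) {
		if (attempt(ctx) == 0)
			return 0;
		if (errno != EBUSY)
			return -1;	/* a different failure: do not retry */
		if (i + 1 < max_tries) {
			/* Tell the admin why the command has not returned yet. */
			fprintf(stderr, "array is busy, retrying (%d/%d)\n",
				i + 1, max_tries);
			nanosleep(&delay, NULL);
		}
	}
	errno = EBUSY;
	return -1;
}

/* Stand-in for the real level change: busy for the first busy_left calls. */
struct fake_ctx {
	int busy_left;
	int calls;
};

static int fake_set_level(void *p)
{
	struct fake_ctx *c = p;

	c->calls++;
	if (c->busy_left > 0) {
		c->busy_left--;
		errno = EBUSY;
		return -1;
	}
	return 0;
}
```

With this shape, a fast machine returns as soon as the flag is cleared, while a
slow one still gets several attempts before the command gives up.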