Re: The dev node can't be released at once after stopping raid

----- Original Message -----
> From: "NeilBrown" <neilb@xxxxxxxx>
> To: "Xiao Ni" <xni@xxxxxxxxxx>
> Cc: linux-raid@xxxxxxxxxxxxxxx
> Sent: Thursday, August 31, 2017 12:36:08 PM
> Subject: Re: The dev node can't be released at once after stopping raid
> 
> On Wed, Aug 30 2017, Xiao Ni wrote:
> 
> > Hi Neil
> >
> > I have searched in history emails and there have many topics like this.
> > Sorry for talking
> > about this again. But it looks like the situation I encountered is
> > different. There is 1 second
> > window between stop the raid device and delete the node /dev/md0. The
> > /dev/md0 node can be
> > removed successfully after 1 second.
> 
> I think you are saying that /dev/md0 gets deleted 1 second after the
> device is stopped.  I assume that is a delay in udev processing of
> events.
> 
> When you say "can be"  I assume you mean "is being".
> ie. if you say
>    "The node can be removed after 1 second", it seems to imply that if
>    you try to remove it earlier, the unlink() will fail.
> If you say
>   "The node is being removed after 1 seconds", that suggests that the
>   removal happens automatically, but there is a delay between the device
>   stopping and the removal happening.

Yes, that is the situation. The node is removed automatically, roughly
1 second after the device stops.

> 
> >
> > There is no process that open the /dev/md0 after mdadm -S /dev/md0:
> >
> > mdadm -CR /dev/md0 -l1 -n2 /dev/loop0 /dev/loop1 --assume-clean
> > dmesg:
> > [36416.860525] Opened by mdadm, pid is 3523
> > [36416.984160] md/raid1:md0: active with 2 out of 2 mirrors
> > [36416.984181] md0: detected capacity change from 0 to 523239424
> > [36416.984219] Released by mdadm, pid is 3523
> > [36416.984228] remove_and_add_spares
> > [36416.991588] Opened by mdadm, pid is 3541
> > [36416.997183] Released by mdadm, pid is 3541
> > [36417.001376] Opened by systemd-udevd, pid is 3525
> > [36417.007128] Released by systemd-udevd, pid is 3525
> >
> > udev:
> > KERNEL[36419.830817] add      /devices/virtual/bdi/9:0 (bdi)
> > KERNEL[36419.831045] add      /devices/virtual/block/md0 (block)
> > UDEV  [36419.832911] add      /devices/virtual/bdi/9:0 (bdi)
> > UDEV  [36419.836380] add      /devices/virtual/block/md0 (block)
> > KERNEL[36419.877705] change   /devices/virtual/block/loop0 (block)
> > KERNEL[36419.878057] change   /devices/virtual/block/loop0 (block)
> > KERNEL[36419.926761] change   /devices/virtual/block/loop1 (block)
> > KERNEL[36419.927015] change   /devices/virtual/block/loop1 (block)
> > UDEV  [36419.953112] change   /devices/virtual/block/loop0 (block)
> > UDEV  [36419.953141] change   /devices/virtual/block/loop1 (block)
> > KERNEL[36419.954765] change   /devices/virtual/block/md0 (block)
> > UDEV  [36419.955973] change   /devices/virtual/block/loop0 (block)
> > UDEV  [36419.962799] change   /devices/virtual/block/loop1 (block)
> > UDEV  [36419.982934] change   /devices/virtual/block/md0 (block)
> >
> > mdadm -S /dev/md0
> > dmesg:
> > [36493.068054] Opened by mdadm, pid is 3552
> > [36493.072051] Released by mdadm, pid is 3552
> > [36493.076123] Opened by mdadm, pid is 3552
> > [36493.080073] md0: detected capacity change from 523239424 to 0
> > [36493.080077] md: md0 stopped.
> > [36493.273011] Released by mdadm, pid is 3552
> > udev:
> > KERNEL[36496.300219] remove   /devices/virtual/bdi/9:0 (bdi)
> > KERNEL[36496.300335] remove   /devices/virtual/block/md0 (block)
> > UDEV  [36496.300736] remove   /devices/virtual/bdi/9:0 (bdi)
> > UDEV  [36496.301812] remove   /devices/virtual/block/md0 (block)
> 
> I don't see any 1 second delay here.
> I can see a 3 second delay between "Released by mdadm, pid = 3552" and
> the UDEV remove event.  Is that what you are referring to?

Ah, how did you calculate 3 seconds? 36496 - 36493?

I ran the test again. This time the dmesg and udev output are:
dmesg:
[ 2988.821730] Opened by mdadm, pid is 3174
[ 2988.825827] Released by mdadm, pid is 3174
[ 2988.830112] Opened by mdadm, pid is 3174
[ 2988.834200] md: md0 stopped.
[ 2988.834397] Released by mdadm, pid is 3174
udev:
KERNEL[2989.150258] remove   /devices/virtual/bdi/9:0 (bdi)
KERNEL[2989.150334] remove   /devices/virtual/block/md0 (block)
UDEV  [2989.150491] remove   /devices/virtual/bdi/9:0 (bdi)
UDEV  [2989.151587] remove   /devices/virtual/block/md0 (block)
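
The gap in a run like this can be worked out from the two bracketed
timestamps. This is only a minimal sketch: ts_delta is a hypothetical helper
name, and it assumes the dmesg and udev monitor timestamps come from the same
clock, which is not verified here.

```shell
#!/bin/sh
# ts_delta START END
# Print END - START in seconds with microsecond precision.
# Assumption: both timestamps are taken from the same clock.
ts_delta() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.6f\n", b - a }'
}

# Last "Released by mdadm" line in dmesg vs. first KERNEL remove event.
ts_delta 2988.834397 2989.150258
```

For the second run this comes to roughly a third of a second, so the delay is
well under the 1 second that the test script sleeps for.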


The test script is:
[root@dell-per210-01 ~]# cat test.sh 
#!/bin/sh
mdadm -CR /dev/md0 -l1 -n2 /dev/loop0  /dev/loop1 --assume-clean
mdadm -S /dev/md0
ls /dev/md0
sleep 1
ls /dev/md0

The result is:
[root@dell-per210-01 ~]# sh test.sh 
...
mdadm: stopped /dev/md0
/dev/md0
ls: cannot access /dev/md0: No such file or directory
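
A caller that needs the node to be gone before it continues could poll for
that instead of using a fixed "sleep 1". The following is only a sketch under
stated assumptions: wait_gone is a hypothetical helper, not part of mdadm, and
the fractional sleep assumes GNU coreutils or busybox. Running "udevadm settle"
after the stop is another option for waiting on the udev event queue.

```shell
#!/bin/sh
# wait_gone PATH [TIMEOUT_SECONDS]
# Poll until PATH no longer exists.
# Returns 0 once PATH is gone, 1 if it still exists after the timeout
# (default 5 seconds). Hypothetical helper, not part of mdadm.
wait_gone() {
    path=$1
    tries=$(( ${2:-5} * 10 ))   # one check every 0.1 seconds
    while [ -e "$path" ]; do
        if [ "$tries" -le 0 ]; then
            return 1
        fi
        tries=$((tries - 1))
        sleep 0.1               # fractional sleep: GNU coreutils or busybox
    done
    return 0
}

# Intended use (not executed here because it needs a real array):
#   mdadm -S /dev/md0 && wait_gone /dev/md0 5
```

This does not remove the window itself. It only keeps a script from racing
with the node removal.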

> 
> >
> > There are only REMOVE events during command mdadm -S /dev/md0.
> 
> The remove events seems to happen *after* "mdadm -S /dev/md0", or did
> "mdadm -S /dev/md0" take 3 seconds to run?
> 
> >
> > I tried to create a lvm and remove it to check whether lvm has this problem
> > or not.
> >
> > pvcreate /dev/md0
> > vgcreate vg /dev/md0
> > lvcreate -L 100M -n test vg
> > lvremove vg/test -y
> > ls /dev/mapper/vg-test
> > ls /dev/dm-3
> >
> > The node /dev/mapper/vg-test and /dev/dm-3 can be removed in time. There is
> > no time
> > window. So it looks like it's a problem of md. Could you give some
> > suggestions about
> > this? What should I do next?
> 
> Maybe lvremove explicitly unlinks the files in /dev, I don't know.

I did a test. mdadm already unlinks /run/mdadm/map.lock during mdadm -S.
Could mdadm unlink the device node explicitly too? With the one-line change
below, the problem goes away.

diff --git a/Manage.c b/Manage.c
index b82a729..04994b3 100644
--- a/Manage.c
+++ b/Manage.c
@@ -482,6 +482,7 @@ done:
        map_lock(&map);
        map_remove(&map, devnm);
        map_unlock(&map);
+       unlink(devname);
 out:
        sysfs_free(mdi);
 

> 
> >
> > If it's not a bug, why there is a 1 second window?
> 
> As I said, probably because udev is slow.
> Why do you think this is a problem?  Why do you care about 1 second
> window.  If I don't know how why this matters, I cannot help you.

There is a bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1444434
Another tool (blivet) stops the raid device, but the device node still exists
afterwards. It then calls mdadm -S xxx again, and that call fails. So I asked
myself why /dev/mdxxx can't be removed immediately after mdadm -S.

In the thread "MD Remnants After --stop", you said the REMOVE events are
generated by "md_free() -> del_gendisk() -> blk_unregister_queue()".
When mdadm -S returns, the REMOVE events should already have been generated,
right?

I have always had one question: who is responsible for removing the device
node under /dev? Is it done by a call to unlink()?

> 
> NeilBrown
> 