Hello
I have been working on the infamous "rbd: unmap failed: (16) Device or resource busy" error, which causes problems for several projects such as Incus (LXD); many people simply live with it. I posted about it on the ceph-users mailing list, and Murilo Morais worked on it with me for several days.
(Work done together with Stephane Graber, the main Incus/LXD developer.)
I worked out an easily reproducible setup that needs nothing beyond a stock Ceph installation: create an image, map it, format and mount it, ADD AN OSD, then try to unmap the image ... tada! "rbd: unmap failed: (16) Device or resource busy" is there.
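Condensed, the reproduction boils down to the following (a sketch; <pool> is a placeholder for any existing RBD pool, and the OSD can be added by whatever means you like, e.g. the dashboard):

rbd create image1 --size 1024 --pool <pool>
RBD_DEVICE=$(rbd map <pool>/image1)
mkfs.ext4 ${RBD_DEVICE}
mount ${RBD_DEVICE} /media/test
# ... now add an OSD to the cluster ...
umount ${RBD_DEVICE}
rbd unmap ${RBD_DEVICE}    # fails: rbd: unmap failed: (16) Device or resource busy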
Here is the more complete explanation (which is the great work of Murilo Morais):
I managed to reproduce it.
The problem is how docker/podman binds "/" to "/rootfs" in the containers.
When Ceph creates the unit files for systemd to start the services, it includes a bind of the system root to /rootfs [1]. I do not recommend removing this bind, as it will break the MON.
By default they use "rprivate" [2][3], which causes the mount points to be propagated into the container but prevents the container from receiving any "mount" or "umount" events from the host [4]. This is what causes this behavior in your cluster.
This will happen whenever any container starts/restarts, regardless of whether it is a new daemon or not.
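To check which propagation mode a daemon's /rootfs bind actually got, you can ask findmnt from inside the container (a sketch; it assumes findmnt is available in the container image):

podman exec <container> findmnt -o TARGET,PROPAGATION /rootfs

With the stock unit files this reports "private"; with the workaround below it should report "slave".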
A quick workaround is to change the unit files in /var/lib/ceph/<fsid>/<daemon>/ and add "slave" or "rslave" to the podman bind argument: where it contains "-v /:/rootfs", append ":slave", leaving "-v /:/rootfs:slave". The inconvenience is that all daemons will have to be restarted, and whenever a daemon is added or redeployed you will have to perform the same steps again.
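Applied by hand, the edit looks like this (a sketch; it assumes a cephadm-style deployment where the podman invocation lives in the unit.run file, and the systemd unit name has to be adapted to your fsid and daemon):

sed -i 's|-v /:/rootfs|-v /:/rootfs:rslave|' /var/lib/ceph/<fsid>/<daemon>/unit.run
systemctl restart ceph-<fsid>@<daemon>.service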
A definitive solution would be to change the source code, but I haven't had time to try that option yet. Tomorrow, as soon as I get to the office, I will try it, and I will report back as soon as I discover something!
If you wish, you can write to the public list about your problem and the need to change the rootfs bind, so everyone can see it and some of the project's devs can perhaps comment on it.
Have a good night!
[3] https://docs.podman.io/en/latest/markdown/podman-create.1.html#mount-type-type-type-specific-option
And here is the reproduction log:
First I create a new image and map it (to stay independent of Incus):
root@ceph02-r2b-fl1:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 53.12473 root default
...
-5 3.63879 host ceph02-r2b-fl1
6 hdd 0.90970 osd.6 up 1.00000 1.00000
9 hdd 0.90970 osd.9 up 1.00000 1.00000
2 nvme 0.90970 osd.2 up 1.00000 1.00000
4 nvme 0.90970 osd.4 up 1.00000 1.00000
...
root@ceph02-r2b-fl1:~# rbd create image1 --size 1024 --pool customers-clouds.ix-mrs2.fr.eho
root@ceph02-r2b-fl1:~# RBD_DEVICE=$(rbd map customers-clouds.ix-mrs2.fr.eho/image1)
root@ceph02-r2b-fl1:~# mkfs.ext4 ${RBD_DEVICE}
mke2fs 1.47.0 (5-Feb-2023)
Discarding device blocks: done
Creating filesystem with 262144 4k blocks and 65536 inodes
Filesystem UUID: c97362e1-11db-4ff3-ba62-ede6d58884b9
Superblock backups stored on blocks:
32768, 98304, 163840, 229376
Allocating group tables: done
Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done
root@ceph02-r2b-fl1:~# mount ${RBD_DEVICE} /media/test
Let's list the currently mounted RBD devices:
root@ceph02-r2b-fl1:~# mount | grep rbd
/dev/rbd4 on /var/lib/incus/storage-pools/default/containers/ec-xx type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd5 on /var/lib/incus/storage-pools/default/containers/ec-xx type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd2 on /var/lib/incus/storage-pools/default/containers/ec-xx type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd1 on /var/lib/incus/storage-pools/default/containers/ec-xx type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd3 on /var/lib/incus/storage-pools/default/containers/ec-xx type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd0 on /var/lib/incus/storage-pools/default/containers/ec-xx type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd6 on /media/test type ext4 (rw,relatime,stripe=16)
============================================================================================
===============> Adding the new OSD (a small old disk, so it goes faster...) <===================
=========================== (via service task in dashboard) ================================
============================================================================================
root@xxxxxxxxxxxxxxxxxxxxxxx.eholab.admin:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 53.25822 root default
...
-5 3.77229 host ceph02-r2b-fl1
6 hdd 0.90970 osd.6 up 1.00000 1.00000
9 hdd 0.90970 osd.9 up 1.00000 1.00000
26 hdd 0.13350 osd.26 up 1.00000 1.00000 <=== Here's the brand new OSD
2 nvme 0.90970 osd.2 up 1.00000 1.00000
4 nvme 0.90970 osd.4 up 1.00000 1.00000
....
============ Let's check what the new OSD's mount namespace contains ... =======================
root@ceph02-r2b-fl1:~# podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
.....
6dbd4fd4e4f3 cephpodregistry:5000/ceph@sha256:e205163225ec8ce460d6581df66ba4866585e3a4817866910f85bedcdcff7935 -n osd.26 -f --se... 2 minutes ago Up 2 minutes ago ceph-c3f59906-c43d-11ee-a2d6-3a82cb8036b6-osd-26
root@xxxxxxxxxxxxxxxxxxxxxxx.eholab.admin:~# podman exec 6dbd4fd4e4f3 mount | grep rbd
/dev/rbd4 on /rootfs/var/lib/incus/storage-pools/default/containers/ec-xx type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd5 on /rootfs/var/lib/incus/storage-pools/default/containers/ec-xx type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd2 on /rootfs/var/lib/incus/storage-pools/default/containers/ec-xx type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd1 on /rootfs/var/lib/incus/storage-pools/default/containers/ec-xx type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd3 on /rootfs/var/lib/incus/storage-pools/default/containers/ec-xx type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd0 on /rootfs/var/lib/incus/storage-pools/default/containers/ec-xx type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd6 on /rootfs/media/test type ext4 (rw,relatime,stripe=16)
======= Of course these mount points are not needed by the newly created OSD ... and this full copy of the host's mounts is problematic ============
And then:
root@ceph02-r2b-fl1:~# umount /dev/rbd6
- I can unmount the rbd device ... there is no reference left in the host's namespace -
root@ceph02-r2b-fl1:~# rbd unmap /dev/rbd6
rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy
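Before pointing at the OSD container, here is a generic way to find which processes' mount namespaces still hold the device (a sketch; it relies only on /proc):

grep -l rbd6 /proc/[0-9]*/mountinfo

Every PID it prints lives in a mount namespace that still references /dev/rbd6. In this case it points straight at the new OSD container: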
root@ceph02-r2b-fl1:~# podman exec 6dbd4fd4e4f3 mount | grep rbd
/dev/rbd4 on /rootfs/var/lib/incus/storage-pools/default/containers/ec-06c995c3 type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd5 on /rootfs/var/lib/incus/storage-pools/default/containers/ec-0bc99da2 type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd2 on /rootfs/var/lib/incus/storage-pools/default/containers/ec-62cc652e type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd1 on /rootfs/var/lib/incus/storage-pools/default/containers/ec-5dcc5d4f type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd3 on /rootfs/var/lib/incus/storage-pools/default/containers/ec-efd3feea type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd0 on /rootfs/var/lib/incus/storage-pools/default/containers/ec-59cc5703 type ext4 (rw,relatime,discard,stripe=16)
/dev/rbd6 on /rootfs/media/test type ext4 (rw,relatime,stripe=16)
... because the OSD's mount namespace still holds the mount (as it does for all the rbd devices that were mapped when the container started...)
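As an immediate unblock (a sketch, not a fix: it unmounts the stale reference inside the container's mount namespace so the unmap can finally proceed; note that every container started while the image was mounted holds such a reference, so this may need repeating per container):

CT_PID=$(podman inspect -f '{{.State.Pid}}' 6dbd4fd4e4f3)
nsenter -t ${CT_PID} -m umount /rootfs/media/test
rbd unmap /dev/rbd6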
I hope someone can create a ticket for this bug in the Ceph bug tracker; I didn't find a way to create a ticket without being "a member".
Regards
Nicolas FOURNIL