On Fri, Sep 23, 2022 at 5:58 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
>
> Hi Ilya,
>
> On Wed, Sep 21, 2022 at 12:40:54PM +0200, Ilya Dryomov wrote:
> > On Wed, Sep 21, 2022 at 3:36 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
> >> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
> >>> What can make a "rbd unmap" fail, assuming the device is not
> >>> mounted and not (obviously) open by any other processes?
>
> OK, I'm confident I now understand the cause of this problem. The
> particular machine where I'm mounting the rbd snapshots is also running
> some containerised ceph services. The ceph containers are
> (bind-)mounting the entire host filesystem hierarchy on startup, and if
> a ceph container happens to start up whilst an rbd device is mounted,
> the container also has the rbd mounted, preventing the host from
> unmapping the device even after the host has unmounted it. (More below.)
>
> This brings up a couple of issues...
>
> Why is the ceph container getting access to the entire host filesystem
> in the first place?
>
> Even if I mount an rbd device with the "unbindable" mount option, which
> is specifically supposed to prevent bind mounts to that filesystem, the
> ceph containers still get the mount - how / why??
>
> If the ceph containers really do need access to the entire host
> filesystem, perhaps it would be better to do a "slave" mount, so if/when
> the host unmounts a filesystem it's also unmounted in the container[s].
> (Of course this also means any filesystems newly mounted in the host
> would also appear in the containers - but that happens anyway if the
> container is newly started.)
>
> >> An unsuccessful iteration looks like this:
> >>
> >> 18:37:31.885408 O 3294108 rbd29 0 mapper
> >> 18:37:33.181607 R 3294108 rbd29 1 mapper
> >> 18:37:33.182086 O 3294175 rbd29 0 systemd-udevd
> >> 18:37:33.197982 O 3294691 rbd29 1 blkid
> >> 18:37:42.712870 R 3294691 rbd29 2 blkid
> >> 18:37:42.716296 R 3294175 rbd29 1 systemd-udevd
> >> 18:37:42.738469 O 3298073 rbd29 0 mount
> >> 18:37:49.339012 R 3298073 rbd29 1 mount
> >> 18:37:49.339352 O 3298073 rbd29 0 mount
> >> 18:38:51.390166 O 2364320 rbd29 1 rpc.mountd
> >> 18:39:00.989050 R 2364320 rbd29 2 rpc.mountd
> >> 18:53:56.054685 R 3313923 rbd29 1 init
> >>
> >> According to my script log, the first unmap attempt was at 18:39:42,
> >> i.e. 42 seconds after rpc.mountd released the device. At that point
> >> the open_count was (or should have been?) 1 again, allowing the unmap
> >> to succeed - but it didn't. The unmap was retried every second until it
> >
> > For unmap to go through, open_count must be 0. rpc.mountd at
> > 18:39:00.989050 just decremented it from 2 to 1, it didn't release
> > the device.
>
> Yes - but my poorly made point was that, per the normal test iteration,
> some time shortly after rpc.mountd decremented open_count to 1, an
> "umount" command was run successfully (the test would have aborted if
> the umount didn't succeed) - but the "umount" didn't show up in the
> bpftrace output. Immediately after the umount a "rbd unmap" was run,
> which failed with "busy" - i.e. the open_count was still incremented.
>
> >> eventually succeeded at 18:53:56, the same time as the mysterious
> >> "init" process ran - but also note there is NO "umount" process in
> >> there, so I don't know if the name of the process recorded by bpftrace
> >> is simply incorrect (but how would that happen??) or what else could
> >> be going on.
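
As a rough illustration of this kind of tracing, a bpftrace one-liner along
these lines can be run from the shell - a sketch only: it assumes
rbd_open/rbd_release show up as kprobe targets on the running kernel, and it
prints just the pid and process name rather than the device and open_count
recorded by the attached script:

  # sketch: log opens (O) and releases (R) of rbd block devices
  bpftrace -e '
      kprobe:rbd_open    { time("%H:%M:%S "); printf("O %d %s\n", pid, comm); }
      kprobe:rbd_release { time("%H:%M:%S "); printf("R %d %s\n", pid, comm); }'
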
>
> Using "ps" once the unmap starts failing, then cross checking against
> the process id recorded for the mysterious "init" in the bpftrace
> output, reveals the full command line for the "init" is:
>
> /dev/init -- /usr/sbin/ceph-volume inventory --format=json-pretty --filter-for-batch
>
> I.e. it's the 'init' process of a ceph-volume container that eventually
> releases the open_count.
>
> After doing a lot of learning about ceph and containers (podman in this
> case) and namespaces etc. etc., the problem is now known...
>
> Ceph containers are started with '-v "/:/rootfs"' which bind mounts the
> entire host's filesystem hierarchy into the container. Specifically, if
> the host has mounted filesystems, they're also mounted within the
> container when it starts up. So, if a ceph container starts up whilst
> there is a filesystem mounted from an rbd mapped device, the container
> also has that mount - and it retains the mount even if the filesystem is
> unmounted in the host. So the rbd device can't be unmapped in the host
> until the filesystem is released by the container, either via an explicit
> umount within the container, or a umount from the host targeting the
> container namespace, or the container exits.
>
> This explains the mysterious 51 rbd devices that I haven't been able to
> unmap for a week: they're all mounted within long-running ceph containers
> that happened to start up whilst those 51 devices were all mounted
> somewhere. I've now been able to unmap those devices after unmounting the
> filesystems within those containers using:
>
> umount --namespace "${pid_of_container}" "${fs}"
>
>
> ------------------------------------------------------------
> An example demonstrating the problem
> ------------------------------------------------------------
> #
> # Mount a snapshot, with "unbindable"
> #
> host# {
> rbd=pool/name@snap
> dev=$(rbd device map "${rbd}")
> declare -p dev
> mount -oro,norecovery,nouuid,unbindable "${dev}" "/mnt"
> echo --
> grep "${dev}" /proc/self/mountinfo
> echo --
> ls /mnt
> echo --
> }
> declare -- dev="/dev/rbd30"
> --
> 1463 22 252:480 / /mnt ro unbindable - xfs /dev/rbd30 ro,nouuid,norecovery
> --
> file1 file2 file3
>
> #
> # The mount is still visible if we start a ceph container
> #
> host# cephadm shell
> root@host:/# ls /mnt
> file1 file2 file3
>
> #
> # The device is not unmappable from the host...
> #
> host# umount /mnt
> host# rbd device unmap "${dev}"
> rbd: sysfs write failed
> rbd: unmap failed: (16) Device or resource busy
>
> #
> # ...until we umount the filesystem within the container
> #
> host# lsns -t mnt
>          NS TYPE NPROCS     PID USER COMMAND
>  4026533050 mnt       2 3105356 root /dev/init -- bash
> host# umount --namespace 3105356 /mnt
> host# rbd device unmap "${dev}"
> ## success
> ------------------------------------------------------------

Hi Chris,

Thanks for the great analysis!

I think the ceph-volume container does it because of [1].  I'm not sure
about "cephadm shell".  There is also the node-exporter container,
which needs access to the host for gathering metrics.

I'm adding Adam (cephadm maintainer) and Guillaume (ceph-volume
maintainer), as this is something that clearly wasn't intended.

[1] https://tracker.ceph.com/issues/52926

Ilya

> >
> >> The bpftrace script looks like this:
> >
> > It would be good to attach the entire script, just in case someone runs
> > into a similar issue in the future and tries to debug the same way.
>
> Attached.
>
> Cheers,
>
> Chris
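
The per-namespace cleanup described above (unmounting the filesystem in every
mount namespace that still holds it) could be scripted along these lines - a
sketch only, assuming util-linux's lsns and an umount new enough to support
--namespace; the /mnt path is illustrative:

  fs=/mnt
  # Walk every mount namespace on the host; field 5 of /proc/<pid>/mountinfo
  # is the mount point, so unmount ${fs} wherever it is still mounted.
  for pid in $(lsns -t mnt -n -o PID); do
      if awk -v fs="$fs" '$5 == fs { found = 1 } END { exit !found }' \
              "/proc/$pid/mountinfo" 2>/dev/null; then
          echo "unmounting $fs in mount namespace of pid $pid"
          umount --namespace "$pid" "$fs"
      fi
  done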