Hi all,
On Fri, Sep 23, 2022 at 11:47:11AM +0200, Ilya Dryomov wrote:
On Fri, Sep 23, 2022 at 5:58 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
On Wed, Sep 21, 2022 at 12:40:54PM +0200, Ilya Dryomov wrote:
On Wed, Sep 21, 2022 at 3:36 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
What can make a "rbd unmap" fail, assuming the device is not
mounted and not (obviously) open by any other processes?
OK, I'm confident I now understand the cause of this problem. The
particular machine where I'm mounting the rbd snapshots is also
running some containerised ceph services. The ceph containers are
(bind-)mounting the entire host filesystem hierarchy on startup, and
if a ceph container happens to start up whilst an rbd device is
mounted, the container also has the rbd mounted, preventing the host
from unmapping the device even after the host has unmounted it. (More
below.)
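
In case it helps anyone else chasing a stuck "rbd unmap", the mount
namespaces still holding the filesystem can be found by searching each
process's mountinfo. A rough sketch only - it assumes the snapshot was
mapped at /dev/rbd0, and <pid> is a placeholder for whatever the grep
turns up:

  # which mount namespaces still reference the device?
  grep -l /dev/rbd0 /proc/[0-9]*/mountinfo

  # then check which container a matching pid belongs to
  cat /proc/<pid>/cgroup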
This brings up a couple of issues...
Why is the ceph container getting access to the entire host
filesystem in the first place?
Even if I mount an rbd device with the "unbindable" mount option,
which is specifically supposed to prevent bind mounts to that
filesystem, the ceph containers still get the mount - how / why??
If the ceph containers really do need access to the entire host
filesystem, perhaps it would be better to do a "slave" mount, so
if/when the hosts unmounts a filesystem it's also unmounted in the
container[s]. (Of course this also means any filesystems newly
mounted in the host would also appear in the containers - but that
happens anyway if the container is newly started).
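
A quick way to see what propagation each side actually ended up with is
to compare the mount flags on the host and inside a running container.
A sketch only, with placeholder paths and pids:

  # on the host: propagation flags of the rbd mount and its parent
  findmnt -o TARGET,PROPAGATION /mnt/snap
  findmnt -o TARGET,PROPAGATION /

  # the same view from inside a ceph container, via any of its pids
  nsenter --target <container-pid> --mount findmnt -o TARGET,PROPAGATION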
Thanks for the great analysis! I think ceph-volume container does it
because of [1]. I'm not sure about "cephadm shell". There is also
node-exporter container that needs access to the host for gathering
metrics.
[1] https://tracker.ceph.com/issues/52926
I'm guessing ceph-volume may need to see the host mounts so it can
detect a disk is being used. Could this also be done in the host (like
issue 52926 says is being done with pv/vg/lv commands), removing the
need to have the entire host filesystem hierarchy available in the
container?
Similarly, I would have thought the node-exporter container only needs
access to ceph-specific files/directories rather than the whole system.
On Tue, Sep 27, 2022 at 12:55:37PM +0200, Ilya Dryomov wrote:
On Fri, Sep 23, 2022 at 3:06 PM Guillaume Abrioux <gabrioux@xxxxxxxxxx> wrote:
On Fri, 23 Sept 2022 at 05:59, Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
If the ceph containers really do need access to the entire host
filesystem, perhaps it would be better to do a "slave" mount,
Yes, I think a mount with 'slave' propagation should fix your issue.
I plan to do some tests next week and work on a patch.
Thanks Guillaume.
I wanted to share an observation that there seem to be two cases here:
actual containers (e.g. an OSD container) and cephadm shell which is
technically also a container but may be regarded by users as a shell
("window") with some binaries and configuration files injected into
it.
For my part I don't see or use a cephadm shell as a normal shell with
additional stuff injected. At the very least the host root filesystem
location has changed to /rootfs so it's obviously not a standard shell.
In fact I was quite surprised that the rootfs and all the other mounts
unrelated to ceph were available at all. I'm still not convinced it's a
good idea.
In my conception a cephadm shell is a mini virtual machine specifically
for inspecting and managing ceph-specific areas *only*.
I guess it's really a difference of philosophy. I only use cephadm shell
when I explicitly need to do something with ceph, and I drop back
out of the cephadm shell (and its associated privileges!) as soon as I'm
done with that specific task. For everything else I'll be in my
(non-privileged) host shell. I can imagine (although I must say I'd be
surprised) that others may use the cephadm shell as a matter of course
for managing the whole machine? Then again, given issue 52926 quoted
above, it sounds like that would be a bad idea if, for instance, the lvm
commands should NOT be run in the container "in order to avoid lvm metadata
corruption" - i.e. it's not safe to assume a cephadm shell is a normal
shell.
I would argue the goal should be to remove access to the general host
filesystem(s) from the ceph containers altogether where possible.
I'll also admit that, generally, it's probably a bad idea to be doing
things unrelated to ceph on a box hosting ceph. But that's the way this
particular system has grown and unfortunately it will take quite a bit
of time, effort, and expense to change this now.
For the former, a unidirectional propagation such that when something
is unmounted on the host it is also unmounted in the container is all
that is needed. However, for the latter, a bidirectional propagation
such that when something is mounted in this shell it is also mounted
on the host (and therefore in all other windows) seems desirable.
What do you think about going with MS_SLAVE for the former and
MS_SHARED for the latter?
Personally I would find it surprising and unexpected (i.e. potentially a
source of trouble) for mount changes done in a container (including a
"shell" container) to affect the host. But again, that may be that
difference of philosophy regarding the cephadm shell mentioned above.
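
For concreteness, this is roughly how I understand the two options would
look at the container level - illustrative podman invocations only, not
what cephadm actually runs, with /rootfs as the bind target per the
current cephadm shell:

  # MS_SLAVE: host (un)mount events propagate into the container, but
  # mounts made inside the container stay inside it
  podman run -v /:/rootfs:rslave ...

  # MS_SHARED: propagation in both directions, so a mount made inside
  # the container also appears on the host
  podman run -v /:/rootfs:rshared ...

(For the shared case to work, the host mount itself has to be shared,
which it is by default under systemd.)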
Chris