On Wed, Sep 28, 2022 at 2:22 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
>
> Hi all,
>
> On Fri, Sep 23, 2022 at 11:47:11AM +0200, Ilya Dryomov wrote:
> > On Fri, Sep 23, 2022 at 5:58 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
> >> On Wed, Sep 21, 2022 at 12:40:54PM +0200, Ilya Dryomov wrote:
> >>> On Wed, Sep 21, 2022 at 3:36 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
> >>>> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
> >>>>> What can make a "rbd unmap" fail, assuming the device is not
> >>>>> mounted and not (obviously) open by any other processes?
> >>
> >> OK, I'm confident I now understand the cause of this problem. The
> >> particular machine where I'm mounting the rbd snapshots is also
> >> running some containerised ceph services. The ceph containers are
> >> (bind-)mounting the entire host filesystem hierarchy on startup, and
> >> if a ceph container happens to start up whilst a rbd device is
> >> mounted, the container also has the rbd mounted, preventing the host
> >> from unmapping the device even after the host has unmounted it. (More
> >> below.)
> >>
> >> This brings up a couple of issues...
> >>
> >> Why is the ceph container getting access to the entire host
> >> filesystem in the first place?
> >>
> >> Even if I mount an rbd device with the "unbindable" mount option,
> >> which is specifically supposed to prevent bind mounts to that
> >> filesystem, the ceph containers still get the mount - how / why??
> >>
> >> If the ceph containers really do need access to the entire host
> >> filesystem, perhaps it would be better to do a "slave" mount, so
> >> if/when the host unmounts a filesystem it's also unmounted in the
> >> container[s]. (Of course this also means any filesystems newly
> >> mounted in the host would also appear in the containers - but that
> >> happens anyway if the container is newly started.)
> >
> > Thanks for the great analysis! I think the ceph-volume container does
> > it because of [1]. I'm not sure about "cephadm shell". There is also
> > the node-exporter container that needs access to the host for
> > gathering metrics.
> >
> > [1] https://tracker.ceph.com/issues/52926
>
> I'm guessing ceph-volume may need to see the host mounts so it can
> detect that a disk is being used. Could this also be done on the host
> (like issue 52926 says is being done with pv/vg/lv commands), removing
> the need to have the entire host filesystem hierarchy available in the
> container?
>
> Similarly, I would have thought the node-exporter container only needs
> access to ceph-specific files/directories rather than the whole system.
>
> On Tue, Sep 27, 2022 at 12:55:37PM +0200, Ilya Dryomov wrote:
> > On Fri, Sep 23, 2022 at 3:06 PM Guillaume Abrioux <gabrioux@xxxxxxxxxx> wrote:
> >> On Fri, 23 Sept 2022 at 05:59, Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
> >>> If the ceph containers really do need access to the entire host
> >>> filesystem, perhaps it would be better to do a "slave" mount,
> >>
> >> Yes, I think a mount with 'slave' propagation should fix your issue.
> >> I plan to do some tests next week and work on a patch.
>
> Thanks Guillaume.
>
> > I wanted to share an observation that there seem to be two cases here:
> > actual containers (e.g. an OSD container) and cephadm shell, which is
> > technically also a container but may be regarded by users as a shell
> > ("window") with some binaries and configuration files injected into
> > it.
>
> For my part I don't see or use a cephadm shell as a normal shell with
> additional stuff injected. At the very least the host root filesystem
> location has changed to /rootfs, so it's obviously not a standard shell.
>
> In fact I was quite surprised that the rootfs and all the other mounts
> unrelated to ceph were available at all. I'm still not convinced it's a
> good idea.
>
> In my conception a cephadm shell is a mini virtual machine specifically
> for inspecting and managing ceph-specific areas *only*.
>
> I guess it's really a difference of philosophy. I only use cephadm shell
> when I explicitly need to do something with ceph, and I drop back out
> of the cephadm shell (and its associated privileges!) as soon as I'm
> done with that specific task. For everything else I'll be in my
> (non-privileged) host shell. I can imagine (although I must say I'd be
> surprised) that others may use the cephadm shell as a matter of course,
> for managing the whole machine. Then again, given issue 52926 quoted
> above, it sounds like that would be a bad idea if, for instance, the lvm
> commands should NOT be run in the container "in order to avoid lvm
> metadata corruption" - i.e. it's not safe to assume a cephadm shell is
> a normal shell.
>
> I would argue the goal should be to remove access to the general host
> filesystem(s) from the ceph containers altogether where possible.
>
> I'll also admit that, generally, it's probably a bad idea to be doing
> things unrelated to ceph on a box hosting ceph. But that's the way this
> particular system has grown, and unfortunately it will take quite a bit
> of time, effort, and expense to change this now.
>
> > For the former, a unidirectional propagation such that when something
> > is unmounted on the host it is also unmounted in the container is all
> > that is needed. However, for the latter, a bidirectional propagation
> > such that when something is mounted in this shell it is also mounted
> > on the host (and therefore in all other windows) seems desirable.
> >
> > What do you think about going with MS_SLAVE for the former and
> > MS_SHARED for the latter?
>
> Personally I would find it surprising and unexpected (i.e. potentially a
> source of trouble) for mount changes done in a container (including a
> "shell" container) to affect the host. But again, that may be that
> difference of philosophy regarding the cephadm shell mentioned above.

Hi Chris,

Right, I see your point, particularly around the /rootfs location making
it obvious that it's not a standard shell.

I don't have a strong opinion here; ultimately the fix is up to Adam and
Guillaume (although I would definitely prefer a set of targeted mounts
over a blanket -v /:/rootfs mount, whether slave or not).
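
To make that a bit more concrete (this is only a rough sketch, not
actual cephadm code, and the exact flags and paths are of course up to
Adam and Guillaume), the propagation question mostly comes down to how
the blanket bind mount is handed to podman/docker:

  # daemon containers: slave propagation, so when the host unmounts a
  # filesystem it also disappears inside the container and no longer
  # pins the rbd device
  podman run ... -v /:/rootfs:rslave ...

  # cephadm shell: shared propagation, so mounts made inside the shell
  # also show up on the host (and in other "windows"), and host mounts
  # show up in the shell
  podman run ... -v /:/rootfs:rshared ...

The targeted-mounts alternative would be to drop the blanket /rootfs
mount altogether and bind in only what each container actually needs,
e.g. something like -v /var/lib/ceph/<fsid>:/var/lib/ceph/<fsid>, plus
whatever ceph-volume and node-exporter specifically require.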
Thanks,

Ilya