On Wed, Sep 28, 2022 at 2:22 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
>
> Hi all,
>
> On Fri, Sep 23, 2022 at 11:47:11AM +0200, Ilya Dryomov wrote:
> > On Fri, Sep 23, 2022 at 5:58 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
> >> On Wed, Sep 21, 2022 at 12:40:54PM +0200, Ilya Dryomov wrote:
> >>> On Wed, Sep 21, 2022 at 3:36 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
> >>>> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
> >>>>> What can make a "rbd unmap" fail, assuming the device is not
> >>>>> mounted and not (obviously) open by any other processes?
> >>
> >> OK, I'm confident I now understand the cause of this problem. The
> >> particular machine where I'm mounting the rbd snapshots is also
> >> running some containerised ceph services. The ceph containers are
> >> (bind-)mounting the entire host filesystem hierarchy on startup, and
> >> if a ceph container happens to start up whilst a rbd device is
> >> mounted, the container also has the rbd mounted, preventing the host
> >> from unmapping the device even after the host has unmounted it. (More
> >> below.)
> >>
> >> This brings up a couple of issues...
> >>
> >> Why is the ceph container getting access to the entire host
> >> filesystem in the first place?
> >>
> >> Even if I mount an rbd device with the "unbindable" mount option,
> >> which is specifically supposed to prevent bind mounts to that
> >> filesystem, the ceph containers still get the mount - how / why??
> >>
> >> If the ceph containers really do need access to the entire host
> >> filesystem, perhaps it would be better to do a "slave" mount, so
> >> if/when the host unmounts a filesystem it's also unmounted in the
> >> container[s]. (Of course this also means any filesystems newly
> >> mounted in the host would also appear in the containers - but that
> >> happens anyway if the container is newly started.)
> >
> > Thanks for the great analysis! I think the ceph-volume container does
> > it because of [1]. I'm not sure about "cephadm shell". There is also
> > the node-exporter container that needs access to the host for
> > gathering metrics.
> >
> > [1] https://tracker.ceph.com/issues/52926
>
> I'm guessing ceph-volume may need to see the host mounts so it can
> detect that a disk is being used. Could this also be done on the host
> (like issue 52926 says is being done with pv/vg/lv commands), removing
> the need to have the entire host filesystem hierarchy available in the
> container?
>
> Similarly, I would have thought the node-exporter container only needs
> access to ceph-specific files/directories rather than the whole system.
>
> On Tue, Sep 27, 2022 at 12:55:37PM +0200, Ilya Dryomov wrote:
> > On Fri, Sep 23, 2022 at 3:06 PM Guillaume Abrioux <gabrioux@xxxxxxxxxx> wrote:
> >> On Fri, 23 Sept 2022 at 05:59, Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
> >>> If the ceph containers really do need access to the entire host
> >>> filesystem, perhaps it would be better to do a "slave" mount,
> >>
> >> Yes, I think a mount with 'slave' propagation should fix your issue.
> >> I plan to do some tests next week and work on a patch.
>
> Thanks Guillaume.
>
> > I wanted to share an observation that there seem to be two cases here:
> > actual containers (e.g. an OSD container) and cephadm shell, which is
> > technically also a container but may be regarded by users as a shell
> > ("window") with some binaries and configuration files injected into
> > it.
>
> For my part I don't see or use a cephadm shell as a normal shell with
> additional stuff injected. At the very least the host root filesystem
> location has changed to /rootfs, so it's obviously not a standard shell.
>
> In fact I was quite surprised that the rootfs and all the other mounts
> unrelated to ceph were available at all. I'm still not convinced it's a
> good idea.
>
> In my conception a cephadm shell is a mini virtual machine specifically
> for inspecting and managing ceph-specific areas *only*.
>
> I guess it's really a difference of philosophy. I only use cephadm shell
> when I explicitly need to do something with ceph, and I drop back out
> of the cephadm shell (and its associated privileges!) as soon as I'm
> done with that specific task. For everything else I'll be in my
> (non-privileged) host shell. I can imagine (although I must say I'd be
> surprised) that others may use the cephadm shell as a matter of course,
> for managing the whole machine. Then again, given issue 52926 quoted
> above, it sounds like that would be a bad idea if, for instance, the lvm
> commands should NOT be run in the container "in order to avoid lvm
> metadata corruption" - i.e. it's not safe to assume a cephadm shell is
> a normal shell.
>
> I would argue the goal should be to remove access to the general host
> filesystem(s) from the ceph containers altogether where possible.
>
> I'll also admit that, generally, it's probably a bad idea to be doing
> things unrelated to ceph on a box hosting ceph. But that's the way this
> particular system has grown, and unfortunately it will take quite a bit
> of time, effort, and expense to change this now.
>
> > For the former, a unidirectional propagation such that when something
> > is unmounted on the host it is also unmounted in the container is all
> > that is needed. However, for the latter, a bidirectional propagation
> > such that when something is mounted in this shell it is also mounted
> > on the host (and therefore in all other windows) seems desirable.
> >
> > What do you think about going with MS_SLAVE for the former and
> > MS_SHARED for the latter?
>
> Personally I would find it surprising and unexpected (i.e. potentially a
> source of trouble) for mount changes done in a container (including a
> "shell" container) to affect the host. But again, that may be that
> difference of philosophy regarding the cephadm shell mentioned above.

Hi Chris,

Right, I see your point, particularly around the /rootfs location making
it obvious that it's not a standard shell.

I don't have a strong opinion here; ultimately the fix is up to Adam and
Guillaume (although I would definitely prefer a set of targeted mounts
over a blanket -v /:/rootfs mount, whether slave or not).
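
To make that a bit more concrete (this is only a rough sketch, not
actual cephadm code, and the exact flags and paths are of course up to
Adam and Guillaume), the propagation question mostly comes down to how
the blanket bind mount is handed to podman/docker:

  # daemon containers: slave propagation, so when the host unmounts a
  # filesystem it also disappears inside the container and no longer
  # pins the rbd device
  podman run ... -v /:/rootfs:rslave ...

  # cephadm shell: shared propagation, so mounts made inside the shell
  # also show up on the host (and in other "windows"), and host mounts
  # show up in the shell
  podman run ... -v /:/rootfs:rshared ...

The targeted-mounts alternative would be to drop the blanket /rootfs
mount altogether and bind in only what each container actually needs,
e.g. something like -v /var/lib/ceph/<fsid>:/var/lib/ceph/<fsid>, plus
whatever ceph-volume and node-exporter specifically require.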
Thanks,

Ilya