On Tue, 3 Dec 2019, Sebastien Han wrote:
> Hi,
>
> I've started working on a saner way to deploy OSDs with Rook so that
> they don't use the rook binary image.
>
> Why were/are we using the rook binary to activate the OSD?
>
> A bit of background on containers first: when executing a container,
> we need to provide a command entrypoint that will act as PID 1. So if
> you want to do pre/post actions before running the process, you need
> to use a wrapper. In Rook, that's the rook binary, which has a CLI
> and can then "activate" an OSD.
> Currently, this "rook osd activate" call does the following:
>
> * sed the lvm.conf
> * run c-v lvm activate
> * run the osd process
>
> On shutdown, we intercept the signal, "kill -9" the osd, and
> de-activate the LV.
>
> I have a patch here: https://github.com/rook/rook/pull/4386, which
> solves the initial bullet points, but one thing we cannot do is the
> signal catching and the lv de-activation.

What if we implement a ceph-volume command similar to 'activate' that
*also* runs ceph-osd, catches the signal, and cleans up LVM
afterwards?

It occurs to me that we probably want a similar sequence for
ceph-daemon too in the dm-crypt case, where ideally we'd set up the
encrypted device, start the osd, and on shutdown, tear it down again.

> Before you ask: Kubernetes has pre/post-hooks, but they are not
> reliable; it's known and documented that there is no guarantee they
> will actually run before or after the container starts/stops. We
> tried, and we had issues.
>
> Why do we want to stop using the rook binary for activation? Because
> each time we get a new binary version (a new operator version), this
> restarts all the OSDs, even if nothing in the deployment spec
> changed other than the rook image version.
>
> Also, with containers we have seen so many issues working with LVM,
> just to name a few:
>
> * lvm filters have to be adapted
> * interactions with udev: the lvm config needs tuning, and even c-v
>   itself has a built-in lvm flag to not sync with udev
> * several bind mounts are required
> * the lvm package must be present on the host even when running in
>   containers
> * SELinux: yes, lvm calls SELinux commands under the hood and
>   pollutes the logs in some scenarios
>
> Currently, one of the ways I can see this working is by not using
> LVM when bootstrapping OSDs. Unfortunately, some of the logic cannot
> go in the OSD code, since the lv de-activation happens after the OSD
> stops. We need to de-activate the LV so that, when running in the
> cloud, the block device can safely be re-attached to a new machine
> without LVM issues.
>
> I know this will be a bit challenging and might ultimately look like
> ceph-disk, but it'd be nice to consider it.

What about a small prototype for Bluestore with block/db/wal on the
same disk?

> If this gets rejected, I might try a prototype for not using c-v in
> Rook, or something else that might come up with this discussion.

An LVM-less approach is appealing. The main case that it doesn't
cover is an encrypted device: we need somewhere to stash metadata
about the encrypted device and the key that's used to fetch the
decryption key. This is very awkward to do with a bare device.

I think we'll need something like it eventually for seastore, but I'm
worried about building yet another not-quite-as-general-as-we'd-hoped
scheme. Something that explicitly does bluestore only and does not
support dmcrypt could be pretty straightforward, though...

I think it would basically just have to use the bluestore device
label to populate a simple .json or /var/lib/ceph/osd directory
inside the container.
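To make that concrete, here is a rough, untested Python sketch of
what such a container entrypoint could look like. It assumes a
bluestore main device path is passed in, and that ceph-bluestore-tool,
ceph-osd, and a usable ceph.conf are already inside the container;
the names here are illustrative, and the exact set of files that
prime-osd-dir writes out (whoami, block symlink, keyring from the
label's osd_key, ...) is what I'd want to double-check first.

#!/usr/bin/env python3
# Rough sketch only: bluestore main device, no LVM, no dmcrypt.
import json
import os
import signal
import subprocess
import sys

def activate_and_run(device):
    # Read the bluestore on-disk label to find out which OSD this is.
    out = subprocess.check_output(
        ['ceph-bluestore-tool', 'show-label', '--dev', device])
    label = json.loads(out)[device]
    osd_id = label['whoami']
    osd_dir = '/var/lib/ceph/osd/ceph-%s' % osd_id

    # Populate the osd dir from the label instead of from LVM tags.
    os.makedirs(osd_dir, exist_ok=True)
    subprocess.check_call(
        ['ceph-bluestore-tool', 'prime-osd-dir',
         '--dev', device, '--path', osd_dir])

    # Run the osd in the foreground as a child, so that we (PID 1)
    # can forward SIGTERM/SIGINT instead of relying on k8s hooks.
    osd = subprocess.Popen(
        ['ceph-osd', '-f', '-i', str(osd_id), '--osd-data', osd_dir])

    def forward(signum, _frame):
        osd.send_signal(signum)

    signal.signal(signal.SIGTERM, forward)
    signal.signal(signal.SIGINT, forward)

    ret = osd.wait()
    # Nothing to tear down in the raw case; in the c-v lvm variant,
    # this is where the lv de-activation (and any dm-crypt teardown)
    # would happen, after the osd has actually exited.
    return ret

if __name__ == '__main__':
    sys.exit(activate_and_run(sys.argv[1]))

Usage would just be making this the container entrypoint with the
device path as the only argument (e.g. 'entrypoint.py /dev/vdb').
The nice part is that there is no LVM state left to clean up on exit,
which is exactly what makes the signal-catching requirement go away
for the bluestore-only scheme.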
sage