On Tue, Dec 3, 2019 at 11:56 AM Sebastien Han <shan@xxxxxxxxxx> wrote:
>
> Hi,
>
> I've started working on a saner way to deploy OSD with Rook so that
> they don't use the rook binary image.
>
> Why were/are we using the rook binary to activate the OSD?
>
> A bit of background on containers first, when executing a container,
> we need to provide a command entrypoint that will act as PID 1. So if
> you want to do pre/post action before running the process you need to
> use a wrapper. In Rook, that's the rook binary, which has a CLI and
> can then "activate" an OSD.
> Currently, this "rook osd activate" call does the following:
>
> * sed the lvm.conf
> * run c-v lvm activate
> * run the osd process
>
> On shutdown, we intercept the signal, "kill -9" the osd and de-activate the LV.
>
> I have a patch here: https://github.com/rook/rook/pull/4386, that
> solves the initial bullet points but one thing we cannot do is the
> signal catching and the lv de-activation.
> Before you ask, Kubernetes has pre/post-hook but they are not
> reliable, it's known and documented that there is no guarantee they
> would actually run before or after the container starts/stops. We
> tried and we had issues.
>
> Why do we want to stop using the rook binary for activation? Because
> each time we get a new binary version (new operator version), this
> will restart all the OSDs, even if the deployment spec didn't change,
> at least if nothing else than the rook image version changed.
>
> Also with containers, we have seen so many issues working with LVM,
> just to name a few:
>
> * adapt lvm filters
> * interactions with udev - need to tune the lvm config, even c-v
> itself has lvm flag to not sync with udev built-in
> * several bindmounts
> * lvm package must be present on the host even if running in containers
> * SELinux, yes lvm calls SELinux commands under the hood and pollute
> the logs in some scenarios
>
> Currently, one of the ways I can see this working is by not using LVM
> when bootstrapping OSDs. Unfortunately, some of the logic cannot go in
> the OSD code since the lv de-activation happens after the OSD stops.
> We need to de-activate the LV so when running in the Cloud the block
> can safely be re-attached to a new machine without LVM issues.
>
> I know this will be a bit challenging and might ultimately look like
> ceph-disk but it'd be nice to consider it.
> What about a small prototype for Bluestore with block/db/wal on the same disk?

You raise some good points here, and I agree that there are many
issues with containers and LVM. There were also quite a few issues
with ceph-disk in containers, but those issues are not as relevant as
making the OSD provisioning easier for everyone else.

One of the main ideas I brought up when trying to design ceph-volume
was to be completely agnostic on how the OSDs came to be: partitions?
full devices? LVM? something else? It was interesting to imagine a
scenario where the setup didn't matter much, and ceph-volume would
just be in charge of "activating" (ensuring everything is ready for
the ceph-osd daemon). That idea got push-back in favor of being
opinionated and choosing LVM.

The amount of LVM-specific internals ceph-volume has to deal with is
enormous, because with LVM came requests for more flexibility and
more options to make it easier to use.
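For reference, the wrapper flow described in the quoted message (run the pre-steps, start the OSD as a child of PID 1, intercept the signal, kill the child, then de-activate the LV) can be sketched roughly as below. This is a minimal sketch under stated assumptions, not Rook's actual code: the function name `run_as_pid1` and its parameters are hypothetical, and the commands are passed in by the caller rather than hard-coded.

```python
import signal
import subprocess
import threading


def run_as_pid1(pre_cmds, main_cmd, post_cmds):
    """Hypothetical wrapper: run pre_cmds, run main_cmd as a child
    process, then always run post_cmds once the child has exited.

    Each command is an argv list, e.g. ["ceph-osd", "--foreground", ...].
    """
    # Pre-steps (in the quoted message: edit lvm.conf, activate the OSD).
    for cmd in pre_cmds:
        subprocess.run(cmd, check=True)

    # The long-running process (in the quoted message: the OSD itself).
    child = subprocess.Popen(main_cmd)

    def forward(signum, frame):
        # The quoted message says the current wrapper does "kill -9",
        # so this sketch sends SIGKILL rather than forwarding the signal.
        child.kill()

    # Signal handlers can only be installed from the main thread.
    if threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGTERM, forward)
        signal.signal(signal.SIGINT, forward)

    returncode = child.wait()

    # Post-steps (in the quoted message: de-activate the LV). These run
    # whether the child exited on its own or was killed by the handler.
    for cmd in post_cmds:
        subprocess.run(cmd, check=False)

    return returncode
```

The post-steps are the part that cannot live inside the OSD process, since they have to run after it stops; that is why some wrapper (or an equivalent mechanism) has to stay alive as PID 1 for the lifetime of the container.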
The `simple` sub-command was an attempt to introduce the hands-off
approach to OSD activation, by requiring just a little bit of metadata
in /etc/ceph/osd/*.json, where each OSD is represented by a single
JSON file with some information. That approach not only works well for
ceph-disk OSDs, but should also work well with whatever else you may
come up with. Have you tried `simple` and not gotten results? If so,
what went wrong?

Another option, if `simple` doesn't achieve what Rook needs, is
perhaps a separate sub-command (ceph-volume container?) implemented as
a plugin, so that it reuses all the well-tested utilities that
ceph-volume already has. The ZFS plugin did something like that
already.

Creating OSDs on your (Rook's) own is a *very* hard task to get right,
not to mention the many different ways OSDs allow you to configure
them: filestore (dedicated, collocated), bluestore (data, data+db,
data+wal, data+db+wal), dmcrypt or unencrypted. Plus other nuances
like talking to the monitor, and sending/retrieving information that
has changed between releases.

>
> If this gets rejected, I might try a prototype for not using c-v in
> Rook or something else that might come up with this discussion.
>
> Thanks!
> –––––––––
> Sébastien Han
> Senior Principal Software Engineer, Storage Architect
>
> "Always give 100%. Unless you're giving blood."
> _______________________________________________
> Dev mailing list -- dev@xxxxxxx
> To unsubscribe send an email to dev-leave@xxxxxxx