Re: ceph-volume simple disk scenario without LVM for OSD on PVC

On Tue, Dec 3, 2019 at 11:56 AM Sebastien Han <shan@xxxxxxxxxx> wrote:
>
> Hi,
>
> I've started working on a saner way to deploy OSDs with Rook so that
> they don't use the rook binary image.
>
> Why were/are we using the rook binary to activate the OSD?
>
> A bit of background on containers first, when executing a container,
> we need to provide a command entrypoint that will act as PID 1. So if
> you want to do pre/post action before running the process you need to
> use a wrapper. In Rook, that's the rook binary, which has a CLI and
> can then "activate" an OSD.
> Currently, this "rook osd activate" call does the following:
>
> * sed the lvm.conf
> * run c-v lvm activate
> * run the osd process
>
> On shutdown, we intercept the signal, "kill -9" the osd and de-activate the LV.
>
> I have a patch here: https://github.com/rook/rook/pull/4386, which
> addresses the first bullet points, but one thing we cannot do is the
> signal catching and the LV de-activation.
> Before you ask, Kubernetes has pre/post hooks, but they are not
> reliable; it's known and documented that there is no guarantee they
> will actually run before or after the container starts/stops. We
> tried, and we had issues.
>
> Why do we want to stop using the rook binary for activation? Because
> each time we get a new binary version (a new operator version), all
> the OSDs restart, even if nothing in the deployment spec changed
> other than the rook image version.
>
> Also with containers, we have seen so many issues working with LVM,
> just to name a few:
>
> * adapting lvm filters
> * interactions with udev: we need to tune the lvm config; even c-v
> itself has a built-in lvm flag to not sync with udev
> * several bindmounts
> * the lvm package must be present on the host even if running in containers
> * SELinux: yes, lvm calls SELinux commands under the hood and pollutes
> the logs in some scenarios
>
> Currently, one of the ways I can see this working is by not using LVM
> when bootstrapping OSDs. Unfortunately, some of the logic cannot go in
> the OSD code, since the LV de-activation happens after the OSD stops.
> We need to de-activate the LV so that, when running in the cloud, the
> block device can safely be re-attached to a new machine without LVM issues.
>
> I know this will be a bit challenging and might ultimately look like
> ceph-disk but it'd be nice to consider it.
> What about a small prototype for Bluestore with block/db/wal on the same disk?

You raise some good points here, and I agree that there are many
issues with containers and LVM. There were also quite a few issues
with ceph-disk in containers, but those issues matter less than
making OSD provisioning easier for everyone else.

One of the main ideas I brought up when trying to design ceph-volume
was to be completely agnostic about how the OSDs came to be:
partitions? Full devices? LVM? Something else?

It was interesting to imagine a scenario where the setup didn't matter
much, and ceph-volume would just be in charge of "activating"
(ensuring everything is ready for the ceph-osd daemon). That idea got
push-back in favor of being opinionated and choosing LVM. The amount
of LVM-specific internals ceph-volume now has to deal with is
enormous, because with LVM came requests for more flexibility and
more options to make it easier to use.

The `simple` sub-command was an attempt to introduce that hands-off
approach to OSD activation, by requiring just a little bit of metadata
in /etc/ceph/osd/*.json, where each OSD is represented by a single
JSON file with some information. That approach not only works well for
ceph-disk OSDs, but should also work well with whatever else you may
come up with... have you tried `simple` and not gotten results? If so,
what went wrong?
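For reference, a `simple` metadata file is roughly shaped like this (an
illustrative sketch: the device paths and UUIDs are made up, and the exact
field set is based on what `ceph-volume simple scan` records for a bluestore
OSD, which may vary between releases):

```json
{
    "type": "bluestore",
    "cluster_name": "ceph",
    "fsid": "00000000-0000-0000-0000-000000000000",
    "whoami": 0,
    "data": {
        "path": "/dev/sdb1",
        "uuid": "11111111-1111-1111-1111-111111111111"
    },
    "block": {
        "path": "/dev/sdb2",
        "uuid": "22222222-2222-2222-2222-222222222222"
    }
}
```

With a file like that in place, `ceph-volume simple activate` should be able
to bring the OSD up without caring how the devices were originally created.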

Another option, if `simple` doesn't achieve what Rook needs, is to
implement a separate sub-command (ceph-volume container?) as a plugin,
so that it reuses all the well-tested utilities that ceph-volume
already has. The ZFS plugin already did something like that.
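A rough sketch of what such a sub-command could look like (everything here is
hypothetical: the `container` name, the class layout, and the flags are
assumptions modeled on how ceph-volume sub-commands parse their arguments,
not a real API):

```python
import argparse


class Container(object):
    """Hypothetical 'ceph-volume container' sub-command: activate an OSD
    that was prepared on a raw device, without any LVM involvement."""

    help = 'Activate OSDs on raw devices or partitions, without LVM'

    def __init__(self, argv):
        self.argv = argv

    def main(self):
        parser = argparse.ArgumentParser(
            prog='ceph-volume container',
            description=self.help,
        )
        parser.add_argument('--osd-id', required=True, help='OSD id to activate')
        parser.add_argument('--osd-fsid', required=True, help='OSD fsid (UUID)')
        parser.add_argument('--device', required=True, help='raw data device')
        args = parser.parse_args(self.argv)
        # A real implementation would reuse ceph-volume's utilities here:
        # read the bluestore label off the device, prime the OSD directory
        # under /var/lib/ceph/osd/, and chown the files -- no LVM anywhere.
        return args
```

A real plugin would be registered through a setuptools entry point, the same
way the ZFS plugin hooks in, and main() would call into ceph-volume's
existing device and systemd helpers rather than just parsing arguments.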

Creating OSDs on your (Rook's) own is a *very* hard task to get right,
not to mention the many different ways OSDs can be configured:
filestore (dedicated or collocated journal), bluestore (data, data+db,
data+wal, data+db+wal), dmcrypt or unencrypted. Plus other nuances
like talking to the monitor, and sending/retrieving information that
has changed between releases.

>
> If this gets rejected, I might try a prototype for not using c-v in
> Rook or something else that might come up with this discussion.
>
> Thanks!
> –––––––––
> Sébastien Han
> Senior Principal Software Engineer, Storage Architect
>
> "Always give 100%. Unless you're giving blood."
> _______________________________________________
> Dev mailing list -- dev@xxxxxxx
> To unsubscribe send an email to dev-leave@xxxxxxx



