On Tue, 3 Dec 2019, Sebastien Han wrote:
> Hi,
>
> I've started working on a saner way to deploy OSDs with Rook so that
> they don't use the rook binary image.
>
> Why were/are we using the rook binary to activate the OSD?
>
> A bit of background on containers first: when executing a container,
> we need to provide a command entrypoint that will act as PID 1. So if
> you want to do pre/post actions before running the process, you need
> to use a wrapper. In Rook, that's the rook binary, which has a CLI
> and can then "activate" an OSD.
> Currently, this "rook osd activate" call does the following:
>
> * sed the lvm.conf
> * run c-v lvm activate
> * run the osd process
>
> On shutdown, we intercept the signal, "kill -9" the osd, and
> de-activate the LV.
>
> I have a patch here: https://github.com/rook/rook/pull/4386, which
> solves the initial bullet points, but one thing we cannot do is the
> signal catching and the lv de-activation.

What if we implement a ceph-volume command similar to 'activate' that
*also* runs ceph-osd, catches the signal, and cleans up LVM
afterwards?

It occurs to me that we probably want a similar sequence for
ceph-daemon too in the dm-crypt case, where ideally we'd set up the
encrypted device, start the osd, and on shutdown, tear it down again.

> Before you ask: Kubernetes has pre/post-hooks, but they are not
> reliable; it's known and documented that there is no guarantee they
> will actually run before or after the container starts/stops. We
> tried, and we had issues.
>
> Why do we want to stop using the rook binary for activation? Because
> each time we get a new binary version (a new operator version), this
> restarts all the OSDs, even if nothing in the deployment spec
> changed other than the rook image version.
>
> Also, with containers we have seen so many issues working with LVM,
> just to name a few:
>
> * lvm filters have to be adapted
> * interactions with udev: the lvm config needs tuning, and even c-v
>   itself has a built-in lvm flag to not sync with udev
> * several bind mounts are required
> * the lvm package must be present on the host even when running in
>   containers
> * SELinux: yes, lvm calls SELinux commands under the hood and
>   pollutes the logs in some scenarios
>
> Currently, one of the ways I can see this working is by not using
> LVM when bootstrapping OSDs. Unfortunately, some of the logic cannot
> go in the OSD code, since the lv de-activation happens after the OSD
> stops. We need to de-activate the LV so that, when running in the
> cloud, the block device can safely be re-attached to a new machine
> without LVM issues.
>
> I know this will be a bit challenging and might ultimately look like
> ceph-disk, but it'd be nice to consider it.

What about a small prototype for Bluestore with block/db/wal on the
same disk?

> If this gets rejected, I might try a prototype for not using c-v in
> Rook, or something else that might come up with this discussion.

An LVM-less approach is appealing. The main case that it doesn't
cover is an encrypted device: we need somewhere to stash metadata
about the encrypted device and the key that's used to fetch the
decryption key. This is very awkward to do with a bare device.

I think we'll need something like it eventually for seastore, but I'm
worried about building yet another not-quite-as-general-as-we'd-hoped
scheme. Something that explicitly does bluestore only and does not
support dmcrypt could be pretty straightforward, though...

I think it would basically just have to use the bluestore device
label to populate a simple .json or /var/lib/ceph/osd directory
inside the container.
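To make that concrete, here is a rough, untested Python sketch of
what such a container entrypoint could look like. It assumes a
bluestore main device path is passed in, and that ceph-bluestore-tool,
ceph-osd, and a usable ceph.conf are already inside the container;
the names here are illustrative, and the exact set of files that
prime-osd-dir writes out (whoami, block symlink, keyring from the
label's osd_key, ...) is what I'd want to double-check first.

#!/usr/bin/env python3
# Rough sketch only: bluestore main device, no LVM, no dmcrypt.
import json
import os
import signal
import subprocess
import sys

def activate_and_run(device):
    # Read the bluestore on-disk label to find out which OSD this is.
    out = subprocess.check_output(
        ['ceph-bluestore-tool', 'show-label', '--dev', device])
    label = json.loads(out)[device]
    osd_id = label['whoami']
    osd_dir = '/var/lib/ceph/osd/ceph-%s' % osd_id

    # Populate the osd dir from the label instead of from LVM tags.
    os.makedirs(osd_dir, exist_ok=True)
    subprocess.check_call(
        ['ceph-bluestore-tool', 'prime-osd-dir',
         '--dev', device, '--path', osd_dir])

    # Run the osd in the foreground as a child, so that we (PID 1)
    # can forward SIGTERM/SIGINT instead of relying on k8s hooks.
    osd = subprocess.Popen(
        ['ceph-osd', '-f', '-i', str(osd_id), '--osd-data', osd_dir])

    def forward(signum, _frame):
        osd.send_signal(signum)

    signal.signal(signal.SIGTERM, forward)
    signal.signal(signal.SIGINT, forward)

    ret = osd.wait()
    # Nothing to tear down in the raw case; in the c-v lvm variant,
    # this is where the lv de-activation (and any dm-crypt teardown)
    # would happen, after the osd has actually exited.
    return ret

if __name__ == '__main__':
    sys.exit(activate_and_run(sys.argv[1]))

Usage would just be making this the container entrypoint with the
device path as the only argument (e.g. 'entrypoint.py /dev/vdb').
The nice part is that there is no LVM state left to clean up on exit,
which is exactly what makes the signal-catching requirement go away
for the bluestore-only scheme.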
sage