Re: ceph-volume simple disk scenario without LVM for OSD on PVC

It looks like preStop hooks might be useful if Ceph daemons accepted some sort of graceful-quit signal, the way nginx does.
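A minimal sketch of what such a hook body could run, assuming a hypothetical graceful-quit behavior on SIGQUIT (ceph-osd does not implement nginx-style draining today, which is exactly the "if" above):

```bash
#!/usr/bin/env bash
# Hypothetical preStop hook body, exec'd by the kubelet before SIGTERM.
# Assumes ceph-osd would drain and exit cleanly on SIGQUIT -- it does
# not today, so this only works if such a signal were added.
osd_pid=$(pidof ceph-osd)
kill -QUIT "$osd_pid"
# Block until the daemon is gone, within terminationGracePeriodSeconds.
while kill -0 "$osd_pid" 2>/dev/null; do sleep 1; done
```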


From: Blaine Gardner <BlGardner@xxxxxxxx>
Sent: Tuesday, December 3, 2019 12:17
To: Sebastien Han <shan@xxxxxxxxxx>; dev@xxxxxxx <dev@xxxxxxx>
Cc: Travis Nielsen <tnielsen@xxxxxxxxxx>
Subject: Re: ceph-volume simple disk scenario without LVM for OSD on PVC
 
We could play with making the entrypoint a simple bash script that traps the EXIT signal and kills the captured PID of the OSD. A `wait` might or might not be necessary. Something like:
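A minimal sketch, assuming the OSD id arrives in an $OSD_ID environment variable (a placeholder) and that `ceph-osd --foreground` is the launch command; the teardown body is whatever cleanup we decide on:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Cleanup that must run no matter how we exit (e.g. LV deactivation).
teardown() {
    kill "$osd_pid" 2>/dev/null || true
    wait "$osd_pid" 2>/dev/null || true   # reap before cleaning up
    # ... deactivate the LV here ...
}
trap teardown EXIT

# Start the OSD in the background so this script stays PID 1 and
# keeps receiving signals from the kubelet.
ceph-osd --foreground --id "$OSD_ID" &
osd_pid=$!

# Forward TERM/INT to the OSD, then fall through to the EXIT trap.
trap 'kill -TERM "$osd_pid"' TERM INT
wait "$osd_pid"
```

The `wait` inside teardown is the "might or might not be necessary" part: it reaps the child so the cleanup only runs after the OSD has really exited.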


Blaine

From: Sebastien Han <shan@xxxxxxxxxx>
Sent: Tuesday, December 3, 2019 09:55
To: dev@xxxxxxx <dev@xxxxxxx>
Cc: Travis Nielsen <tnielsen@xxxxxxxxxx>
Subject: ceph-volume simple disk scenario without LVM for OSD on PVC
 
Hi,

I've started working on a saner way to deploy OSDs with Rook so that
they don't depend on the rook binary image.

Why were/are we using the rook binary to activate the OSD?

A bit of background on containers first: when executing a container,
we need to provide a command entrypoint that will act as PID 1. So if
you want to perform pre/post actions around the main process, you need
a wrapper. In Rook, that's the rook binary, which has a CLI and can
then "activate" an OSD.
Currently, this "rook osd activate" call does the following:

* sed the lvm.conf
* run c-v lvm activate
* run the osd process

On shutdown, we intercept the signal, "kill -9" the OSD, and deactivate the LV. Put together, the current behavior is roughly the sketch below.
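A rough shell translation of that lifecycle; the lvm.conf keys and the exact c-v/ceph-osd arguments are illustrative placeholders, not the literal Rook code:

```bash
#!/usr/bin/env bash
# 1. sed the lvm.conf: disable udev interaction inside the container
#    (illustrative keys; the real edits live in Rook's activate code).
sed -i -e 's/udev_sync = 1/udev_sync = 0/' \
       -e 's/udev_rules = 1/udev_rules = 0/' /etc/lvm/lvm.conf

# 2. Activate the OSD's logical volume without systemd units.
ceph-volume lvm activate --no-systemd "$OSD_ID" "$OSD_FSID"

# 3. Run the OSD, keeping the wrapper alive to intercept signals.
ceph-osd --foreground --id "$OSD_ID" &
osd_pid=$!

# On shutdown: kill -9 the OSD, then deactivate the LV.
trap 'kill -9 "$osd_pid"; ceph-volume lvm deactivate "$OSD_ID" "$OSD_FSID"' TERM INT
wait "$osd_pid"
```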

I have a patch here: https://github.com/rook/rook/pull/4386, that
solves the initial bullet points, but one thing we cannot do there is
the signal catching and the LV deactivation.
Before you ask, Kubernetes has pre/post hooks, but they are not
reliable: it's known and documented that there is no guarantee they
will actually run before or after the container starts/stops. We
tried, and we hit issues.

Why do we want to stop using the rook binary for activation? Because
each time we ship a new binary (a new operator version), all the OSDs
restart, even when nothing in the deployment spec changed except the
rook image version.

Also with containers, we have seen so many issues working with LVM;
just to name a few:

* adapting lvm filters
* interactions with udev: we have to tune the lvm config, and even c-v
itself has a built-in lvm flag to skip udev synchronization
* several bind mounts are required
* the lvm package must be present on the host even when running in containers
* SELinux: yes, lvm calls SELinux commands under the hood and pollutes
the logs in some scenarios

Currently, one of the ways I can see this working is by not using LVM
when bootstrapping OSDs. Unfortunately, some of the logic cannot go in
the OSD code, since the LV deactivation happens after the OSD stops.
We need to deactivate the LV so that, when running in the cloud, the
block device can safely be re-attached to a new machine without LVM
issues.

I know this will be a bit challenging and might ultimately look like
ceph-disk, but it'd be nice to consider.
What about a small prototype for Bluestore with block/db/wal on the same disk?
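For such a prototype, activation without LVM could lean on `ceph-bluestore-tool prime-osd-dir`, which rebuilds the OSD directory from the bluestore label on the raw device. A sketch, with the device path and OSD id as placeholders:

```bash
#!/usr/bin/env bash
set -euo pipefail

DEV=/dev/sdb                     # placeholder: raw bluestore device
OSD_ID=0                         # placeholder
OSD_DIR=/var/lib/ceph/osd/ceph-$OSD_ID

# Recreate the OSD data dir (keyring, metadata) straight from the
# bluestore label -- no lvm.conf, no udev, no LV to tear down later.
mkdir -p "$OSD_DIR"
ceph-bluestore-tool prime-osd-dir --dev "$DEV" --path "$OSD_DIR"
ln -snf "$DEV" "$OSD_DIR/block"

# With nothing to deactivate on exit, the OSD can simply be PID 1.
exec ceph-osd --foreground --id "$OSD_ID"
```

Since there is no LV left to deactivate, the shutdown problem above mostly disappears for this layout.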

If this gets rejected, I might try a prototype that does not use c-v
in Rook, or something else that comes out of this discussion.

Thanks!
–––––––––
Sébastien Han
Senior Principal Software Engineer, Storage Architect

"Always give 100%. Unless you're giving blood."
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
