On Fri, Dec 06, 2019 at 03:41:29PM +0000, Sage Weil wrote:
>On Fri, 6 Dec 2019, Sebastien Han wrote:
>> Cool, that works for me!
>
>Okay, so this won't work for a few reasons: (1) ceph-osd drops root privs so we can't do anything fancy on shutdown, and (2) the signal handler isn't set up right when the process starts, so it'll always be racy (the teardown process might not happen). Having the caller do this is really the right thing.
>
>After chatting with Seb, though, I think we really have two different problems:
>
>1) Seb's AWS problem: you can't do an EBS detach if there is an active VG(/LV) on the device. To fix this, you need to do vgchange -an, which deinitializes the LVs and VG. AFAICS, this doesn't make any sense on a bare-metal host, and would step on the toes of the generic LVM and udev infrastructure, which magically initializes all the LV devices it finds (and AFAICS doesn't ever try to disable that). (Also, IIRC c-v has a VG per cluster, so if you deinitialize the entire VG, wouldn't that kill all OSDs on the host for that cluster.. not just the one on the EBS volume you're detaching?)

VGs are generally per device. In some cases c-v creates multi-device VGs, but there is a bug open for that and I'm working on removing this scenario. However, a VG might still contain volumes from multiple OSDs (multi-device OSDs), so deactivating a VG might still kill multiple OSDs.

>In any case, the problem feels like an EBS vs LVM problem. And I think I'm back to Seb's original proposal here: the simplest way to solve this is to just not use LVM at all and to put bluestore on the raw device. You won't get dmcrypt or other fancy LVM features, but for EBS you don't need any of them (except, maybe, in the future, growing a volume/OSD, but that's something we need to teach bluestore to do regardless).
>
>2) My ceph-daemon problem: to make dmcrypt work (well), IMO the decrypted device should be set up when the OSD container is started, and torn down when the container stops. For this, the thing that makes sense in my mind is something like a '-f' flag for ceph-volume activate. IIUC, right now activate does something like
>
>1- set up decrypted LV, if needed
>2- populate /var/lib/ceph/osd/ceph-NN dir
>3- start systemd unit (unless the --no-systemd flag is passed, as we currently do with containers)
>4- exit.
>
>Instead, with the -f flag, it would
>
>1,2- same
>3- run ceph-osd -f -i ... in the foreground, watch for signals and pass them along to shut down the osd
>4- clean up /var/lib/ceph/osd/ceph-NN
>5- stop the decrypted LV
>6- exit
>
>This makes me realize that steps 4 and 5 don't currently exist anywhere: there is no such thing as 'ceph-volume lvm deactivate'. If we had that second part, a simple wrapper could accomplish the same thing as -f.

AFAICS this exists for simple mode, but not for the lvm case. In any case I think it makes sense to enable c-v to wrap the osd and clean up after it.
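Roughly, an 'lvm deactivate' would have to undo what activate sets up for a single OSD. A sketch of the steps using plain system commands (there is no c-v command wrapping these today, and the OSD/VG/LV/mapper names below are placeholders):

  # undo "ceph-volume lvm activate" for one OSD (sketch)
  umount /var/lib/ceph/osd/ceph-$OSD_ID     # tmpfs dir populated by activate
  cryptsetup close $DMCRYPT_MAPPER          # only for dmcrypt OSDs
  lvchange -an $VG_NAME/$LV_NAME            # deactivate just this OSD's LV(s)
  # vgchange -an $VG_NAME would take down every LV in the VG, which is the
  # multi-OSD problem mentioned above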
>I think we should pursue those 2 paths (barebones bluestore c-v mode and c-v lvm deactivate) separately...
>
>sage
>
>> –––––––––
>> Sébastien Han
>> Senior Principal Software Engineer, Storage Architect
>>
>> "Always give 100%. Unless you're giving blood."
>>
>> On Fri, Dec 6, 2019 at 3:03 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>> >
>> > On Fri, 6 Dec 2019, Sebastien Han wrote:
>> > > If not in ceph-osd, can we have ceph-osd execute a hook before exiting 0? Reading a hook script from /etc/ceph/hook.d, something like that would be nice so that we don't need a wrapper.
>> >
>> > Hmm, maybe if it was just osd_exec_on_shutdown=string, and that could be something like "vgchange ..." or "bash -c ..."? We'd need to make sure we're setting FD_CLOEXEC on all the right file handles though. I can give it a go..
>> >
>> > sage
>> >
>> > > Thoughts?
>> > >
>> > > Thanks!
>> > > –––––––––
>> > > Sébastien Han
>> > > Senior Principal Software Engineer, Storage Architect
>> > >
>> > > "Always give 100%. Unless you're giving blood."
>> > >
>> > > On Fri, Dec 6, 2019 at 2:50 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>> > > >
>> > > > On Fri, 6 Dec 2019, Sebastien Han wrote:
>> > > > > I understand this is asking a lot from the ceph-volume side. We can explore a new wrapper binary, or perhaps do it from ceph-osd itself.
>> > > > >
>> > > > > Maybe crazy/stupid idea, can we have a de-activate call from the osd process itself? ceph-osd gets SIGTERM, closes the connection to the device, then runs "vgchange -an <vg>", is this realistic?
>> > > >
>> > > > Not really... it's hard (or gross) to do a hard/immediate exit that tears down all of the open handles to the device. I think this is not a nice way to layer things. I'd prefer either a c-v command or a separate wrapper script to this.
>> > > >
>> > > > sage
>> > > >
>> > > > > Thanks!
>> > > > > –––––––––
>> > > > > Sébastien Han
>> > > > > Senior Principal Software Engineer, Storage Architect
>> > > > >
>> > > > > "Always give 100%. Unless you're giving blood."
>> > > > >
>> > > > > On Fri, Dec 6, 2019 at 1:44 PM Alfredo Deza <adeza@xxxxxxxxxx> wrote:
>> > > > > >
>> > > > > > On Fri, Dec 6, 2019 at 5:59 AM Sebastien Han <shan@xxxxxxxxxx> wrote:
>> > > > > > >
>> > > > > > > Hi,
>> > > > > > >
>> > > > > > > Following up on my previous ceph-volume email as promised.
>> > > > > > >
>> > > > > > > When running Ceph with Rook in Kubernetes in the cloud (AWS, Azure, Google, whatever), the OSDs are backed by PVCs (cloud block storage) attached to virtual machines. This makes the storage portable: if the VM dies, the device will be attached to a new virtual machine and the OSD will resume running.
>> > > > > > >
>> > > > > > > In Rook, we have 2 main deployments for the OSD:
>> > > > > > >
>> > > > > > > 1. Prepare the disk to become an OSD
>> > > > > > > Prepare will run on the VM, attach the block device, run "ceph-volume prepare", then this gets complicated. After this, the device is supposed to be detached from the VM because the container terminated. However, the block device is still held by LVM, so the VG must be de-activated. Currently, we do this in Rook, but it would be nice to de-activate the VG once ceph-volume is done preparing the disk in a container.
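For reference, the kind of post-prepare deactivation being described looks roughly like this (illustrative sketch only; the device name is made up):

  # prepare the device, then deactivate the VG that c-v created on it so the
  # cloud block device can be detached once the prepare container exits
  ceph-volume lvm prepare --data /dev/xvdf
  VG=$(pvs --noheadings -o vg_name /dev/xvdf | tr -d ' ')
  vgchange -an "$VG"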
>> > > > > > > 2. Activate the OSD.
>> > > > > > > Now, onto the new container, the device is attached again on the VM. At this point, more changes will be required in ceph-volume, particularly in the "activate" call.
>> > > > > > > a. ceph-volume should activate the VG
>> > > > > >
>> > > > > > By VG you mean LVM's Volume Group?
>> > > > > >
>> > > > > > > b. ceph-volume should activate the device normally
>> > > > > >
>> > > > > > Not "normally" though right? That would imply starting the OSD, which you are indicating is not desired.
>> > > > > >
>> > > > > > > c. ceph-volume should run the ceph-osd process in the foreground as well as accept flags to that CLI; we could have something like: "ceph-volume lvm activate --no-systemd $STORE_FLAG $OSD_ID $OSD_UUID <a bunch of flags>"
>> > > > > > > Perhaps we need a new flag to indicate we want to run the osd process in the foreground?
>> > > > > > > Here is an example of how an OSD runs today:
>> > > > > > >
>> > > > > > > ceph-osd --foreground --id 2 --fsid 9a531951-50f2-4d48-b012-0aef0febc301 --setuser ceph --setgroup ceph --crush-location=root=default host=minikube --default-log-to-file false --ms-learn-addr-from-peer=false
>> > > > > > >
>> > > > > > > --> we can have a bunch of flags or an ENV var with all the flags, whatever you prefer.
>> > > > > > >
>> > > > > > > This wrapper should watch for signals too; it should reply to SIGTERM in the following way:
>> > > > > > > - stop the OSD
>> > > > > > > - de-activate the VG
>> > > > > > > - exit 0
>> > > > > > >
>> > > > > > > Just a side note, the VG must be de-activated when the container stops so that the block device can be detached from the VMs, otherwise it'll still be held by LVM.
>> > > > > >
>> > > > > > I am worried that this goes beyond what I consider the scope of ceph-volume, which is: prepare device(s) to be part of an OSD.
>> > > > > >
>> > > > > > Catching signals, handling the OSD in the foreground, and accepting (proxying) flags sounds problematic for a robust implementation in ceph-volume, even if that means it will help Rook in this case.
>> > > > > >
>> > > > > > The other challenge I see is that it seems Ceph is in a transition from being a baremetal project to a container one, except lots of tooling (like ceph-volume) is deeply tied to the non-containerized workflows. This makes it difficult (and non-obvious!) in ceph-volume when adding more flags to do things that help the containerized deployment.
>> > > > > >
>> > > > > > To solve the issues you describe, I think you need either a separate command-line tool that can invoke ceph-volume with the added features you listed, or, if there is significant push to get more things into ceph-volume, a separate sub-command, so that `lvm` is isolated from the conflicting logic.
>> > > > > >
>> > > > > > My preference would be a wrapper script, separate from the Ceph project.
>> > > > > >
>> > > > > > > Hopefully, I was clear :).
>> > > > > > > This is just a proposal; if you feel like this could be done differently, feel free to suggest.
>> > > > > > >
>> > > > > > > Thanks!
>> > > > > > > –––––––––
>> > > > > > > Sébastien Han
>> > > > > > > Senior Principal Software Engineer, Storage Architect
>> > > > > > >
>> > > > > > > "Always give 100%. Unless you're giving blood."
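To make the wrapper option concrete, here is a rough, untested sketch of what such a script could look like outside of ceph-volume. It assumes OSD_ID, OSD_FSID and VG_NAME are provided by the caller (e.g. by Rook), and that the deactivation step still has to be spelled out by hand since c-v has no deactivate command:

  #!/bin/bash
  # Sketch of a container entrypoint: activate the OSD, run ceph-osd in the
  # foreground, and deactivate the VG again on SIGTERM so the block device
  # can be detached from the VM.
  set -e

  cleanup() {
      kill -TERM "$OSD_PID" 2>/dev/null || true   # stop the OSD
      wait "$OSD_PID" || true
      umount "/var/lib/ceph/osd/ceph-$OSD_ID" || true
      vgchange -an "$VG_NAME"                     # de-activate the VG
      exit 0
  }
  trap cleanup TERM INT

  ceph-volume lvm activate --no-systemd "$OSD_ID" "$OSD_FSID"
  ceph-osd --foreground --id "$OSD_ID" --setuser ceph --setgroup ceph &
  OSD_PID=$!
  wait "$OSD_PID" || true
  cleanup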
--
Jan Fajerski
Senior Software Engineer Enterprise Storage
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg) Geschäftsführer: Felix Imendörffer
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx