Re: [RFE] ceph-volume prepare and activate enhancements for containers

On Fri, 6 Dec 2019, Sebastien Han wrote:
> Cool, that works for me!

Okay, so this won't work for a few reasons: (1) ceph-osd drops root 
privileges, so we can't do anything fancy on shutdown, and (2) the signal 
handler isn't set up right away when the process starts, so it'll always 
be racy (the teardown might not happen).  Having the caller do this is 
really the right thing.

After chatting with Seb, though, I think we really have two different 
problems:

1) Seb's AWS problem: you can't do an EBS detach if there is an active 
VG(/LV) on the device.  To fix this, you need to do vgchange -an, which 
deactivates the LVs and the VG.  AFAICS, this doesn't make any sense on a 
bare-metal host, and would step on the toes of the generic LVM and udev 
infrastructure, which magically activates all the LV devices it finds 
(and AFAICS doesn't ever try to disable that).  (Also, IIRC c-v has a VG 
per cluster, so if you deactivate the entire VG, wouldn't that kill all 
OSDs on the host for that cluster, not just the one on the EBS volume 
you're detaching?)
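
For reference, the teardown the detach path has to do today is roughly 
something like this (VG name is illustrative):

  # Deactivate every LV in the cluster VG so the EBS volume can detach.
  # Note this takes down *all* LVs in that VG on this host, not just the
  # one backing the OSD being detached.
  vgchange -an ceph-9a531951-50f2-4d48-b012-0aef0febc301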

In any case, the problem feels like an EBS vs LVM problem.  And I think 
I'm back to Seb's original proposal here: the simplest way to solve this 
is to just not use LVM at all and to put bluestore on the raw device.  
You won't get dmcrypt or other fancy LVM features, but for EBS you don't 
need any of them (except, maybe, in the future, growing a volume/OSD, but 
that's something we need to teach bluestore to do regardless).
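
A hypothetical invocation could look something like this (the 'raw' 
subcommand and flags are just a sketch; nothing like this exists in c-v 
today):

  # prepare bluestore directly on the raw EBS device, no VG/LV in between,
  # so there is nothing to deactivate before the volume detaches
  ceph-volume raw prepare --bluestore --data /dev/nvme1n1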

2) My ceph-daemon problem: to make dmcrypt work (well), IMO the decrypted 
device should be set up when the OSD container is started, and torn down 
when the container stops.  For this, the thing that makes sense in my mind 
is something like a '-f' flag for ceph-volume activate. IIUC, right now 
activate does something like

1- set up decrypted LV, if needed
2- populate /var/lib/ceph/osd/ceph-NN dir
3- start systemd unit (unless the --no-systemd flag is passed, as we 
currently do with containers)
4- exit.

Instead, with the -f flag, it would

1,2- same
3- run ceph-osd -f -i ... in the foreground.  Watch for signals and 
pass them along to shut down the OSD.
4- clean up /var/lib/ceph/osd/ceph-NN
5- stop the decrypted LV
6- exit

This makes me realize that steps 4 and 5 don't currently exist anywhere: 
there is no such thing as 'ceph-volume lvm deactivate'.  If we had that 
second part, a simple wrapper could accomplish the same thing as -f.
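
Roughly, such a wrapper could look something like this (sketch only: the 
'lvm deactivate' command at the end is the missing piece, and the exact 
ceph-osd flags are illustrative):

  #!/bin/sh
  # populate /var/lib/ceph/osd/ceph-$OSD_ID (and set up the decrypted LV,
  # if any) without starting a systemd unit
  ceph-volume lvm activate --no-systemd "$OSD_ID" "$OSD_UUID"

  # run the OSD in the foreground and forward SIGTERM/SIGINT to it
  trap 'kill -TERM "$pid" 2>/dev/null' TERM INT
  ceph-osd --foreground --id "$OSD_ID" --fsid "$OSD_UUID" \
      --setuser ceph --setgroup ceph &
  pid=$!
  wait "$pid"
  wait "$pid"   # wait again in case the first wait was interrupted by the trap

  # tear down /var/lib/ceph/osd/ceph-$OSD_ID and close the decrypted LV
  # (this would be the missing 'ceph-volume lvm deactivate')
  ceph-volume lvm deactivate "$OSD_ID" "$OSD_UUID"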

I think we should pursue those 2 paths (barebones bluestore c-v mode and 
c-v lvm deactivate) separately...

sage


> –––––––––
> Sébastien Han
> Senior Principal Software Engineer, Storage Architect
> 
> "Always give 100%. Unless you're giving blood."
> 
> On Fri, Dec 6, 2019 at 3:03 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
> >
> > On Fri, 6 Dec 2019, Sebastien Han wrote:
> > > If not in ceph-osd, can we have ceph-osd execute a hook before exiting 0?
> > > Reading a hook script from /etc/ceph/hook.d, or something like that, would
> > > be nice so that we don't need a wrapper.
> >
> > Hmm, maybe if it was just osd_exec_on_shutdown=string, and that could
> > be something like "vgchange ..." or "bash -c ..."?  We'd need to make
> > sure we're setting FD_CLOEXEC on all the right file handles though.  I can
> > give it a go..
> >
> > sage
> >
> > >
> > > Thoughts?
> > >
> > > Thanks!
> > > –––––––––
> > > Sébastien Han
> > > Senior Principal Software Engineer, Storage Architect
> > >
> > > "Always give 100%. Unless you're giving blood."
> > >
> > > On Fri, Dec 6, 2019 at 2:50 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
> > > >
> > > > On Fri, 6 Dec 2019, Sebastien Han wrote:
> > > > > I understand this is asking a lot from the ceph-volume side.
> > > > > We can explore a new wrapper binary, or perhaps do this from ceph-osd itself.
> > > > >
> > > > > Maybe a crazy/stupid idea: can we have a de-activate call from the osd
> > > > > process itself? ceph-osd gets SIGTERM, closes the connection to the
> > > > > device, then runs "vgchange -an <vg>". Is this realistic?
> > > >
> > > > Not really... it's hard (or gross) to do a hard/immediate exit that tears
> > > > down all of the open handles to the device.  I think this is not a nice
> > > > way to layer things.  I'd prefer either a c-v command or a separate
> > > > wrapper script for this.
> > > >
> > > > sage
> > > >
> > > >
> > > > >
> > > > > Thanks!
> > > > > –––––––––
> > > > > Sébastien Han
> > > > > Senior Principal Software Engineer, Storage Architect
> > > > >
> > > > > "Always give 100%. Unless you're giving blood."
> > > > >
> > > > > On Fri, Dec 6, 2019 at 1:44 PM Alfredo Deza <adeza@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Fri, Dec 6, 2019 at 5:59 AM Sebastien Han <shan@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Following up on my previous ceph-volume email as promised.
> > > > > > >
> > > > > > > When running Ceph with Rook in Kubernetes in the cloud (AWS, Azure,
> > > > > > > Google, whatever), the OSDs are backed by PVCs (cloud block storage)
> > > > > > > attached to virtual machines.
> > > > > > > This makes the storage portable: if the VM dies, the device will be
> > > > > > > attached to a new virtual machine and the OSD will resume running.
> > > > > > >
> > > > > > > In Rook, we have 2 main deployments for the OSD:
> > > > > > >
> > > > > > > 1. Prepare the disk to become an OSD
> > > > > > > Prepare will run on the VM, attach the block device, run "ceph-volume
> > > > > > > prepare", and then this gets complicated. After this, the device is
> > > > > > > supposed to be detached from the VM because the container terminated.
> > > > > > > However, the block device is still held by LVM, so the VG must be
> > > > > > > de-activated. Currently, we do this in Rook, but it would be nice to
> > > > > > > de-activate the VG once ceph-volume is done preparing the disk in a
> > > > > > > container.
> > > > > > >
> > > > > > > 2. Activate the OSD.
> > > > > > > Now, in the new container, the device is attached to the VM again.
> > > > > > > At this point, more changes will be required in ceph-volume,
> > > > > > > particularly in the "activate" call.
> > > > > > >   a. ceph-volume should activate the VG
> > > > > >
> > > > > > By VG you mean LVM's Volume Group?
> > > > > >
> > > > > > >   b. ceph-volume should activate the device normally
> > > > > >
> > > > > > Not "normally" though right? That would imply starting the OSD which
> > > > > > you are indicating is not desired.
> > > > > >
> > > > > > >   c. ceph-volume should run the ceph-osd process in the foreground as
> > > > > > > well as accept flags on that CLI; we could have something like:
> > > > > > > "ceph-volume lvm activate --no-systemd $STORE_FLAG $OSD_ID $OSD_UUID
> > > > > > > <a bunch of flags>"
> > > > > > >   Perhaps we need a new flag to indicate we want to run the osd
> > > > > > > process in the foreground?
> > > > > > >   Here is an example of how an OSD runs today:
> > > > > > >
> > > > > > >   ceph-osd --foreground --id 2 --fsid
> > > > > > > 9a531951-50f2-4d48-b012-0aef0febc301 --setuser ceph --setgroup ceph
> > > > > > > --crush-location=root=default host=minikube --default-log-to-file
> > > > > > > false --ms-learn-addr-from-peer=false
> > > > > > >
> > > > > > >   --> we can have a bunch of flags or an ENV var with all the flags,
> > > > > > > whatever you prefer.
> > > > > > >
> > > > > > >   This wrapper should watch for signals too; it should respond to
> > > > > > > SIGTERM in the following way:
> > > > > > >     - stop the OSD
> > > > > > >     - de-activate the VG
> > > > > > >     - exit 0
> > > > > > >
> > > > > > > Just a side note: the VG must be de-activated when the container stops
> > > > > > > so that the block device can be detached from the VM; otherwise,
> > > > > > > it'll still be held by LVM.
> > > > > >
> > > > > > I am worried that this goes beyond what I consider the scope of
> > > > > > ceph-volume, which is to prepare device(s) to be part of an OSD.
> > > > > >
> > > > > > Catching signals, handling the OSD in the foreground, and accepting
> > > > > > (proxying) flags all sound problematic for a robust implementation in
> > > > > > ceph-volume, even if it would help Rook in this case.
> > > > > >
> > > > > > The other challenge I see is that Ceph seems to be in a transition
> > > > > > from being a bare-metal project to a containerized one, except lots of
> > > > > > tooling (like ceph-volume) is deeply tied to the non-containerized
> > > > > > workflows. This makes it difficult (and non-obvious!) to add more flags
> > > > > > to ceph-volume to do things that help the containerized deployment.
> > > > > >
> > > > > > To solve the issues you describe, I think you need either a separate
> > > > > > command-line tool that can invoke ceph-volume with the added features
> > > > > > you listed, or, if there is significant push to get more things into
> > > > > > ceph-volume, a separate sub-command, so that `lvm` is isolated from the
> > > > > > conflicting logic.
> > > > > >
> > > > > > My preference would be a wrapper script, separate from the Ceph project.
> > > > > >
> > > > > > >
> > > > > > > Hopefully, I was clear :).
> > > > > > > This is just a proposal if you feel like this could be done
> > > > > > > differently, feel free to suggest.
> > > > > > >
> > > > > > > Thanks!
> > > > > > > –––––––––
> > > > > > > Sébastien Han
> > > > > > > Senior Principal Software Engineer, Storage Architect
> > > > > > >
> > > > > > > "Always give 100%. Unless you're giving blood."
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
