Re: ceph-volume simple disk scenario without LVM for OSD on PVC

@Alfredo, I haven't played with "simple" because the activation part
is not an issue. Yes, the sub-command sounds like an option.
@blaine, I know the logic in bash to intercept signals, we have
something that does exactly all of that in ceph-container, it's just
over-engineered in my opinion. And I don't want to rely on too much
bash in Rook.
@Sage, doing this in a new ceph-volume scenario is interesting,
although ideally, I'd like to get rid of another wrapper... Also, this
wrapper must be flexible and we must be able to change/adapt it
quickly, that won't be the case if it lives in ceph-volume because
it's in-tree Ceph (and I don't mean to resurrect another old
discussion here ^^)
@Jan, yes the biggest problem with the prototype is likely that once
it works people will ask for more. I'm not saying lvm is unsolvable,
nothing is. It's just that every time we have to work around it we add
more and more complexity/requirements to the point where I'm having a
hard time understanding the value proposition.

One option, not ideal of course, would be to have a plain block
implementation for the simple case and keep lvm for the more advanced
use cases, because all the logic is already present there.
I'm just afraid this might confuse (curious) users.

Just to reiterate, I'm currently only looking at the simplest scenario
which is the most common one: a Bluestore non-encrypted OSD with
block/db/wal on the same disk.
I'm going to investigate one more time why we need to de-activate the
LV and see if I can find another fix so we don't have to do it
explicitly. As for the signal catching, assuming I can fix the
de-activation, this can go away with fast osd shutdown coming in
Octopus.
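
To make the wrapper behaviour discussed here concrete, below is a
minimal Python sketch of the flow from the quoted message: run the
pre-steps, start the OSD as a child process, and on SIGTERM kill the
child and then run the post-steps (the LV de-activation). The function
name run_wrapped and its parameters are hypothetical and only
illustrate the pre/run/post shape; this is not how the rook binary is
actually implemented.

```python
import signal
import subprocess


def run_wrapped(pre_cmds, main_cmd, post_cmds, stop_signal=signal.SIGTERM):
    """Run pre_cmds, then main_cmd as a child, then post_cmds.

    Hypothetical helper: each command is an argv list. Returns the
    child's exit code (negative if it was killed by a signal).
    """
    # Pre-steps (e.g. activation) must succeed before the main process starts.
    for cmd in pre_cmds:
        subprocess.run(cmd, check=True)

    child = subprocess.Popen(main_cmd)

    def on_stop(_signum, _frame):
        # Mirror the behaviour described in the thread: SIGKILL the child
        # instead of letting the wrapper itself die on the signal.
        if child.poll() is None:
            child.kill()

    previous = signal.signal(stop_signal, on_stop)
    try:
        returncode = child.wait()
    finally:
        signal.signal(stop_signal, previous)
        # Post-steps (e.g. LV de-activation) run once the child is gone.
        # check=False so that every post-step is attempted.
        for cmd in post_cmds:
            subprocess.run(cmd, check=False)
    return returncode


# Hypothetical usage; the OSD id, fsid and VG/LV names are placeholders,
# and the exact invocations are an assumption, not taken from Rook:
# run_wrapped(
#     pre_cmds=[["ceph-volume", "lvm", "activate", "--no-systemd",
#                "<osd-id>", "<osd-fsid>"]],
#     main_cmd=["ceph-osd", "--foreground", "--id", "<osd-id>"],
#     post_cmds=[["lvchange", "-an", "<vg>/<lv>"]],
# )
```

The point of the sketch is that the de-activation has to live in
whatever process outlives the OSD, which is why it cannot be moved into
the OSD code itself.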

Thanks!
–––––––––
Sébastien Han
Senior Principal Software Engineer, Storage Architect

"Always give 100%. Unless you're giving blood."
On Tue, Dec 3, 2019 at 11:59 PM Jan Fajerski <jfajerski@xxxxxxxx> wrote:
>
> On Tue, Dec 03, 2019 at 05:55:25PM +0100, Sebastien Han wrote:
> >Hi,
> >
> >I've started working on a saner way to deploy OSDs with Rook so that
> >they don't use the rook binary image.
> >
> >Why were/are we using the rook binary to activate the OSD?
> >
> >A bit of background on containers first, when executing a container,
> >we need to provide a command entrypoint that will act as PID 1. So if
> >you want to do pre/post action before running the process you need to
> >use a wrapper. In Rook, that's the rook binary, which has a CLI and
> >can then "activate" an OSD.
> >Currently, this "rook osd activate" call does the following:
> >
> >* sed the lvm.conf
> >* run c-v lvm activate
> >* run the osd process
> >
> >On shutdown, we intercept the signal, "kill -9" the osd and de-activate the LV.
> >
> >I have a patch here: https://github.com/rook/rook/pull/4386, that
> >solves the initial bullet points but one thing we cannot do is the
> >signal catching and the lv de-activation.
> >Before you ask, Kubernetes has pre/post-hook but they are not
> >reliable, it's known and documented that there is no guarantee they
> >would actually run before or after the container starts/stops. We
> >tried and we had issues.
> >
> >Why do we want to stop using the rook binary for activation? Because
> >each time we get a new binary version (new operator version), this
> >will restart all the OSDs, even if the deployment spec didn't change,
> >at least if nothing else than the rook image version changed.
> >
> >Also with containers, we have seen so many issues working with LVM,
> >just to name a few:
> >
> >* adapt lvm filters
> >* interactions with udev - need to tune the lvm config, even c-v
> >itself has a built-in lvm flag to not sync with udev
> >* several bindmounts
> >* lvm package must be present on the host even if running in containers
> >* SELinux, yes lvm calls SELinux commands under the hood and pollutes
> >the logs in some scenarios
>
> I have only seen the last issue and that was a silly bug that was easily fixed.
> The others also sound like they can be fixed with reasonable effort. Is there
> anything that is technically hard to solve? It seems like dealing with config
> files and system infrastructure is just the normal pain of a deployment tool.
> >
> >Currently, one of the ways I can see this working is by not using LVM
> >when bootstrapping OSDs. Unfortunately, some of the logic cannot go in
> >the OSD code since the lv de-activation happens after the OSD stops.
> >We need to de-activate the LV so when running in the Cloud the block
> >can safely be re-attached to a new machine without LVM issues.
> >
> >I know this will be a bit challenging and might ultimately look like
> >ceph-disk but it'd be nice to consider it.
> >What about a small prototype for Bluestore with block/db/wal on the same disk?
> >
> >If this gets rejected, I might try a prototype for not using c-v in
> >Rook or something else that might come up with this discussion.
> I have discussed this before (using bluestore) and I'm happy to write/look at a
> prototype. I don't however think that this solves all those issues listed once
> one factors in the feature set that users will expect. Just thinking about
> multi-device OSDs leads to partitions (which are much more fickle to set up) and
> encryption adds a ton of complexity. And those are features that users rely on
> today.
> Not to mention that we haven't even looked at some features yet (using dm-flakey
> for testing, caching on the block layer).
>
> Is the pain of using lvm so big that it seems unsolvable? I'd be happy to
> explore if this can't be solved with lvm in the picture.
> >
> >Thanks!
> >–––––––––
> >Sébastien Han
> >Senior Principal Software Engineer, Storage Architect
> >
> >"Always give 100%. Unless you're giving blood."
> >_______________________________________________
> >Dev mailing list -- dev@xxxxxxx
> >To unsubscribe send an email to dev-leave@xxxxxxx
>
> --
> Jan Fajerski
> Senior Software Engineer Enterprise Storage
> SUSE Software Solutions Germany GmbH
> Maxfeldstr. 5, 90409 Nürnberg, Germany
> (HRB 36809, AG Nürnberg)
> Geschäftsführer: Felix Imendörffer