Re: Why you might want packages not containers for Ceph deployments

"Fox, Kevin M" <Kevin.Fox@xxxxxxxx> · Thu, 24 Jun 2021 21:27:51 +0000

I've actually had rook-ceph refuse to proceed with something that I would have continued with. Turns out I was wrong and it was right. Its checking was more thorough than mine. Thought that was pretty cool. It eventually cleared itself and finished up.

For a large ceph cluster, the orchestration is very nice.

Thanks,
Kevin

________________________________________
From: Sage Weil <sage@xxxxxxxxxxxx>
Sent: Thursday, June 24, 2021 1:46 PM
To: Marc
Cc: Anthony D'Atri; Nico Schottelius; Matthew Vernon; ceph-users@xxxxxxx
Subject:  Re: Why you might want packages not containers for Ceph deployments

On Sun, Jun 20, 2021 at 9:51 AM Marc <Marc@xxxxxxxxxxxxxxxxx> wrote:
> Remarks about your cephadm approach/design:
>
> 1. I am not interested in learning podman, rook or kubernetes. I am using mesos, which also runs on my osd nodes to use the extra available memory and cores. Furthermore, your cephadm OC is limited to ceph nodes only, while my mesos OC is spread across a larger cluster and has rules for when, and when not, to run tasks on the osd nodes. You incorrectly assume that rgw, grafana, prometheus and haproxy are going to be run on your ceph OC.

rgw, grafana, prom, haproxy, etc are all optional components.  The
monitoring stack is deployed by default but is trivially disabled via
a flag to the bootstrap command.  We are well aware that not everyone
wants these, but we cannot ignore the vast majority of users that
want things to Just Work without figuring out how to properly deploy
and manage all of these extraneous integrated components.
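
For reference, a minimal sketch of what that bootstrap invocation can look like. The wrapper function name and the example address are placeholders and not from this thread; `--mon-ip` and `--skip-monitoring-stack` are options of `cephadm bootstrap`, but check `cephadm bootstrap --help` on your release before relying on them.

```shell
# Hypothetical wrapper; the function name is illustrative only.
# $1 is the IP address of the first monitor host.
bootstrap_without_monitoring() {
    # --skip-monitoring-stack tells bootstrap not to deploy
    # prometheus, grafana, alertmanager and node-exporter.
    cephadm bootstrap --mon-ip "$1" --skip-monitoring-stack
}

# Usage (192.0.2.10 is a documentation-range placeholder address):
#   bootstrap_without_monitoring 192.0.2.10
```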

> 2. Nico pointed out that you do not have alpine linux container images. I did not even know you were using container images. So how big are these? Where are these stored? And why are these not as small as they can be? Such an osd container image should be 20MB or so at most. I would even expect a statically built binary container image; why even a tiny os?
> 4. Ok found the container images[2] (I think). Sorry but this has ‘nothing’ to do with container thinking. I expected to find container images for osd, mds, rgw separately and smaller. This looks more like an OS deployment.

Early on the team building the container images opted for a single
image that includes all of the daemons for simplicity.  We could build
stripped down images for each daemon type, but that's an investment in
developer time and complexity and we haven't heard any complaints
about the container size.  (Usually a few hundred MB on a large scale
storage server isn't a problem.)

> 3. Why is systemd still being talked about in this cephadm? Your orchestrator should handle restarts, namespaces and failed tasks, no? There should be no need for a systemd dependency; at least I have not seen any container images relying on this.

Something needs to start the ceph daemon containers when the system
reboots.  We integrated with systemd since all major distros adopted
it.  Cephadm could be extended to support other init systems with
pretty minimal effort... we aren't doing anything fancy with systemd.

> 5. I have been writing this previously on the mailing list here. Is each rgw still requiring its own dedicated client id? Is it still true, that if you want to spawn 3 rgw instances, they need to authorize like client.rgw1, client.rgw2 and client.rgw3?
> This does not allow for auto scaling. The idea of using an OC is that you launch a task, and that you can scale this task automatically when necessary. So you would get multiple instances of rgw1. If this is still an issue with rgw, mds, mgr etc., why even bother doing something with an OC and containers?

The orchestrator automates the creation and cleanup of credentials for
each rgw instance.  (It also trivially scales them up/down, a la k8s.)
If you have an autoscaler, you just need to tell cephadm how many you
want and it will add/remove daemons.  If you are using cephadm's
ingress (haproxy) capability, the LB configuration will be adjusted
for you.  If you are using an external LB, you can query cephadm for a
description of the current daemons and their endpoints and feed that
info into your own ingress solution.
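
As a rough sketch of the two operations described above (setting the desired rgw count and querying the current daemons), assuming a Pacific-era (16.2.x) orchestrator CLI. The function names and the service id `myrgw` are hypothetical, and the exact spelling of the `ceph orch ps` filter option may differ between releases.

```shell
# Hypothetical helpers; names are illustrative only.

# Ask cephadm for $2 rgw daemons in service $1. cephadm then adds or
# removes daemons to match the requested count.
# Note: re-applying a service this way only sets what is given here;
# a real deployment would more likely re-apply its full service spec
# with an updated count.
scale_rgw() {
    ceph orch apply rgw "$1" --placement="$2"
}

# Dump the current rgw daemons as JSON, e.g. as input for an
# external load balancer's configuration.
list_rgw_daemons() {
    ceph orch ps --daemon_type rgw --format json
}

# Usage:
#   scale_rgw myrgw 3
#   list_rgw_daemons
```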

> 6. As I wrote before I do not want my rgw or haproxy running in an OC that has the ability to give tasks capability SYSADMIN. So that would mean I have to run my osd daemons/containers separately.

Only the OSD containers get extra caps to deal with the storage hardware.

> 7. If you are not setting cpu and memory limits on your cephadm containers, then again there is an argument why even use containers.

Memory limits are partially implemented; we haven't gotten to CPU
limits yet.  It's on the list!

> 8. I still see lots of comments on the mailing list about accessing logs. I have all my containers log to a remote syslog server. If your ceph daemons still cannot do this (correctly), what is even the point of going to containers?

By default, we log to stderr and your logs are in journald or whatever
alternative your container runtime has set up.  You can trivially flip
a switch and you get traditional file-based logs with a logrotate.d
config, primarily to satisfy users (like me!) who aren't comfortable
with the newfangled log management style.
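
A minimal sketch of that switch, assuming the `log_to_file` and `mon_cluster_log_to_file` config options described in the cephadm documentation. The function name is hypothetical.

```shell
# Hypothetical helper; the function name is illustrative only.
# Turns on traditional file-based logging for all daemons.
enable_file_logs() {
    # Daemon logs go to files instead of only stderr/journald.
    ceph config set global log_to_file true
    # The cluster log kept by the monitors is also written to a file.
    ceph config set global mon_cluster_log_to_file true
}

# Usage:
#   enable_file_logs
```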

> 9. I am updating my small cluster something like this:
>
> ssh root@c01 "ceph osd set noout  ; ceph osd set noscrub ; ceph osd set nodeep-scrub"
> ssh root@c01 "ceph tell osd.* injectargs '--osd_max_scrubs=0'"
>
> ssh root@c01 "yum update 'ceph-*' -y"
> ...
>
> ssh root@c01 "service ceph-mon@a restart"
> ...
>
> ssh root@c01 "service ceph-mgr@a restart"
> ...
>
> # wait for up and recovery to finish
> ssh root@c01  "systemctl restart 'ceph-osd@*'"
> …
>
> I am never going to run a ‘ceph orch upgrade start --ceph-version 16.2.0’. I want to see if everything is ok after each command I issue. I want to see if scrubbing stopped, I want to see if osd have correctly accepted the new config.
> I have a small cluster so I do not see this procedure as a waste of time. If I look at your telemetry data[3], I see 600 clusters with 35k osd’s, that is an average of 60 osd per cluster. So these are quite small clusters, I would think these admins have a similar point of view as I have.
>
> That leaves these big clusters of >3k osd’s. I wonder what these admins require, are they at CERN really waiting for something like cephadm?

I humbly contend that most users, especially those with small
clusters, would rather issue a single command and have the cluster
upgrade itself--with all of the latest and often version-specific
safety checks and any special per-release steps implemented for
them--than to do it themselves.
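
For readers who want the single command but still want to watch each stage, a small sketch of the orchestrator's upgrade subcommands. The function names are hypothetical; the version number is the one quoted above.

```shell
# Hypothetical helpers; names are illustrative only.

# Start an orchestrated upgrade to the Ceph version given as $1.
start_upgrade() {
    ceph orch upgrade start --ceph-version "$1"
}

# Report what the upgrade is currently doing.
upgrade_status() {
    ceph orch upgrade status
}

# Pause the upgrade to inspect the cluster, then continue.
pause_upgrade() {
    ceph orch upgrade pause
}

resume_upgrade() {
    ceph orch upgrade resume
}

# Usage:
#   start_upgrade 16.2.0
#   upgrade_status
#   pause_upgrade      # check "ceph -s", scrub state, etc.
#   resume_upgrade
```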

sage
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx