> rgw, grafana, prom, haproxy, etc are all optional components.

Is this Prometheus stateful? Where is this data stored?

> Early on the team building the container images opted for a single
> image that includes all of the daemons for simplicity. We could build
> stripped down images for each daemon type, but that's an investment in
> developer time and complexity and we haven't heard any complaints
> about the container size. (Usually a few hundred MB on a large scale
> storage server isn't a problem.)

To me it looks like you do not take the containerization seriously: a
container development team that does not want to spend time on
container images. You create something that is >10x slower to start
and uses >10x more disk space (times 2 when upgrading). Haproxy is
9MB. Your osd is 350MB.

> 5. I have been writing this previously on the mailing list here. Is
> each rgw still requiring its own dedicated client id? Is it still true
> that if you want to spawn 3 rgw instances, they need to authorize like
> client.rgw1, client.rgw2 and client.rgw3?
>
> This does not allow for auto scaling. The idea of using an OC is that
> you launch a task, and that you can scale this task automatically when
> necessary. So you would get multiple instances of rgw1. If this is still
> an issue with rgw, mds and mgr etc., why even bother doing something
> with an OC and containers?
>
> The orchestrator automates the creation and cleanup of credentials for
> each rgw instance. (It also trivially scales them up/down, ala k8s.)

I do not understand this. This sounds more to me like creating a new
task instead of scaling a second instance of an existing task. Are you
currently able to automatically scale instances of an rgw up/down, or
is your statement hypothetical? I can remember talk on the mesos
mailing list/issue tracker about the difficulty of determining a
task's 'number', because tasks are being killed/started at random,
based on resource offers. Thus supplying them with the correct,
different credentials is not as trivial as it would seem. So I wonder
how you are scaling this? If there are already so many differences
between OCs, I would even reckon they differ in this area quite a lot.
So the most plausible solution would be fixing this in the rgw daemon.

> If you have an autoscaler, you just need to tell cephadm how many you
> want and it will add/remove daemons. If you are using cephadm's
> ingress (haproxy) capability, the LB configuration will be adjusted
> for you. If you are using an external LB, you can query cephadm for a
> description of the current daemons and their endpoints and feed that
> info into your own ingress solution.

Forgive me for not looking at all the video links before writing this,
but the videos I saw about cephadm were always more like a command
reference. It would be nice to show the above in a Ceph Tech Talk or
so; I think a lot of people would be interested in seeing this. (I put
a rough sketch of how I understand these commands further below.)

> 6. As I wrote before, I do not want my rgw or haproxy running in an OC
> that has the ability to give tasks the SYSADMIN capability. So that
> would mean I have to run my osd daemons/containers separately.
>
> Only the OSD containers get extra caps to deal with the storage
> hardware.

I know, and that is why I choose to run drivers that require such
SYSADMIN rights outside of my OC environment. My OC environment does
not allow any tasks to use SYSADMIN.

> Memory limits are partially implemented; we haven't gotten to CPU
> limits yet. It's on the list!
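For the record, this is roughly how I understand the scaling and
ingress part from the cephadm docs. I have not tested any of this
myself; the service id 'myrgw', the placement counts and the IP/ports
are only placeholders of mine:

  # deploy an rgw service with 3 instances; cephadm is supposed to
  # create (and later clean up) a client.rgw.* credential for each
  # daemon it places
  ceph orch apply rgw myrgw --placement="count:3"

  # scaling up/down is just re-applying with a different count
  ceph orch apply rgw myrgw --placement="count:2"

  # for an external LB: query the current rgw daemons (host, status,
  # and in recent versions the ports) and feed that into your own
  # ingress configuration
  ceph orch ps --daemon_type rgw --format json

  # or use cephadm's own ingress (haproxy/keepalived) service, via a
  # spec file such as rgw-ingress.yaml:
  #
  #   service_type: ingress
  #   service_id: rgw.myrgw
  #   placement:
  #     count: 2
  #   spec:
  #     backend_service: rgw.myrgw
  #     virtual_ip: 192.168.1.100/24
  #     frontend_port: 8080
  #     monitor_port: 1967
  #
  ceph orch apply -i rgw-ingress.yaml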
> To me it is sort of clear what the focus of the cephadm team is.
>
> I humbly contend that most users,

Hmmmm, most, most, most. Is 'most' not mostly the average? Most people
drive a Toyota, fewer drive a Porsche, and even fewer drive a Ferrari.
It is your choice who your target audience is and what you are
'selling' them.

> especially those with small
> clusters, would rather issue a single command and have the cluster
> upgrade itself--with all of the latest and often version-specific
> safety checks and any special per-release steps implemented for
> them--than to do it themselves.

The flip side to this approach is that if you guys make a mistake in
some script, lots of ceph clusters could go down. Is this not a bit of
a paradox: a team that has problems with its own software dependencies
(ceph-ansible/ceph-deploy?) is the one I should blindly trust to
script the upgrade of my cluster?

I know I have been very critical/sceptical about this cephadm. Please
do also note that I just love this ceph storage, and I am advertising
it whenever possible. So a big thanks to the whole team still!!!
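PS: for completeness, the 'single command' upgrade being referred to
would, as far as I understand the cephadm docs, look something like
the below. I have not run this myself, and the version number is just
an example:

  # start a rolling upgrade of the whole cluster to a given release
  ceph orch upgrade start --ceph-version 16.2.7

  # watch the progress / see which daemons have been upgraded
  ceph orch upgrade status

  # pause or abort the upgrade if something looks wrong
  ceph orch upgrade pause
  ceph orch upgrade stop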