Hi,

I figured I should follow up on this discussion, not with the intention of bashing any particular solution, but to point to at least one current major challenge with cephadm. As I wrote earlier in the thread, we previously found it ... challenging to debug things running in cephadm.

Earlier this week it appears we too were hit by the bug where cephadm removes monitors from the monmap ( https://tracker.ceph.com/issues/51027 ) if the node is rebooted. Presently our cluster is offline, because there's still no fix, and every single piece of documentation for things like monmaptool appears to assume it's running natively, not through cephadm. There's also the additional fragility that all the "ceph orch" commands themselves stop working (even a simple status request just hangs) if the ceph cluster itself is down. I suspect we'll find ways around that, but on reflection I have a few thoughts:

1. It is significantly harder than one thinks to develop a stable orchestration environment. We've been happy with both Salt & Ansible, but on balance cephadm appears quite fragile - and I'm not sure it will ever be realistic to invest the amount of work required to make it as stable. There are of course many advantages to having something closely tied to the specific solution (Ceph) - but in hindsight that seems to only have been an advantage in sunny weather. Once the service itself is down, I think it is a clear & major drawback that suddenly your orchestrator also stops responding. Long-term, if cephadm is the solution, I think it's important that it works even when the Ceph services themselves are down.

2. I think Ceph - in particular the documentation - suffers from too many different ways of doing things (raw packages, or Rook, or cephadm, which in turn can use either Docker or Podman, etc.), which again is a pain the second you need to debug or fix anything. If the decision is that cephadm is the way things should work, so be it, but then all documentation has to actually reflect how to do things in a cephadm environment (and not e.g. assume all the containers are running so you can log in to the right container first). How do you extract a monmap in a cephadm cluster, for instance? Just following the default documentation produces errors. (One possible approach is sketched at the very end of this mail, below the quoted thread.) Presently I feel the short-term solution has been to allow multiple different ways of doing things. As a developer I can understand that, but as a user it's a nightmare unless somebody takes the time to properly update all documentation with two (or more) choices describing how to do things (a) natively, or (b) in a cephadm cluster.

Again, this is meant as hopefully constructive feedback rather than complaints. But the feeling I get - after fairly smooth operations with raw packages (including fixing previous bugs leading to severe crashes) and lately grinding our teeth a bit over cephadm - is that it has helped automate a bunch of stuff that wasn't particularly difficult (it's nice to issue an update with a single command, but it works perfectly fine manually too) at the cost of making it WAY more difficult to fix things (not to mention simply get information about the cluster) when we have problems - and in the long run that's not a trade-off I'm entirely happy with :-)

Cheers,

Erik

On Tue, Jun 29, 2021 at 1:25 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Fri, Jun 25, 2021 at 10:27 AM Nico Schottelius
> <nico.schottelius@xxxxxxxxxxx> wrote:
> >
> > Hey Sage,
> >
> > Sage Weil <sage@xxxxxxxxxxxx> writes:
> > > Thank you for bringing this up.
> > > This is in fact a key reason why the
> > > orchestration abstraction works the way it does--to allow other
> > > runtime environments to be supported (FreeBSD!
> > > sysvinit/Devuan/whatever for systemd haters!)
> >
> > I would like you to stop labeling people who have reasons for not using
> > a specific software as haters.
> >
> > It is not productive to call Ceph developers "GlusterFS haters", nor to
> > call Redhat users Debian haters.
> >
> > It is simple not an accurate representation.
>
> You're right, and I apologize. My intention was to point out that we
> tried to keep the door open to everyone, even those who might be
> called "haters", but I clearly missed the mark.
>
> sage
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>

--
Erik Lindahl <erik.lindahl@xxxxxxxxx>
Professor of Biophysics, Dept. Biochemistry & Biophysics, Stockholm University
Science for Life Laboratory, Box 1031, 17121 Solna, Sweden

Note: I frequently do email outside office hours because it is a convenient time for me to write, but please do not interpret that as an expectation for you to respond outside your work hours.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
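
For the monmap question above, the following is only a sketch of what appears to be the cephadm way of doing it, pieced together from the tools themselves rather than from any official recovery guide. The daemon name (mon.host1), the monitor address, and the exact "cephadm unit" syntax are placeholders/assumptions and may differ between releases.

  # On the mon host: stop the containerized mon first (if this syntax
  # differs in your release, systemctl on the ceph-<fsid>@mon.<id>
  # unit is the fallback).
  cephadm unit --name mon.host1 stop

  # Open a shell in a container that should have this daemon's config,
  # keyring and data directory mounted in the usual (native) locations.
  cephadm shell --name mon.host1

  # Inside that shell the "native" documentation should apply again:
  ceph-mon -i host1 --extract-monmap /tmp/monmap
  monmaptool --print /tmp/monmap

  # If a monitor was dropped from the map, re-add it and inject the
  # modified map back (name and address are placeholders):
  monmaptool --add host1 10.0.0.1:6789 /tmp/monmap
  ceph-mon -i host1 --inject-monmap /tmp/monmap

  # Exit the shell and start the mon again:
  cephadm unit --name mon.host1 start

The idea behind "cephadm shell --name <daemon>" is that it starts a container with that daemon's configuration and data directory mounted where the native tools expect them, so the existing ceph-mon/monmaptool documentation should work unchanged inside it - provided the container image is still available locally while the cluster is down.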