Hello, wall of text warning! :-)

I am playing with a new 5-node cluster install (actually 4 nodes for now, since the fifth server isn't started yet), reading and following the docs and watching what happens. Rather strangely, multiple things went wrong that almost never happened with Ceph in the past many years, and I feel cephadm is still not quite ready for prime time. I considered opening issues, but maybe it's better to separate user errors from real problems first, so I depend on your expertise and help.

I decided to use Pacific and podman 3.x. The first obstacle was the mismatching documentation, but that's already https://tracker.ceph.com/issues/55053 and I have resolved it by manual fiddling, so I ought to be on real Pacific now, fingers crossed.

A small problem in the first stages of installing Ceph on Debian stable (bullseye): the Ceph repo keys are probably in the old format, so they aren't automagically recognised and need a manual `apt-key add`. The requirements also neglect to mention a few small things, e.g. that the examples need curl, and I am not sure whether the missing firewalld caused some of the pain later (at least I have seen complaints and errors mentioning it, but maybe those are to be ignored).

I think the docs don't specifically mention that hostnames (especially when using bare hostnames) should resolve into the mon network prefix, which is important if the host has a management subnet separate from the mon and OSD subnets. Nevertheless, the default install of Prometheus and the dashboard seems to be broken, because it tries to use IPs instead of hostnames and fails on the SSL certs not containing the IPs (as they shouldn't, really); this needs a bit of googling and manual setting of URLs and such, otherwise the dashboard is full of 500 errors without any explanation (the errors only get logged to syslog, in a very ugly and pretty verbose way).

My main problems started, however, when adding new hosts. Everything was connectable [ssh keys] and the requirements were fulfilled (apart from the mentioned firewalld), except that some modifications _suggested_ a reboot, which I didn't follow on two hosts (out of the 4 online). The result: two working servers (the master and the one I rebooted for a different reason) and two non-working ones. "Non-working" means a state where the master thinks they have mon/crash/vol/mgr running, but nothing was actually running because podman died with a mysterious error; for the record:

ceph: stderr Error: cannot open sd-bus: No such file or directory: OCI not found

As it turned out, it was caused by not rebooting. After a reboot the daemons were started, except they didn't join the cluster. Also, `ceph orch` had little effect on their behaviour, while cephadm locally seemed able to act on them, starting or stopping daemons, without much use (so it isn't a connectivity issue).

I have tried to `ceph orch host (drain|rm)` them, which more or less succeeded. It left the daemons in a "stopping" state in the master's view... which still blocks rm. I removed them manually, with success; then `host rm` worked. Adding back the hosts (`ceph orch host add <name> <ip> _admin`) resulted in all the daemons showing "starting" in the master's view, while visibly running on the host and not joining the cluster. It stays "starting" forever.
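For reference, the remove-and-re-add cycle looked roughly like this (the IP is a placeholder, and I'm not claiming this is the blessed procedure, just what I pieced together from the docs):

ceph orch host drain olhado             # schedule removal of all daemons on the host
ceph orch ps olhado                     # supposed to empty out; mine got stuck in "stopping"
ceph orch host rm olhado                # only worked after removing the stuck daemons by hand
ceph orch host add olhado <ip> _admin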
This is how it looks on the master:

root@alai:~# ceph orch ps olhado --refresh
NAME                  HOST    PORTS   STATUS    REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID
crash.olhado          olhado          starting  -          -    -        -        <unknown>  <unknown>
mon.olhado            olhado          starting  -          -    -        2048M    <unknown>  <unknown>
node-exporter.olhado  olhado  *:9100  starting  -          -    -        -        <unknown>  <unknown>

The next problem is finding the logs. It used to be so easy: go to /var/log/ceph/ and find ceph-mon.0.log and the like. Not anymore. I guess the mon logs into syslog under the generic name "conmon", probably mixed up with the other daemons? (I have seen in the docs that I could force it to create log files again, but I didn't feel like yet another reconfiguration.) Anyway, I see no error from the mon in the host's syslog; it said:

Mar 25 11:55:40 olhado conmon[33551]: debug 2022-03-25T10:55:40.083+0000 7fa24f6b0700 1 mon.olhado@-1(synchronizing) e2 sync_obtain_latest_monmap
Mar 25 11:55:40 olhado conmon[33551]: debug 2022-03-25T10:55:40.083+0000 7fa24f6b0700 1 mon.olhado@-1(synchronizing) e2 sync_obtain_latest_monmap obtained monmap e2

and the master says, every second:

Mar 25 14:19:24 alai conmon[822444]: debug 2022-03-25T13:19:24.729+0000 7f09c18df700 1 mon.alai@0(leader) e2 adding peer [v2:***:3300/0,v1:***:6789/0] to list of hints

but it's not in `ceph status` or `ceph mon dump` on the master (and a local `ceph` on that host waits forever; it's really hard to tell why, since it seems to be communicating with the master's mon).

In due course I also tried putting them into maintenance mode, since I was trying to get rid of the "starting" state, first using `ceph orch` and then the local cephadm. These two do not seem to communicate well, which is aptly demonstrated here:

root@alai:~# ceph health detail
HEALTH_WARN 1 host is in maintenance mode
[WRN] HOST_IN_MAINTENANCE: 1 host is in maintenance mode
    olhado is in maintenance
root@alai:~# ceph orch host maintenance exit olhado
Error EINVAL: Host olhado is not in maintenance mode

I guess I need to find a combination where cephadm and `ceph orch` see the same state, probably enabling maintenance with cephadm and then trying to remove it with `ceph orch`. I also had a state where `ceph orch *` simply waited forever, and it was only resolvable by stopping ceph-volume on the host using `podman stop`. I was not able to find any log or debug output which would have explained what was happening and why.

Generally my problem is that I don't (yet?) see a simple way to tell what is happening:
- when ceph should deploy something automagically but doesn't;
- when cephadm/orch say they are "starting" or "stopping" something and it doesn't change;
- where the daemon logs are and how to follow them easily (right now I'm using `podman logs ...` but I am not sure that's the proper way);
- whether `ceph orch host rm` _really_ removes the host so it can be added back later, or whether it needs manual deletion of something. It seems it does, since the aforementioned maintenance mode seems to have stayed throughout removal.

I will scrap the whole thing soon; there is no harm done, apart from the time spent watching (or rather guessing) what cephadm does. I wonder whether some of these are bugs to be fixed (either in cephadm or in the documentation) or whether they are all preventable user errors.

Sorry for the wall of text.

Thanks,
Peter
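PS: Regarding the logs, the cephadm docs suggest that the classic file logging can be switched back on, and that a single daemon's journal can be followed without going through podman directly. I haven't tried either on this cluster yet, so treat this as a sketch taken from the docs rather than something I have verified:

ceph config set global log_to_file true                  # classic log files under /var/log/ceph/<fsid>/ on each host
ceph config set global mon_cluster_log_to_file true
ceph config set global log_to_stderr false               # optional: stop duplicating everything into the journal/syslog
ceph config set global mon_cluster_log_to_stderr false

cephadm logs --name mon.olhado                           # on the host; roughly journalctl for the daemon's systemd unit

If that works as advertised, it would at least answer my own question about following the mon log without `podman logs`.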