Hey folks,

I'm working through some basic ops drills and noticed what I think is an inconsistency in the cephadm docs. Some Googling suggests this is a known thing, but I haven't found clear direction on a fix yet.

On a cluster with 5 mons, 2 were abruptly removed when their host OS decided to do scheduled maintenance without asking first. Those hosts only ran mons (plus mds/crash/node-exporter), so I still have a 3-mon quorum and the cluster is happy. It's not clear to me how to add those hosts back in as mons, though.

The troubleshooting docs describe bringing all mons down and then extracting a monmap. I tried this through various iterations: bringing all mons down, then bringing one back up and entering its container; bringing all mons down and trying to run ceph-mon from a cephadm shell; and so on. I either got rocksdb lock errors, presumably because a mon was still running, or an error that the path to the mon data didn't exist, presumably for the opposite reason. Is there guidance on the container-friendly way to perform this monmap maintenance?
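For context, this is roughly the shape of what I was attempting. The mon ID is from my cluster, the --extract-monmap/--inject-monmap flags are from the ceph-mon man page, and I'm assuming that "cephadm shell --name mon.<host>" exposes that mon's data directory at the usual /var/lib/ceph/mon/ceph-<host> path inside the container -- that last assumption is exactly the part I can't seem to get right:

  # on every mon host, stop the mon's systemd unit (cephadm naming convention)
  systemctl stop ceph-<fsid>@mon.<host>.service

  # on one surviving mon host, open a shell scoped to that mon daemon
  cephadm shell --name mon.kida

  # inside the container: extract the monmap, inspect/edit it, then inject it back
  ceph-mon -i kida --extract-monmap /tmp/monmap
  monmaptool --print /tmp/monmap
  monmaptool /tmp/monmap --add amnesiac <mon-addr>   # placeholder address, just to illustrate the edit
  ceph-mon -i kida --inject-monmap /tmp/monmap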
I did think that, because I still have quorum, I could simply do ceph orch apply mon label:mon instead, but I'm nervous this might upset my remaining mons.

Looking at the ceph orch ls output I see:

root@kida:/# ceph orch ls
NAME                       PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager                      1/1      7m ago     2h   count:1
crash                             5/5      9m ago     2h   *
grafana                           1/1      7m ago     2h   count:1
mds.media                         3/3      9m ago     2h   thebends;okcomputer;amnesiac
mgr                               2/2      9m ago     2h   count:2
mon                               3/5      9m ago     2h   label:mon
node-exporter                     5/5      9m ago     2h   *
osd.all-available-devices         5/10     9m ago     2h   *
prometheus                        1/1      7m ago     2h   count:1
root@kida:/#

So is it expecting 2 more mons, or has it cleverly autoscaled down?

Looking at ceph orch ps I see:

root@kida:/# ceph orch ps
NAME                         HOST         PORTS        STATUS         REFRESHED  AGE  VERSION    IMAGE ID      CONTAINER ID
alertmanager.kida            kida         *:9093,9094  running (2h)   8m ago     2h   0.20.0     0881eb8f169f  89c604455194
crash.amnesiac               amnesiac                  running (11h)  8m ago     11h  16.2.4     8d91d370c2b8  bff086c930db
crash.kida                   kida                      running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  b0ac059be109
crash.kingoflimbs            kingoflimbs               running (13h)  8m ago     13h  16.2.4     8d91d370c2b8  b0955309a8b9
crash.okcomputer             okcomputer                running (2h)   10m ago    2h   16.2.4     8d91d370c2b8  a75cf65ef235
crash.thebends               thebends                  running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  befe9c1015f3
grafana.kida                 kida         *:3000       running (2h)   8m ago     2h   6.7.4      ae5c36c3d3cd  f85747138299
mds.media.amnesiac.uujwlk    amnesiac                  running (11h)  8m ago     2h   16.2.4     8d91d370c2b8  512a2fcc0f97
mds.media.okcomputer.nednib  okcomputer                running (2h)   10m ago    2h   16.2.4     8d91d370c2b8  10c6244a9308
mds.media.thebends.pqsfeb    thebends                  running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  c1b75831a973
mgr.kida.kchysa              kida         *:9283       running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  602acc0d8df3
mgr.okcomputer.rjtrqw        okcomputer   *:8443,9283  running (2h)   10m ago    2h   16.2.4     8d91d370c2b8  605a8a25a604
mon.amnesiac                 amnesiac                  stopped        8m ago     2h   <unknown>  <unknown>     <unknown>
mon.kida                     kida                      running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  a441563a978d
mon.kingoflimbs              kingoflimbs               stopped        8m ago     2h   <unknown>  <unknown>     <unknown>
mon.okcomputer               okcomputer                running (2h)   10m ago    2h   16.2.4     8d91d370c2b8  c4297efafe27
mon.thebends                 thebends                  running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  e2394d5f152b
node-exporter.amnesiac       amnesiac     *:9100       running (11h)  8m ago     2h   0.18.1     e5a616e4b9cf  da3c69057c4f
node-exporter.kida           kida         *:9100       running (2h)   8m ago     2h   0.18.1     e5a616e4b9cf  5c9219a29257
node-exporter.kingoflimbs    kingoflimbs  *:9100       running (13h)  8m ago     2h   0.18.1     e5a616e4b9cf  c2236491fb6e
node-exporter.okcomputer     okcomputer   *:9100       running (2h)   10m ago    2h   0.18.1     e5a616e4b9cf  2e53a82eed32
node-exporter.thebends       thebends     *:9100       running (2h)   8m ago     2h   0.18.1     e5a616e4b9cf  def6bdd359d6
osd.0                        kida                      running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  c1419a29ddd8
osd.1                        kida                      running (85m)  8m ago     2h   16.2.4     8d91d370c2b8  dcb172c628ec
osd.2                        thebends                  running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  4826e3da8d14
osd.3                        okcomputer                running (2h)   10m ago    2h   16.2.4     8d91d370c2b8  5424d437c270
osd.4                        thebends                  running (2h)   8m ago     2h   16.2.4     8d91d370c2b8  47e682c3727d
prometheus.kida              kida         *:9095       running (2h)   8m ago     2h   2.18.1     de242295e225  4c8e7fdd89a8
root@kida:/#

So those mon containers are still there, stopped.

ceph orch daemon restart mon.amnesiac gives notice that a restart is scheduled for that mon. Its status in ceph orch ps then updates to running, but the version, image ID and container ID stay <unknown>, and I don't see that mon in any status output or log.

cephadm unit --name mon.amnesiac restart --fsid yadda-yadda-yadda errors with "daemon not found"; it seems the cephadm CLI is scoped to the daemons running on the host it's executed on, rather than cluster-wide like ceph orch.

Any clues to further the investigation are welcome.

Best regards
Phil

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx