Dear Ceph community,

we are in the curious situation that typical orchestrator queries return wrong or outdated information about various services. E.g. `ceph orch ls` reports wrong numbers of active services, and `ceph orch ps` lists many OSDs as "starting" and many services with an old version (15.2.14, although we are on 16.2.7). The refresh times also seem way off (capital M == months?). However, the cluster is healthy (`ceph status` is happy), and spot checks of affected services with systemctl also show that they are up and OK.

We already tried the following, without success:

a) re-registering cephadm as the orchestrator backend

0|0[root@osd-1 ~]# ceph orch pause
0|0[root@osd-1 ~]# ceph orch set backend ''
0|0[root@osd-1 ~]# ceph mgr module disable cephadm
0|0[root@osd-1 ~]# ceph orch ls
Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
0|0[root@osd-1 ~]# ceph mgr module enable cephadm
0|0[root@osd-1 ~]# ceph orch set backend 'cephadm'

b) a failover of the MGR (hoping it would restart/reset the orchestrator module)

0|0[root@osd-1 ~]# ceph status | grep mgr
    mgr: osd-1(active, since 6m), standbys: osd-5.jcfyqe, osd-4.oylrhe, osd-3
0|0[root@osd-1 ~]# ceph mgr fail
0|0[root@osd-1 ~]# ceph status | grep mgr
    mgr: osd-5.jcfyqe(active, since 7s), standbys: osd-4.oylrhe, osd-3, osd-1

Is there any other way to reset the orchestrator information/connection? I have added the relevant outputs below.

I also went through the MGR logs and found an issue with querying the Docker registry. I attempted to upgrade the MGRs to 16.2.9 a few weeks ago because of a different bug, but that upgrade never went through, apparently because cephadm was not able to pull the image. Interestingly, I am able to pull the image manually with docker pull, but cephadm is not. I also get an error from `ceph orch upgrade ls` when checking the available versions. I'm not sure whether this is relevant to the orchestrator problem we have, but to be safe I have added those logs/outputs below as well.
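For reference, the manual pull I mentioned is simply the following, run directly on the host with docker (typed from memory here, so please treat it as a sketch rather than a paste); it completes fine, while the identical `/bin/docker pull` that cephadm issues (see the MGR log below) times out against quay.io:

0|0[root@osd-1 ~]# docker pull quay.io/ceph/ceph:v16.2.9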
Thank you for all your help!

Best Wishes,
Mathias

0|0[root@osd-1 ~]# ceph status
  cluster:
    id:     55633ec3-6c0c-4a02-990c-0f87e0f7a01f
    health: HEALTH_OK

  services:
    mon:           5 daemons, quorum osd-1,osd-2,osd-5,osd-4,osd-3 (age 86m)
    mgr:           osd-5.jcfyqe(active, since 21m), standbys: osd-4.oylrhe, osd-3, osd-1
    mds:           1/1 daemons up, 1 standby
    osd:           270 osds: 270 up (since 13d), 270 in (since 5w)
    cephfs-mirror: 1 daemon active (1 hosts)
    rgw:           3 daemons active (3 hosts, 2 zones)

  data:
    volumes: 1/1 healthy
    pools:   17 pools, 6144 pgs
    objects: 692.54M objects, 1.2 PiB
    usage:   1.8 PiB used, 1.7 PiB / 3.5 PiB avail
    pgs:     6114 active+clean
             29   active+clean+scrubbing+deep
             1    active+clean+scrubbing

  io:
    client:   0 B/s rd, 421 MiB/s wr, 52 op/s rd, 240 op/s wr

0|0[root@osd-1 ~]# ceph orch ls
NAME                       PORTS                   RUNNING  REFRESHED   AGE  PLACEMENT
alertmanager               ?:9093,9094                 0/1  -           8M   count:1
cephfs-mirror                                          0/1  -           5M   count:1
crash                                                  2/6  7M ago      4M   *
grafana                    ?:3000                      0/1  -           8M   count:1
ingress.rgw.default        172.16.39.131:443,1967      0/2  -           4M   osd-1
ingress.rgw.ext            172.16.39.132:443,1968      4/2  7M ago      4M   osd-5
ingress.rgw.ext-website    172.16.39.133:443,1969      0/2  -           4M   osd-3
mds.cephfs                                             2/2  9M ago      4M   count-per-host:1;label:mds
mgr                                                    5/5  9M ago      9M   count:5
mon                                                    5/5  9M ago      9M   count:5
node-exporter              ?:9100                      2/6  7M ago      7w   *
osd.all-available-devices                                0  -           5w   *
osd.osd                                                 54  <deleting>  7M   label:osd
osd.unmanaged                                          180  9M ago      -    <unmanaged>
prometheus                 ?:9095                      0/2  -           8M   count:2
rgw.cubi                                               4/0  9M ago      -    <unmanaged>
rgw.default                ?:8100                      2/1  7M ago      4M   osd-1
rgw.ext                    ?:8100                      2/1  7M ago      4M   osd-5
rgw.ext-website            ?:8200                      0/1  -           4M   osd-3

0|0[root@osd-1 ~]# ceph orch ps | grep starting | head -n 3
osd.0    osd-1  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>
osd.1    osd-2  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>
osd.10   osd-1  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>

0|0[root@osd-1 ~]# ceph orch ps | grep 15.2.14 | head -n 3
mds.cephfs.osd-1.fhmalo  osd-1  running (9M)  9M ago  9M  370M   -      15.2.14  d4c4064fa0de  f138649b2e4f
mds.cephfs.osd-2.vqanmk  osd-2  running (9M)  9M ago  9M  3119M  -      15.2.14  d4c4064fa0de  a2752217770f
osd.100                  osd-1  running (9M)  9M ago  9M  3525M  3072M  15.2.14  d4c4064fa0de  1ea3fc9c3caf

0|0[root@osd-1 ~]# cephadm version
Using recent ceph image quay.io/ceph/ceph@sha256:bb6a71f7f481985f6d3b358e3b9ef64c6755b3db5aa53198e0aac38be5c8ae54
ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)

0|0[root@osd-1 ~]# ceph versions
{
    "mon": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 5
    },
    "mgr": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 4
    },
    "osd": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 270
    },
    "mds": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
    },
    "cephfs-mirror": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 1
    },
    "rgw": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 3
    },
    "overall": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 285
    }
}
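This is the systemctl spot check mentioned above, for one of the OSDs that `ceph orch ps` lists as "starting" (unit name reconstructed here from the cluster fsid, so treat it as a sketch rather than a paste); systemd reports the unit as active/running:

0|0[root@osd-1 ~]# systemctl status ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f@osd.0.service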
From MGR logs:

Jun 29 09:00:07 osd-5 bash[9702]: debug 2022-06-29T07:00:07.046+0000 7fdd4e467700  0 [cephadm ERROR cephadm.serve] cephadm exited with an error code: 1, stderr:Pulling container image quay.io/ceph/ceph:v16.2.9...
Jun 29 09:00:07 osd-5 bash[9702]: Non-zero exit code 1 from /bin/docker pull quay.io/ceph/ceph:v16.2.9
Jun 29 09:00:07 osd-5 bash[9702]: /bin/docker: stderr Error response from daemon: Get "https://quay.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Jun 29 09:00:07 osd-5 bash[9702]: ERROR: Failed command: /bin/docker pull quay.io/ceph/ceph:v16.2.9
Jun 29 09:00:07 osd-5 bash[9702]: Traceback (most recent call last):
Jun 29 09:00:07 osd-5 bash[9702]:   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
Jun 29 09:00:07 osd-5 bash[9702]:     yield (conn, connr)
Jun 29 09:00:07 osd-5 bash[9702]:   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
Jun 29 09:00:07 osd-5 bash[9702]:     code, '\n'.join(err)))

0|0[root@osd-1 ~]# ceph orch upgrade ls
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1384, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 168, in handle_command
    return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 397, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 107, in <lambda>
    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)  # noqa: E731
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 96, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/module.py", line 1337, in _upgrade_ls
    r = raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 225, in raise_if_exception
    raise e
requests.exceptions.ConnectionError: None: Max retries exceeded with url: /v2/ceph/ceph/tags/list (Caused by None)
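PS: If an explicit refresh is supposed to clear stale orchestrator state, that would be the next thing I try; I am assuming (but am not sure) that `--refresh` forces cephadm to re-gather daemon and device state instead of serving the cached data:

0|0[root@osd-1 ~]# ceph orch ps --refresh
0|0[root@osd-1 ~]# ceph orch device ls --refresh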