We noticed that our DNS settings were inconsistent and partially wrong. NetworkManager had somehow set useless nameservers in /etc/resolv.conf on our hosts, and in particular the DNS settings inside the MGR containers needed fixing as well.

I fixed /etc/resolv.conf on our hosts and in the container of the active MGR daemon. This resolved all the issues I described, including the output of `ceph orch ps` and `ceph orch ls` as well as the registry queries behind `docker pull` and `ceph orch upgrade ls`. Afterwards I was able to upgrade to Quincy, and as far as I can tell the newly deployed MGR containers picked up the proper DNS settings from the hosts.
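For reference, the check and fix were roughly along the lines of the sketch below. This is not a copy of our exact commands: it assumes Docker as the container engine (as in the logs quoted further down), and both the MGR container name and the nameserver IP are placeholders.

  # Check the DNS settings on the host and inside the active MGR container
  # ("<mgr-container>" is a placeholder; the real name shows up in docker ps):
  cat /etc/resolv.conf
  docker ps --format '{{.Names}}' | grep mgr
  docker exec <mgr-container> cat /etc/resolv.conf

  # Fix /etc/resolv.conf on the host with your usual tooling, then put a
  # working nameserver into the running MGR container as well
  # (this replaces the file; "<dns-ip>" is a placeholder):
  docker exec <mgr-container> sh -c 'echo "nameserver <dns-ip>" > /etc/resolv.conf'

  # Optionally, a MGR failover (as in the quoted mail below) restarts the
  # orchestrator module:
  ceph mgr fail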
Best,
Mathias

On 6/29/2022 10:45 AM, Mathias Kuhring wrote:
> Dear Ceph community,
>
> we are in the curious situation that typical orchestrator queries
> provide wrong or outdated information about different services.
> E.g. `ceph orch ls` reports wrong numbers of active services.
> Or `ceph orch ps` reports many OSDs as "starting" and many services
> with an old version (15.2.14, but we are on 16.2.7).
> Also the refresh times seem way off (capital M == months?).
> However, the cluster is healthy (`ceph status` is happy).
> And spot checks of affected services with systemctl also show
> that they are up and OK.
>
> We already tried the following without success:
>
> a) re-registering cephadm as orchestrator backend
>
> 0|0[root@osd-1 ~]# ceph orch pause
> 0|0[root@osd-1 ~]# ceph orch set backend ''
> 0|0[root@osd-1 ~]# ceph mgr module disable cephadm
> 0|0[root@osd-1 ~]# ceph orch ls
> Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
> 0|0[root@osd-1 ~]# ceph mgr module enable cephadm
> 0|0[root@osd-1 ~]# ceph orch set backend 'cephadm'
>
> b) a failover of the MGR (hoping it would restart/reset the orchestrator module)
>
> 0|0[root@osd-1 ~]# ceph status | grep mgr
>     mgr: osd-1(active, since 6m), standbys: osd-5.jcfyqe, osd-4.oylrhe, osd-3
> 0|0[root@osd-1 ~]# ceph mgr fail
> 0|0[root@osd-1 ~]# ceph status | grep mgr
>     mgr: osd-5.jcfyqe(active, since 7s), standbys: osd-4.oylrhe, osd-3, osd-1
>
> Is there any other way to somehow reset the orchestrator information/connection?
> I added the relevant outputs below.
>
> I also went through the MGR logs and found an issue with querying the
> docker repos.
> I attempted to upgrade the MGRs to 16.2.9 a few weeks ago due to a
> different bug, but this upgrade never went through, apparently because
> cephadm was not able to pull the image.
> Interestingly, I am able to pull the image manually with docker pull,
> but cephadm is not.
> I also get an error with `ceph orch upgrade ls` when checking on
> available versions.
> I'm not sure if this is relevant to the orchestrator problem we have,
> but to be safe I added the logs/output below as well.
>
> Thank you for all your help!
>
> Best Wishes,
> Mathias
>
>
> 0|0[root@osd-1 ~]# ceph status
>   cluster:
>     id:     55633ec3-6c0c-4a02-990c-0f87e0f7a01f
>     health: HEALTH_OK
>
>   services:
>     mon:           5 daemons, quorum osd-1,osd-2,osd-5,osd-4,osd-3 (age 86m)
>     mgr:           osd-5.jcfyqe(active, since 21m), standbys: osd-4.oylrhe, osd-3, osd-1
>     mds:           1/1 daemons up, 1 standby
>     osd:           270 osds: 270 up (since 13d), 270 in (since 5w)
>     cephfs-mirror: 1 daemon active (1 hosts)
>     rgw:           3 daemons active (3 hosts, 2 zones)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   17 pools, 6144 pgs
>     objects: 692.54M objects, 1.2 PiB
>     usage:   1.8 PiB used, 1.7 PiB / 3.5 PiB avail
>     pgs:     6114 active+clean
>              29   active+clean+scrubbing+deep
>              1    active+clean+scrubbing
>
>   io:
>     client: 0 B/s rd, 421 MiB/s wr, 52 op/s rd, 240 op/s wr
>
> 0|0[root@osd-1 ~]# ceph orch ls
> NAME                       PORTS                    RUNNING  REFRESHED   AGE  PLACEMENT
> alertmanager               ?:9093,9094                  0/1  -           8M   count:1
> cephfs-mirror                                           0/1  -           5M   count:1
> crash                                                   2/6  7M ago      4M   *
> grafana                    ?:3000                       0/1  -           8M   count:1
> ingress.rgw.default        172.16.39.131:443,1967       0/2  -           4M   osd-1
> ingress.rgw.ext            172.16.39.132:443,1968       4/2  7M ago      4M   osd-5
> ingress.rgw.ext-website    172.16.39.133:443,1969       0/2  -           4M   osd-3
> mds.cephfs                                              2/2  9M ago      4M   count-per-host:1;label:mds
> mgr                                                     5/5  9M ago      9M   count:5
> mon                                                     5/5  9M ago      9M   count:5
> node-exporter              ?:9100                       2/6  7M ago      7w   *
> osd.all-available-devices                                 0  -           5w   *
> osd.osd                                                  54  <deleting>  7M   label:osd
> osd.unmanaged                                           180  9M ago      -    <unmanaged>
> prometheus                 ?:9095                       0/2  -           8M   count:2
> rgw.cubi                                                4/0  9M ago      -    <unmanaged>
> rgw.default                ?:8100                       2/1  7M ago      4M   osd-1
> rgw.ext                    ?:8100                       2/1  7M ago      4M   osd-5
> rgw.ext-website            ?:8200                       0/1  -           4M   osd-3
>
> 0|0[root@osd-1 ~]# ceph orch ps | grep starting | head -n 3
> osd.0   osd-1  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>
> osd.1   osd-2  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>
> osd.10  osd-1  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>
>
> 0|0[root@osd-1 ~]# ceph orch ps | grep 15.2.14 | head -n 3
> mds.cephfs.osd-1.fhmalo  osd-1  running (9M)  9M ago  9M  370M   -      15.2.14  d4c4064fa0de  f138649b2e4f
> mds.cephfs.osd-2.vqanmk  osd-2  running (9M)  9M ago  9M  3119M  -      15.2.14  d4c4064fa0de  a2752217770f
> osd.100                  osd-1  running (9M)  9M ago  9M  3525M  3072M  15.2.14  d4c4064fa0de  1ea3fc9c3caf
>
> 0|0[root@osd-1 ~]# cephadm version
> Using recent ceph image
> quay.io/ceph/ceph@sha256:bb6a71f7f481985f6d3b358e3b9ef64c6755b3db5aa53198e0aac38be5c8ae54
> ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
>
> 0|0[root@osd-1 ~]# ceph versions
> {
>     "mon": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 5
>     },
>     "mgr": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 4
>     },
>     "osd": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 270
>     },
>     "mds": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
>     },
>     "cephfs-mirror": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 1
>     },
>     "rgw": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 3
>     },
>     "overall": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 285
>     }
> }
>
> From MGR logs:
>
> Jun 29 09:00:07 osd-5 bash[9702]: debug 2022-06-29T07:00:07.046+0000 7fdd4e467700  0 [cephadm ERROR cephadm.serve] cephadm exited with an error code: 1, stderr:Pulling container image quay.io/ceph/ceph:v16.2.9...
> Jun 29 09:00:07 osd-5 bash[9702]: Non-zero exit code 1 from /bin/docker pull quay.io/ceph/ceph:v16.2.9
> Jun 29 09:00:07 osd-5 bash[9702]: /bin/docker: stderr Error response from daemon: Get "https://quay.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
> Jun 29 09:00:07 osd-5 bash[9702]: ERROR: Failed command: /bin/docker pull quay.io/ceph/ceph:v16.2.9
> Jun 29 09:00:07 osd-5 bash[9702]: Traceback (most recent call last):
> Jun 29 09:00:07 osd-5 bash[9702]:   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
> Jun 29 09:00:07 osd-5 bash[9702]:     yield (conn, connr)
> Jun 29 09:00:07 osd-5 bash[9702]:   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
> Jun 29 09:00:07 osd-5 bash[9702]:     code, '\n'.join(err)))
>
> 0|0[root@osd-1 ~]# ceph orch upgrade ls
> Error EINVAL: Traceback (most recent call last):
>   File "/usr/share/ceph/mgr/mgr_module.py", line 1384, in _handle_command
>     return self.handle_command(inbuf, cmd)
>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 168, in handle_command
>     return dispatch[cmd['prefix']].call(self, cmd, inbuf)
>   File "/usr/share/ceph/mgr/mgr_module.py", line 397, in call
>     return self.func(mgr, **kwargs)
>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 107, in <lambda>
>     wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)  # noqa: E731
>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 96, in wrapper
>     return func(*args, **kwargs)
>   File "/usr/share/ceph/mgr/orchestrator/module.py", line 1337, in _upgrade_ls
>     r = raise_if_exception(completion)
>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 225, in raise_if_exception
>     raise e
> requests.exceptions.ConnectionError: None: Max retries exceeded with url: /v2/ceph/ceph/tags/list (Caused by None)
>
> --
> Mathias Kuhring
> Dr. rer. nat., Bioinformatician
> HPC & Core Unit Bioinformatics
> Berlin Institute of Health at Charité (BIH)
> E-Mail: mathias.kuhring@xxxxxxxxxxxxxx
> Mobile: +49 172 3475576
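Since fixing the DNS settings is also what made the registry queries work again, a quick way to confirm whether the active MGR container can resolve and reach quay.io is sketched below. Again only a sketch: the container name is a placeholder, Docker is assumed as the container engine (as in the quoted logs), and it assumes getent and curl are available inside the image.

  # Name resolution and registry reachability from the host:
  getent hosts quay.io
  curl -sS https://quay.io/v2/

  # The same checks from inside the active MGR container
  # ("<mgr-container>" is a placeholder; see docker ps):
  docker exec <mgr-container> cat /etc/resolv.conf
  docker exec <mgr-container> getent hosts quay.io
  docker exec <mgr-container> curl -sS https://quay.io/v2/

Any HTTP response (even an "unauthorized" error body) means the registry is reachable from inside the container; a timeout like in the quoted logs points at DNS or network problems in there.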