We noticed that our DNS settings were inconsistent and partially wrong. NetworkManager had somehow set useless nameservers in /etc/resolv.conf on our hosts, and in particular the DNS settings inside the MGR containers needed fixing as well.

I fixed /etc/resolv.conf on our hosts and in the container of the active MGR daemon. This resolved all the issues I described, including the output of `ceph orch ps` and `ceph orch ls` as well as the registry queries behind `docker pull` and `ceph orch upgrade ls`. Afterwards I was able to upgrade to Quincy, and as far as I can tell the newly deployed MGR containers picked up the proper DNS settings from the hosts.
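For reference, the check and fix were roughly along the lines of the sketch below. This is not a copy of our exact commands: it assumes Docker as the container engine (as in the logs quoted further down), and both the MGR container name and the nameserver IP are placeholders.

  # Check the DNS settings on the host and inside the active MGR container
  # ("<mgr-container>" is a placeholder; the real name shows up in docker ps):
  cat /etc/resolv.conf
  docker ps --format '{{.Names}}' | grep mgr
  docker exec <mgr-container> cat /etc/resolv.conf

  # Fix /etc/resolv.conf on the host with your usual tooling, then put a
  # working nameserver into the running MGR container as well
  # (this replaces the file; "<dns-ip>" is a placeholder):
  docker exec <mgr-container> sh -c 'echo "nameserver <dns-ip>" > /etc/resolv.conf'

  # Optionally, a MGR failover (as in the quoted mail below) restarts the
  # orchestrator module:
  ceph mgr fail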
Best,
Mathias

On 6/29/2022 10:45 AM, Mathias Kuhring wrote:
> Dear Ceph community,
>
> we are in the curious situation that typical orchestrator queries
> provide wrong or outdated information about different services.
> E.g. `ceph orch ls` reports wrong numbers of active services.
> Or `ceph orch ps` reports many OSDs as "starting" and many services
> with an old version (15.2.14, but we are on 16.2.7).
> Also the refresh times seem way off (capital M == months?).
> However, the cluster is healthy (`ceph status` is happy).
> And spot checks of affected services with systemctl also show
> that they are up and OK.
>
> We already tried the following without success:
>
> a) re-registering cephadm as orchestrator backend
>
> 0|0[root@osd-1 ~]# ceph orch pause
> 0|0[root@osd-1 ~]# ceph orch set backend ''
> 0|0[root@osd-1 ~]# ceph mgr module disable cephadm
> 0|0[root@osd-1 ~]# ceph orch ls
> Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
> 0|0[root@osd-1 ~]# ceph mgr module enable cephadm
> 0|0[root@osd-1 ~]# ceph orch set backend 'cephadm'
>
> b) a failover of the MGR (hoping it would restart/reset the orchestrator module)
>
> 0|0[root@osd-1 ~]# ceph status | grep mgr
>     mgr: osd-1(active, since 6m), standbys: osd-5.jcfyqe, osd-4.oylrhe, osd-3
> 0|0[root@osd-1 ~]# ceph mgr fail
> 0|0[root@osd-1 ~]# ceph status | grep mgr
>     mgr: osd-5.jcfyqe(active, since 7s), standbys: osd-4.oylrhe, osd-3, osd-1
>
> Is there any other way to somehow reset the orchestrator information/connection?
> I added the relevant outputs below.
>
> I also went through the MGR logs and found an issue with querying the
> docker repos.
> I attempted to upgrade the MGRs to 16.2.9 a few weeks ago due to a
> different bug, but this upgrade never went through, apparently because
> cephadm was not able to pull the image.
> Interestingly, I am able to pull the image manually with docker pull,
> but cephadm is not.
> I also get an error with `ceph orch upgrade ls` when checking on
> available versions.
> I'm not sure if this is relevant to the orchestrator problem we have,
> but to be safe I added the logs/output below as well.
>
> Thank you for all your help!
>
> Best Wishes,
> Mathias
>
>
> 0|0[root@osd-1 ~]# ceph status
>   cluster:
>     id:     55633ec3-6c0c-4a02-990c-0f87e0f7a01f
>     health: HEALTH_OK
>
>   services:
>     mon:           5 daemons, quorum osd-1,osd-2,osd-5,osd-4,osd-3 (age 86m)
>     mgr:           osd-5.jcfyqe(active, since 21m), standbys: osd-4.oylrhe, osd-3, osd-1
>     mds:           1/1 daemons up, 1 standby
>     osd:           270 osds: 270 up (since 13d), 270 in (since 5w)
>     cephfs-mirror: 1 daemon active (1 hosts)
>     rgw:           3 daemons active (3 hosts, 2 zones)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   17 pools, 6144 pgs
>     objects: 692.54M objects, 1.2 PiB
>     usage:   1.8 PiB used, 1.7 PiB / 3.5 PiB avail
>     pgs:     6114 active+clean
>              29   active+clean+scrubbing+deep
>              1    active+clean+scrubbing
>
>   io:
>     client: 0 B/s rd, 421 MiB/s wr, 52 op/s rd, 240 op/s wr
>
> 0|0[root@osd-1 ~]# ceph orch ls
> NAME                       PORTS                    RUNNING  REFRESHED   AGE  PLACEMENT
> alertmanager               ?:9093,9094                  0/1  -           8M   count:1
> cephfs-mirror                                           0/1  -           5M   count:1
> crash                                                   2/6  7M ago      4M   *
> grafana                    ?:3000                       0/1  -           8M   count:1
> ingress.rgw.default        172.16.39.131:443,1967       0/2  -           4M   osd-1
> ingress.rgw.ext            172.16.39.132:443,1968       4/2  7M ago      4M   osd-5
> ingress.rgw.ext-website    172.16.39.133:443,1969       0/2  -           4M   osd-3
> mds.cephfs                                              2/2  9M ago      4M   count-per-host:1;label:mds
> mgr                                                     5/5  9M ago      9M   count:5
> mon                                                     5/5  9M ago      9M   count:5
> node-exporter              ?:9100                       2/6  7M ago      7w   *
> osd.all-available-devices                                 0  -           5w   *
> osd.osd                                                  54  <deleting>  7M   label:osd
> osd.unmanaged                                           180  9M ago      -    <unmanaged>
> prometheus                 ?:9095                       0/2  -           8M   count:2
> rgw.cubi                                                4/0  9M ago      -    <unmanaged>
> rgw.default                ?:8100                       2/1  7M ago      4M   osd-1
> rgw.ext                    ?:8100                       2/1  7M ago      4M   osd-5
> rgw.ext-website            ?:8200                       0/1  -           4M   osd-3
>
> 0|0[root@osd-1 ~]# ceph orch ps | grep starting | head -n 3
> osd.0   osd-1  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>
> osd.1   osd-2  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>
> osd.10  osd-1  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>
>
> 0|0[root@osd-1 ~]# ceph orch ps | grep 15.2.14 | head -n 3
> mds.cephfs.osd-1.fhmalo  osd-1  running (9M)  9M ago  9M  370M   -      15.2.14  d4c4064fa0de  f138649b2e4f
> mds.cephfs.osd-2.vqanmk  osd-2  running (9M)  9M ago  9M  3119M  -      15.2.14  d4c4064fa0de  a2752217770f
> osd.100                  osd-1  running (9M)  9M ago  9M  3525M  3072M  15.2.14  d4c4064fa0de  1ea3fc9c3caf
>
> 0|0[root@osd-1 ~]# cephadm version
> Using recent ceph image
> quay.io/ceph/ceph@sha256:bb6a71f7f481985f6d3b358e3b9ef64c6755b3db5aa53198e0aac38be5c8ae54
> ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
>
> 0|0[root@osd-1 ~]# ceph versions
> {
>     "mon": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 5
>     },
>     "mgr": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 4
>     },
>     "osd": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 270
>     },
>     "mds": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
>     },
>     "cephfs-mirror": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 1
>     },
>     "rgw": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 3
>     },
>     "overall": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 285
>     }
> }
>
> From MGR logs:
>
> Jun 29 09:00:07 osd-5 bash[9702]: debug 2022-06-29T07:00:07.046+0000 7fdd4e467700  0 [cephadm ERROR cephadm.serve] cephadm exited with an error code: 1, stderr:Pulling container image quay.io/ceph/ceph:v16.2.9...
> Jun 29 09:00:07 osd-5 bash[9702]: Non-zero exit code 1 from /bin/docker pull quay.io/ceph/ceph:v16.2.9
> Jun 29 09:00:07 osd-5 bash[9702]: /bin/docker: stderr Error response from daemon: Get "https://quay.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
> Jun 29 09:00:07 osd-5 bash[9702]: ERROR: Failed command: /bin/docker pull quay.io/ceph/ceph:v16.2.9
> Jun 29 09:00:07 osd-5 bash[9702]: Traceback (most recent call last):
> Jun 29 09:00:07 osd-5 bash[9702]:   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
> Jun 29 09:00:07 osd-5 bash[9702]:     yield (conn, connr)
> Jun 29 09:00:07 osd-5 bash[9702]:   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
> Jun 29 09:00:07 osd-5 bash[9702]:     code, '\n'.join(err)))
>
> 0|0[root@osd-1 ~]# ceph orch upgrade ls
> Error EINVAL: Traceback (most recent call last):
>   File "/usr/share/ceph/mgr/mgr_module.py", line 1384, in _handle_command
>     return self.handle_command(inbuf, cmd)
>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 168, in handle_command
>     return dispatch[cmd['prefix']].call(self, cmd, inbuf)
>   File "/usr/share/ceph/mgr/mgr_module.py", line 397, in call
>     return self.func(mgr, **kwargs)
>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 107, in <lambda>
>     wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)  # noqa: E731
>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 96, in wrapper
>     return func(*args, **kwargs)
>   File "/usr/share/ceph/mgr/orchestrator/module.py", line 1337, in _upgrade_ls
>     r = raise_if_exception(completion)
>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 225, in raise_if_exception
>     raise e
> requests.exceptions.ConnectionError: None: Max retries exceeded with url: /v2/ceph/ceph/tags/list (Caused by None)
>
> --
> Mathias Kuhring
> Dr. rer. nat., Bioinformatician
> HPC & Core Unit Bioinformatics
> Berlin Institute of Health at Charité (BIH)
> E-Mail: mathias.kuhring@xxxxxxxxxxxxxx
> Mobile: +49 172 3475576
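Since fixing the DNS settings is also what made the registry queries work again, a quick way to confirm whether the active MGR container can resolve and reach quay.io is sketched below. Again only a sketch: the container name is a placeholder, Docker is assumed as the container engine (as in the quoted logs), and it assumes getent and curl are available inside the image.

  # Name resolution and registry reachability from the host:
  getent hosts quay.io
  curl -sS https://quay.io/v2/

  # The same checks from inside the active MGR container
  # ("<mgr-container>" is a placeholder; see docker ps):
  docker exec <mgr-container> cat /etc/resolv.conf
  docker exec <mgr-container> getent hosts quay.io
  docker exec <mgr-container> curl -sS https://quay.io/v2/

Any HTTP response (even an "unauthorized" error body) means the registry is reachable from inside the container; a timeout like in the quoted logs points at DNS or network problems in there.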