Dear Ceph community,

we are in the curious situation that typical orchestrator queries return wrong or outdated information about various services. E.g. `ceph orch ls` reports wrong numbers of active services, and `ceph orch ps` lists many OSDs as "starting" and many services with an old version (15.2.14, although we are on 16.2.7). The refresh times also seem way off (capital M == months?). However, the cluster is healthy (`ceph status` is happy), and spot checks of affected services with systemctl also show that they are up and OK.

We already tried the following, without success:

a) re-registering cephadm as the orchestrator backend

0|0[root@osd-1 ~]# ceph orch pause
0|0[root@osd-1 ~]# ceph orch set backend ''
0|0[root@osd-1 ~]# ceph mgr module disable cephadm
0|0[root@osd-1 ~]# ceph orch ls
Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
0|0[root@osd-1 ~]# ceph mgr module enable cephadm
0|0[root@osd-1 ~]# ceph orch set backend 'cephadm'

b) a failover of the MGR (hoping it would restart/reset the orchestrator module)

0|0[root@osd-1 ~]# ceph status | grep mgr
    mgr: osd-1(active, since 6m), standbys: osd-5.jcfyqe, osd-4.oylrhe, osd-3
0|0[root@osd-1 ~]# ceph mgr fail
0|0[root@osd-1 ~]# ceph status | grep mgr
    mgr: osd-5.jcfyqe(active, since 7s), standbys: osd-4.oylrhe, osd-3, osd-1

Is there any other way to reset the orchestrator information/connection? I have added the relevant outputs below.

I also went through the MGR logs and found an issue with querying the Docker registry. I attempted to upgrade the MGRs to 16.2.9 a few weeks ago because of a different bug, but that upgrade never went through, apparently because cephadm was not able to pull the image. Interestingly, I am able to pull the image manually with docker pull, but cephadm is not. I also get an error from `ceph orch upgrade ls` when checking the available versions. I'm not sure whether this is relevant to the orchestrator problem we have, but to be safe I have added those logs/outputs below as well.
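For reference, the manual pull I mentioned is simply the following, run directly on the host with docker (typed from memory here, so please treat it as a sketch rather than a paste); it completes fine, while the identical `/bin/docker pull` that cephadm issues (see the MGR log below) times out against quay.io:

0|0[root@osd-1 ~]# docker pull quay.io/ceph/ceph:v16.2.9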
Thank you for all your help!

Best Wishes,
Mathias

0|0[root@osd-1 ~]# ceph status
  cluster:
    id:     55633ec3-6c0c-4a02-990c-0f87e0f7a01f
    health: HEALTH_OK

  services:
    mon:           5 daemons, quorum osd-1,osd-2,osd-5,osd-4,osd-3 (age 86m)
    mgr:           osd-5.jcfyqe(active, since 21m), standbys: osd-4.oylrhe, osd-3, osd-1
    mds:           1/1 daemons up, 1 standby
    osd:           270 osds: 270 up (since 13d), 270 in (since 5w)
    cephfs-mirror: 1 daemon active (1 hosts)
    rgw:           3 daemons active (3 hosts, 2 zones)

  data:
    volumes: 1/1 healthy
    pools:   17 pools, 6144 pgs
    objects: 692.54M objects, 1.2 PiB
    usage:   1.8 PiB used, 1.7 PiB / 3.5 PiB avail
    pgs:     6114 active+clean
             29   active+clean+scrubbing+deep
             1    active+clean+scrubbing

  io:
    client:   0 B/s rd, 421 MiB/s wr, 52 op/s rd, 240 op/s wr

0|0[root@osd-1 ~]# ceph orch ls
NAME                       PORTS                   RUNNING  REFRESHED   AGE  PLACEMENT
alertmanager               ?:9093,9094                 0/1  -           8M   count:1
cephfs-mirror                                          0/1  -           5M   count:1
crash                                                  2/6  7M ago      4M   *
grafana                    ?:3000                      0/1  -           8M   count:1
ingress.rgw.default        172.16.39.131:443,1967      0/2  -           4M   osd-1
ingress.rgw.ext            172.16.39.132:443,1968      4/2  7M ago      4M   osd-5
ingress.rgw.ext-website    172.16.39.133:443,1969      0/2  -           4M   osd-3
mds.cephfs                                             2/2  9M ago      4M   count-per-host:1;label:mds
mgr                                                    5/5  9M ago      9M   count:5
mon                                                    5/5  9M ago      9M   count:5
node-exporter              ?:9100                      2/6  7M ago      7w   *
osd.all-available-devices                                0  -           5w   *
osd.osd                                                 54  <deleting>  7M   label:osd
osd.unmanaged                                          180  9M ago      -    <unmanaged>
prometheus                 ?:9095                      0/2  -           8M   count:2
rgw.cubi                                               4/0  9M ago      -    <unmanaged>
rgw.default                ?:8100                      2/1  7M ago      4M   osd-1
rgw.ext                    ?:8100                      2/1  7M ago      4M   osd-5
rgw.ext-website            ?:8200                      0/1  -           4M   osd-3

0|0[root@osd-1 ~]# ceph orch ps | grep starting | head -n 3
osd.0    osd-1  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>
osd.1    osd-2  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>
osd.10   osd-1  starting  -  -  -  3072M  <unknown>  <unknown>  <unknown>

0|0[root@osd-1 ~]# ceph orch ps | grep 15.2.14 | head -n 3
mds.cephfs.osd-1.fhmalo  osd-1  running (9M)  9M ago  9M  370M   -      15.2.14  d4c4064fa0de  f138649b2e4f
mds.cephfs.osd-2.vqanmk  osd-2  running (9M)  9M ago  9M  3119M  -      15.2.14  d4c4064fa0de  a2752217770f
osd.100                  osd-1  running (9M)  9M ago  9M  3525M  3072M  15.2.14  d4c4064fa0de  1ea3fc9c3caf

0|0[root@osd-1 ~]# cephadm version
Using recent ceph image quay.io/ceph/ceph@sha256:bb6a71f7f481985f6d3b358e3b9ef64c6755b3db5aa53198e0aac38be5c8ae54
ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)

0|0[root@osd-1 ~]# ceph versions
{
    "mon": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 5
    },
    "mgr": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 4
    },
    "osd": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 270
    },
    "mds": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
    },
    "cephfs-mirror": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 1
    },
    "rgw": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 3
    },
    "overall": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 285
    }
}
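This is the systemctl spot check mentioned above, for one of the OSDs that `ceph orch ps` lists as "starting" (unit name reconstructed here from the cluster fsid, so treat it as a sketch rather than a paste); systemd reports the unit as active/running:

0|0[root@osd-1 ~]# systemctl status ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f@osd.0.service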
From MGR logs:

Jun 29 09:00:07 osd-5 bash[9702]: debug 2022-06-29T07:00:07.046+0000 7fdd4e467700  0 [cephadm ERROR cephadm.serve] cephadm exited with an error code: 1, stderr:Pulling container image quay.io/ceph/ceph:v16.2.9...
Jun 29 09:00:07 osd-5 bash[9702]: Non-zero exit code 1 from /bin/docker pull quay.io/ceph/ceph:v16.2.9
Jun 29 09:00:07 osd-5 bash[9702]: /bin/docker: stderr Error response from daemon: Get "https://quay.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Jun 29 09:00:07 osd-5 bash[9702]: ERROR: Failed command: /bin/docker pull quay.io/ceph/ceph:v16.2.9
Jun 29 09:00:07 osd-5 bash[9702]: Traceback (most recent call last):
Jun 29 09:00:07 osd-5 bash[9702]:   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
Jun 29 09:00:07 osd-5 bash[9702]:     yield (conn, connr)
Jun 29 09:00:07 osd-5 bash[9702]:   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
Jun 29 09:00:07 osd-5 bash[9702]:     code, '\n'.join(err)))

0|0[root@osd-1 ~]# ceph orch upgrade ls
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1384, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 168, in handle_command
    return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 397, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 107, in <lambda>
    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)  # noqa: E731
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 96, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/module.py", line 1337, in _upgrade_ls
    r = raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 225, in raise_if_exception
    raise e
requests.exceptions.ConnectionError: None: Max retries exceeded with url: /v2/ceph/ceph/tags/list (Caused by None)
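PS: If an explicit refresh is supposed to clear stale orchestrator state, that would be the next thing I try; I am assuming (but am not sure) that `--refresh` forces cephadm to re-gather daemon and device state instead of serving the cached data:

0|0[root@osd-1 ~]# ceph orch ps --refresh
0|0[root@osd-1 ~]# ceph orch device ls --refresh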