Hi,
I have some orchestrator issues on our cluster running 16.2.9 with
rgw-only services.
We first noticed these issues a few weeks ago when adding new hosts to
the cluster: the orchestrator was not detecting the new drives, so no
osd containers were built for them. Going through the mgr logs, I
noticed that the mgr was crashing because of the dashboard module. Once
I disabled the dashboard module, the new drives were detected and added
to the cluster.
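For reference, this is roughly what I did; 'ceph crash ls' is just one
way to confirm the module crash, I mostly went by the mgr log itself:

# check for recorded mgr module crashes
ceph crash ls
# disable the crashing dashboard module
ceph mgr module disable dashboard
# force a fresh device inventory scan instead of waiting for the cache
ceph orch device ls --refresh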
Now we have other, similar issues: we have a failed drive. The failure
was detected, the osd was marked down, and the rebalancing has
finished. I want to remove the failed osd from the cluster, but the
orchestrator does not seem to be doing anything:
- I launched the osd removal with 'ceph orch osd rm 92 --force', where
92 is the id of the failed osd
- I checked the progress, but nothing has happened even after a few days:
ceph orch osd rm status
OSD  HOST    STATE    PGS  REPLACE  FORCE  ZAP    DRAIN STARTED AT
92   node10  started  0    False    True   False
- the osd process is stopped on that host, and from the orchestrator
side I see this:
ceph orch ps --daemon_type osd --daemon_id 92
NAME    HOST    PORTS  STATUS  REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID
osd.92  node10         error   11h ago    4w   -        4096M    <unknown>  <unknown>
- I see the same stale REFRESHED timestamp on other osds as well; as
far as I know the refresh interval should be around 10 minutes (see the
refresh commands after this list):
NAME    HOST    PORTS  STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
osd.93  node09         running (4w)  11h ago    4w   5573M    4096M    16.2.9   3520ead5eb19  d2f658e9e37b
- for completeness, the 'ceph -s' service line and the 'ceph osd df'
entries around the failed osd:
osd: 116 osds: 115 up (since 11d), 115 in (since 11d)

ID  CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP  META    AVAIL   %USE   VAR   PGS  STATUS
90  hdd    16.37109  1.00000   16 TiB  4.0 TiB  4.0 TiB  0 B   17 GiB  12 TiB  24.66  0.97  146  up
92  hdd    0         0         0 B     0 B      0 B      0 B   0 B     0 B     0      0     0    down
94  hdd    16.37109  1.00000   16 TiB  4.0 TiB  4.0 TiB  0 B   17 GiB  12 TiB  24.66  0.97  146  up
- I activated debug 20 on the mgr but can't see any errors or other
clues regarding the osd removal (the cephadm-specific debug commands
are after this list). I also failed over to the standby mgr with
'ceph mgr fail'; the failover works, but still nothing happens
- it's not only the osd removal: I also tried to deploy new rgw
services by applying the rgw label to 2 new hosts (we have service
specs that deploy rgw containers on hosts carrying that label). Again,
nothing happens (see the label commands after this list).
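The refresh commands mentioned above: forcing a refresh and
sanity-checking the cache timeout and host connectivity (the timeout
should still be the cephadm default, I haven't changed it):

# ask cephadm to refresh daemon state now instead of waiting for the cache
ceph orch ps --refresh
# the daemon cache timeout should be 600 seconds by default
ceph config get mgr mgr/cephadm/daemon_cache_timeout
# check that the hosts themselves look healthy to cephadm
ceph orch host ls
ceph cephadm check-host node10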
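The cephadm-specific debug commands, taken from the cephadm
troubleshooting docs, in case plain debug 20 on the mgr is not enough:

# raise cephadm's own log level and stream it live
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph -W cephadm --watch-debug
# or dump the recent cephadm log after the fact
ceph log last 100 debug cephadm
# and lower the level again when done
ceph config set mgr mgr/cephadm/log_to_cluster_level info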
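The label commands for the rgw part; the label name and hostname here
are illustrative, ours are anonymized like the rest:

# label one of the new hosts; our spec should then schedule rgw daemons on it
ceph orch host label add node16 rgw
# verify the spec and what the orchestrator actually scheduled
ceph orch ls rgw --export
ceph orch ls rgw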
I'm planning to upgrade to 16.2.11 to see if that solves the issues,
but I'm not very confident: I didn't see anything related to this in
the changelogs. Is there anything else I can try to debug this issue?
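For the stuck osd, the workaround I'm considering is to bypass the
orchestrator entirely; please tell me if this is a bad idea (the device
path is just a placeholder for the failed drive):

# cancel the queued, stuck orchestrator removal
ceph orch osd rm stop 92
# purge the osd from the CRUSH map, auth and the osdmap in one step
ceph osd purge 92 --yes-i-really-mean-it
# clear the leftover daemon entry and wipe the disk before replacing it
ceph orch daemon rm osd.92 --force
ceph orch device zap node10 /dev/sdX --force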
Thanks.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx