Hi,
I have some orchestrator issues on our cluster running 16.2.9 with
rgw-only services.
We first noticed these issues a few weeks ago when adding new hosts to
the cluster: the orchestrator was not detecting the new drives, so no
osd containers were built for them. Going through the mgr logs, I
noticed that the mgr was crashing because of the dashboard module. Once
I disabled the dashboard module, the new drives were detected and added
to the cluster.
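For reference, this is roughly what I did; 'ceph crash ls' is just one
way to confirm the module crash, I mostly went by the mgr log itself:

# check for recorded mgr module crashes
ceph crash ls
# disable the crashing dashboard module
ceph mgr module disable dashboard
# force a fresh device inventory scan instead of waiting for the cache
ceph orch device ls --refresh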
Now we have other, similar issues: we have a failed drive. The failure
was detected, the osd was marked down, and the rebalancing has
finished. I want to remove the failed osd from the cluster, but the
orchestrator does not seem to be doing anything:
- I launched the osd removal with 'ceph orch osd rm 92 --force', where
92 is the id of the failed osd
- I checked the progress, but nothing has happened even after a few days:
ceph orch osd rm status
OSD  HOST    STATE    PGS  REPLACE  FORCE  ZAP    DRAIN STARTED AT
92   node10  started  0    False    True   False
- the osd process is stopped on that host, and from the orchestrator
side I see this:
ceph orch ps --daemon_type osd --daemon_id 92
NAME    HOST    PORTS  STATUS  REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID
osd.92  node10         error   11h ago    4w   -        4096M    <unknown>  <unknown>
- I see the same stale REFRESHED timestamp on other osds as well; as
far as I know the refresh interval should be around 10 minutes (see the
refresh commands after this list):
NAME    HOST    PORTS  STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
osd.93  node09         running (4w)  11h ago    4w   5573M    4096M    16.2.9   3520ead5eb19  d2f658e9e37b
- for completeness, the 'ceph -s' service line and the 'ceph osd df'
entries around the failed osd:
osd: 116 osds: 115 up (since 11d), 115 in (since 11d)

ID  CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP  META    AVAIL   %USE   VAR   PGS  STATUS
90  hdd    16.37109  1.00000   16 TiB  4.0 TiB  4.0 TiB  0 B   17 GiB  12 TiB  24.66  0.97  146  up
92  hdd    0         0         0 B     0 B      0 B      0 B   0 B     0 B     0      0     0    down
94  hdd    16.37109  1.00000   16 TiB  4.0 TiB  4.0 TiB  0 B   17 GiB  12 TiB  24.66  0.97  146  up
- I activated debug 20 on the mgr but can't see any errors or other
clues regarding the osd removal (the cephadm-specific debug commands
are after this list). I also failed over to the standby mgr with
'ceph mgr fail'; the failover works, but still nothing happens
- it's not only the osd removal: I also tried to deploy new rgw
services by applying the rgw label to 2 new hosts (we have service
specs that deploy rgw containers on hosts carrying that label). Again,
nothing happens (see the label commands after this list).
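The refresh commands mentioned above: forcing a refresh and
sanity-checking the cache timeout and host connectivity (the timeout
should still be the cephadm default, I haven't changed it):

# ask cephadm to refresh daemon state now instead of waiting for the cache
ceph orch ps --refresh
# the daemon cache timeout should be 600 seconds by default
ceph config get mgr mgr/cephadm/daemon_cache_timeout
# check that the hosts themselves look healthy to cephadm
ceph orch host ls
ceph cephadm check-host node10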
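The cephadm-specific debug commands, taken from the cephadm
troubleshooting docs, in case plain debug 20 on the mgr is not enough:

# raise cephadm's own log level and stream it live
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph -W cephadm --watch-debug
# or dump the recent cephadm log after the fact
ceph log last 100 debug cephadm
# and lower the level again when done
ceph config set mgr mgr/cephadm/log_to_cluster_level info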
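The label commands for the rgw part; the label name and hostname here
are illustrative, ours are anonymized like the rest:

# label one of the new hosts; our spec should then schedule rgw daemons on it
ceph orch host label add node16 rgw
# verify the spec and what the orchestrator actually scheduled
ceph orch ls rgw --export
ceph orch ls rgw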
I'm planning to upgrade to 16.2.11 to see if that solves the issues,
but I'm not very confident: I didn't see anything related to this in
the changelogs. Is there anything else I can try to debug this issue?
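For the stuck osd, the workaround I'm considering is to bypass the
orchestrator entirely; please tell me if this is a bad idea (the device
path is just a placeholder for the failed drive):

# cancel the queued, stuck orchestrator removal
ceph orch osd rm stop 92
# purge the osd from the CRUSH map, auth and the osdmap in one step
ceph osd purge 92 --yes-i-really-mean-it
# clear the leftover daemon entry and wipe the disk before replacing it
ceph orch daemon rm osd.92 --force
ceph orch device zap node10 /dev/sdX --force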
Thanks.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx