orchestrator issues on ceph 16.2.9

Hi,

I have some orchestrator issues on our cluster running 16.2.9 with rgw-only services.

We first noticed these issues a few weeks ago when adding new hosts to the cluster: the orch was not detecting the new drives, so no osd containers were built for them. While digging through the mgr logs, I noticed that the mgr was crashing because of the dashboard module. After I disabled the dashboard module, the new drives were detected and added to the cluster.
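For reference, the workaround was roughly the following (I'm quoting the commands from memory, so treat them as approximate):

ceph mgr module disable dashboard
ceph orch device ls --refresh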

Now we have other, similar issues: we have a failed drive. The failure was detected, the osd was marked down, and rebalancing has finished. I want to remove the failed osd from the cluster, but it looks like the orch is not doing anything:

- I launched the osd removal with 'ceph orch osd rm 92 --force', where 92 is the osd id in question.

- I checked the progress, but nothing has happened even after a few days:

ceph orch osd rm status
OSD  HOST    STATE    PGS  REPLACE  FORCE  ZAP    DRAIN STARTED AT
92   node10  started    0  False    True   False
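I have not yet tried cancelling and re-queueing the removal; if I read the docs correctly, that would be something along the lines of:

ceph orch osd rm stop 92
ceph orch osd rm 92 --force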

- The osd process is stopped on that host, and from the orch side I can see this:

ceph orch ps --daemon_type osd --daemon_id 92
NAME    HOST    PORTS  STATUS  REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID
osd.92  node10         error   11h ago    4w   -        4096M    <unknown>  <unknown>

- I see the same long refresh interval on other osds as well, even though I believe it should be around 10 minutes:

NAME    HOST    PORTS  STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
osd.93  node09         running (4w)  11h ago    4w   5573M    4096M    16.2.9   3520ead5eb19  d2f658e9e37b

- On the cluster side, osd 92 shows as down:

 osd: 116 osds: 115 up (since 11d), 115 in (since 11d)

 90   hdd  16.37109  1.00000  16 TiB  4.0 TiB  4.0 TiB  0 B  17 GiB  12 TiB  24.66  0.97  146    up
 92   hdd         0        0     0 B      0 B      0 B  0 B     0 B     0 B      0     0    0  down
 94   hdd  16.37109  1.00000  16 TiB  4.0 TiB  4.0 TiB  0 B  17 GiB  12 TiB  24.66  0.97  146    up
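Regarding the refresh interval: if I understand the cephadm module correctly, the cadence is governed by the daemon cache timeout (default 600 seconds), and a refresh can be requested manually. The option name and the --refresh flag below are my reading of the docs, so treat them as assumptions:

ceph config get mgr mgr/cephadm/daemon_cache_timeout
ceph orch ps --refresh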

- I enabled debug level 20 on the mgr, but I can't see any errors or other clues regarding the osd removal. I also switched to the standby mgr with 'ceph mgr fail'; the failover works, but still nothing happens.
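I also plan to watch the cephadm module log directly; as far as I understand the troubleshooting docs, that would be roughly:

ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph -W cephadm --watch-debug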

- It's not only the osd removal. I also tried to deploy new rgw services by applying the rgw label to 2 new hosts; we have a service spec that deploys rgw containers on hosts carrying that label (see the sketch below). Again, nothing happens.
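Our spec is roughly along these lines; the file name 'rgw-spec.yml', the service_id 'main' and the label 'rgw' are simplified placeholders, not our exact values:

cat > rgw-spec.yml <<EOF
service_type: rgw
service_id: main
placement:
  label: rgw
EOF
ceph orch apply -i rgw-spec.yml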

I'm planning to upgrade to 16.2.11 to see if that solves the issues, but I'm not very confident, since I didn't see anything related in the changelogs. Is there anything else I can try to debug this issue?

Thanks.

