Hi Adrian,

Could you please open a tracker issue (https://tracker.ceph.com/) and share
the traceback from the Dashboard crash?

Thank you!

Kind Regards,
Ernesto

On Sat, Mar 4, 2023 at 9:50 AM Adrian Nicolae <adrian.nicolae@xxxxxxxxxx> wrote:
> Hi,
>
> I have some orchestrator issues on our cluster running 16.2.9 with
> rgw-only services.
>
> We first noticed these issues a few weeks ago when adding new hosts to
> the cluster: the orchestrator was not detecting the new drives to build
> the osd containers for them. While debugging the mgr logs, I noticed
> that the mgr was crashing due to the dashboard module. I disabled the
> dashboard module and the new drives were detected and added to the
> cluster.
>
> Now we have other, similar issues: we have a failed drive. The failure
> was detected, the osd was marked down, and the rebalancing has finished.
> I want to remove the failed osd from the cluster, but it looks like the
> orchestrator is not working:
>
> - I launched the osd removal with 'ceph orch osd rm 92 --force', where
>   92 is the osd id in question.
>
> - I checked the progress, but nothing happens even after a few days:
>
>   ceph orch osd rm status
>   OSD  HOST    STATE    PGS  REPLACE  FORCE  ZAP    DRAIN STARTED AT
>   92   node10  started  0    False    True   False
>
> - The osd process is stopped on that host, and from the orchestrator
>   side I can see this:
>
>   ceph orch ps --daemon_type osd --daemon_id 92
>   NAME    HOST    PORTS  STATUS  REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID
>   osd.92  node10         error   11h ago    4w   -        4096M    <unknown>  <unknown>
>
> - I see the same long refresh interval on other osds as well; as far as
>   I know it should be around 10 minutes:
>
>   NAME    HOST    PORTS  STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
>   osd.93  node09         running (4w)  11h ago    4w   5573M    4096M    16.2.9   3520ead5eb19  d2f658e9e37b
>
>   osd: 116 osds: 115 up (since 11d), 115 in (since 11d)
>
>   90  hdd  16.37109  1.00000  16 TiB  4.0 TiB  4.0 TiB  0 B  17 GiB  12 TiB  24.66  0.97  146  up
>   92  hdd  0         0        0 B     0 B      0 B      0 B  0 B     0 B     0      0     0    down
>   94  hdd  16.37109  1.00000  16 TiB  4.0 TiB  4.0 TiB  0 B  17 GiB  12 TiB  24.66  0.97  146  up
>
> - I enabled debug level 20 on the mgr, but I can't see any errors or
>   other clues regarding the osd removal. I also switched to the standby
>   manager with 'ceph mgr fail'. The mgr switch works, but still nothing
>   happens.
>
> - It's not only the osd removal: I also tried to deploy new rgw services
>   by applying rgw labels on 2 new hosts (we have specs that build rgw
>   containers when the label is detected). Again, nothing happens.
>
> I'm planning to upgrade to 16.2.11 to see if this solves the issues, but
> I'm not very confident; I didn't see anything about this in the
> changelogs. Is there anything else I can try to debug this issue?
>
> Thanks.
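For anyone who needs to pull the traceback Ernesto is asking for: if the
mgr recorded the dashboard crash, the crash module can print it without
digging through log files. A minimal sketch, assuming the crash was
captured (the crash ID is a placeholder to be taken from the listing):

    # list the crashes the cluster has recorded
    ceph crash ls

    # print the full metadata, including the Python traceback, for one crash
    ceph crash info <crash-id>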
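On the stuck 'ceph orch osd rm' itself, note that cephadm logs to its own
cluster-log channel, which a plain 'debug 20' on the mgr does not surface.
A sketch using the documented cephadm troubleshooting settings:

    # raise the cephadm channel to debug and watch it live
    ceph config set mgr mgr/cephadm/log_to_cluster_level debug
    ceph -W cephadm

    # or dump the most recent cephadm log entries
    ceph log last cephadm

    # force an inventory refresh instead of waiting for the next cycle
    ceph orch device ls --refresh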
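And for the rgw deployments that never start, two quick sanity checks,
assuming the spec uses label-based placement:

    # confirm the new hosts actually carry the label the spec expects
    ceph orch host ls

    # confirm the orchestrator knows the rgw spec and review its placement
    ceph orch ls rgw --export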