Hi Adrian,

Could you please open a tracker issue (https://tracker.ceph.com/) and share
the traceback from the Dashboard crash?

Thank you!

Kind Regards,
Ernesto

On Sat, Mar 4, 2023 at 9:50 AM Adrian Nicolae <adrian.nicolae@xxxxxxxxxx> wrote:
> Hi,
>
> I have some orchestrator issues on our cluster running 16.2.9 with
> rgw-only services.
>
> We first noticed these issues a few weeks ago when adding new hosts to
> the cluster: the orchestrator was not detecting the new drives to build
> the osd containers for them. While debugging the mgr logs, I noticed
> that the mgr was crashing due to the dashboard module. I disabled the
> dashboard module and the new drives were detected and added to the
> cluster.
>
> Now we have other, similar issues: we have a failed drive. The failure
> was detected, the osd was marked down, and the rebalancing has finished.
> I want to remove the failed osd from the cluster, but it looks like the
> orchestrator is not working:
>
> - I launched the osd removal with 'ceph orch osd rm 92 --force', where
>   92 is the osd id in question.
>
> - I checked the progress, but nothing happens even after a few days:
>
>   ceph orch osd rm status
>   OSD  HOST    STATE    PGS  REPLACE  FORCE  ZAP    DRAIN STARTED AT
>   92   node10  started  0    False    True   False
>
> - The osd process is stopped on that host, and from the orchestrator
>   side I can see this:
>
>   ceph orch ps --daemon_type osd --daemon_id 92
>   NAME    HOST    PORTS  STATUS  REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID
>   osd.92  node10         error   11h ago    4w   -        4096M    <unknown>  <unknown>
>
> - I see the same long refresh interval on other osds as well; as far as
>   I know it should be around 10 minutes:
>
>   NAME    HOST    PORTS  STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
>   osd.93  node09         running (4w)  11h ago    4w   5573M    4096M    16.2.9   3520ead5eb19  d2f658e9e37b
>
>   osd: 116 osds: 115 up (since 11d), 115 in (since 11d)
>
>   90  hdd  16.37109  1.00000  16 TiB  4.0 TiB  4.0 TiB  0 B  17 GiB  12 TiB  24.66  0.97  146  up
>   92  hdd  0         0        0 B     0 B      0 B      0 B  0 B     0 B     0      0     0    down
>   94  hdd  16.37109  1.00000  16 TiB  4.0 TiB  4.0 TiB  0 B  17 GiB  12 TiB  24.66  0.97  146  up
>
> - I enabled debug level 20 on the mgr, but I can't see any errors or
>   other clues regarding the osd removal. I also switched to the standby
>   manager with 'ceph mgr fail'. The mgr switch works, but still nothing
>   happens.
>
> - It's not only the osd removal: I also tried to deploy new rgw services
>   by applying rgw labels on 2 new hosts (we have specs that build rgw
>   containers when the label is detected). Again, nothing happens.
>
> I'm planning to upgrade to 16.2.11 to see if this solves the issues, but
> I'm not very confident; I didn't see anything about this in the
> changelogs. Is there anything else I can try to debug this issue?
>
> Thanks.
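For anyone who needs to pull the traceback Ernesto is asking for: if the
mgr recorded the dashboard crash, the crash module can print it without
digging through log files. A minimal sketch, assuming the crash was
captured (the crash ID is a placeholder to be taken from the listing):

    # list the crashes the cluster has recorded
    ceph crash ls

    # print the full metadata, including the Python traceback, for one crash
    ceph crash info <crash-id>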
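On the stuck 'ceph orch osd rm' itself, note that cephadm logs to its own
cluster-log channel, which a plain 'debug 20' on the mgr does not surface.
A sketch using the documented cephadm troubleshooting settings:

    # raise the cephadm channel to debug and watch it live
    ceph config set mgr mgr/cephadm/log_to_cluster_level debug
    ceph -W cephadm

    # or dump the most recent cephadm log entries
    ceph log last cephadm

    # force an inventory refresh instead of waiting for the next cycle
    ceph orch device ls --refresh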
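And for the rgw deployments that never start, two quick sanity checks,
assuming the spec uses label-based placement:

    # confirm the new hosts actually carry the label the spec expects
    ceph orch host ls

    # confirm the orchestrator knows the rgw spec and review its placement
    ceph orch ls rgw --export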