So basically it's failing here:

> self.to_remove_osds.load_from_store()

This function is responsible for loading Specs from the mon-store. The
information is stored in JSON format, and it seems the stored JSON for the
OSD(s) is not valid for some reason. You can see what's stored in the
mon-store by running:

> ceph config-key dump

Don't share that output publicly here, especially if it's a production
cluster, as it may contain sensitive information about your cluster.
There's a rough sketch in the P.S. below for narrowing down which stored
value is the broken one.

Best,
Redo.
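P.S. A minimal sketch of one way to spot the offending entry, assuming `ceph
config-key dump` prints its usual JSON map of keys to string values, and that
the removal queue lives under a key such as mgr/cephadm/osd_remove_queue (the
exact key name may differ between releases, so treat it only as an example):

#!/usr/bin/env python3
# Rough sketch: list cephadm config-keys whose stored value is not valid JSON.
# Assumes the `ceph` CLI is available on the host; the key name
# "mgr/cephadm/osd_remove_queue" mentioned above is only an example.
import json
import subprocess

# `ceph config-key dump` prints one JSON object mapping key names to values.
out = subprocess.check_output(["ceph", "config-key", "dump"])
stored = json.loads(out)

for key, value in stored.items():
    if not key.startswith("mgr/cephadm/"):
        continue
    try:
        json.loads(value)
    except (TypeError, ValueError):
        # An empty or truncated value here is exactly what makes
        # load_from_store() raise "Expecting value: line 1 column 1".
        print(f"not valid JSON: {key} = {value!r}")

If that flags something like the OSD removal queue with an empty value, one
possible recovery is to save a copy of the dump, remove the broken key with
`ceph config-key rm <key>`, and re-enable the cephadm module. But please
verify against your own cluster before deleting anything.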
On Thu, Oct 17, 2024 at 5:04 PM Malte Stroem <malte.stroem@xxxxxxxxx> wrote:
> Thanks Eugen & Redouane,
>
> of course I tried enabling and disabling the cephadm module for the MGRs.
>
> Running ceph mgr module enable cephadm produces this output in the MGR log:
>
> -1 mgr load Failed to construct class in 'cephadm'
> -1 mgr load Traceback (most recent call last):
>   File "/usr/share/ceph/mgr/cephadm/module.py", line 619, in __init__
>     self.to_remove_osds.load_from_store()
>   File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 922, in load_from_store
>     for osd in json.loads(v):
>   File "/lib64/python3.9/json/__init__.py", line 346, in loads
>     return _default_decoder.decode(s)
>   File "/lib64/python3.9/json/decoder.py", line 337, in decode
>     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
>   File "/lib64/python3.9/json/decoder.py", line 355, in raw_decode
>     raise JSONDecodeError("Expecting value", s, err.value) from None
> json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
>
> -1 mgr operator() Failed to run module in active mode ('cephadm')
>
> This comes from inside the MGR container because it's Python 3.9. On the
> hosts it's Python 3.11.
>
> I'm thinking of redeploying an MGR.
>
> Can I stop the existing MGRs?
>
> Redeploying with ceph orch does not work, of course, but I think this
> will work:
>
> https://docs.ceph.com/en/latest/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
>
> because cephadm standalone is working. Crazy as it sounds.
>
> What do you think?
>
> Best,
> Malte
>
> On 17.10.24 12:49, Eugen Block wrote:
> > Hi,
> >
> > if you just execute cephadm commands, those are issued locally on the
> > hosts, so they won't confirm an orchestrator issue immediately.
> > What does the active MGR log show? It could contain a stack trace or
> > error messages which could point to a root cause.
> >
> >> What about the cephadm files under /var/lib/ceph/fsid? Can I replace
> >> the latest?
> >
> > Those are the cephadm versions the orchestrator actually uses; it will
> > just download them again from your registry (or upstream).
> > Can you share:
> >
> > ceph -s
> > ceph versions
> > MGR logs (active MGR)
> >
> > Thanks,
> > Eugen
> >
> > Quoting Malte Stroem <malte.stroem@xxxxxxxxx>:
> >
> >> Hello,
> >>
> >> I am still struggling here and do not know the root cause of this issue.
> >>
> >> Searching the list I found lots of people who had the same or a
> >> similar problem over the last few years.
> >>
> >> However, there is no solution for our cluster.
> >>
> >> Disabling and enabling the cephadm module does not work. There are no
> >> error messages. When we run "ceph orch..." we get the error message:
> >>
> >> Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
> >>
> >> But every single cephadm command works!
> >>
> >> cephadm ls for example.
> >>
> >> Stopping and restarting the MGRs did not help. Removing the .asok
> >> files did not help.
> >>
> >> I'm thinking of stopping both MGRs and trying to deploy a new MGR like this:
> >>
> >> https://docs.ceph.com/en/latest/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
> >>
> >> How could I find the root cause? Is cephadm somehow broken?
> >>
> >> What about the cephadm files under /var/lib/ceph/fsid? Can I replace
> >> the latest?
> >>
> >> Best,
> >> Malte
> >>
> >> On 16.10.24 14:54, Malte Stroem wrote:
> >>> Hi Laimis,
> >>>
> >>> that did not work. ceph orch still does not work.
> >>>
> >>> Best,
> >>> Malte
> >>>
> >>> On 16.10.24 14:12, Malte Stroem wrote:
> >>>> Thank you, Laimis.
> >>>>
> >>>> And you got the same error message? That's strange.
> >>>>
> >>>> In the meantime I'll check for connected clients. No Kubernetes
> >>>> and no CephFS, but RGWs.
> >>>>
> >>>> Best,
> >>>> Malte
> >>>>
> >>>> On 16.10.24 14:01, Laimis Juzeliūnas wrote:
> >>>>> Hi Malte,
> >>>>>
> >>>>> We have faced this recently when upgrading to Squid from the latest Reef.
> >>>>> As a temporary workaround we disabled the balancer with ‘ceph
> >>>>> balancer off’ and restarted the mgr daemons.
> >>>>> We suspect older clients (from Kubernetes RBD mounts as well as
> >>>>> CephFS mounts) on servers with incompatible client versions but
> >>>>> have yet to dig through it.
> >>>>>
> >>>>> Best,
> >>>>> Laimis J.
> >>>>>
> >>>>>> On 16 Oct 2024, at 14:57, Malte Stroem <malte.stroem@xxxxxxxxx> wrote:
> >>>>>>
> >>>>>> Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx