Hi Malte,

Check this solution posted here [1] by Alex.

Cheers,
Frédéric.

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/PEJC7ANB6EHXWE2W4NIGN2VGBGIX4SD4/

________________________________
From: Malte Stroem <malte.stroem@xxxxxxxxx>
Sent: Thursday, 17 October 2024 20:24
To: Eugen Block; ceph-users@xxxxxxx
Subject: Re: "ceph orch" not working anymore

You're so cool, Eugen. Somehow you seem to find out everything.

Yes, this seems to be the issue, and I suspected a bug there.

Looking here:

https://github.com/ceph/ceph/blob/main/src/pybind/mgr/cephadm/services/osd.py

The diff is included in the code.

What can I do now? Get the latest cephadm and put it on the node? What about the cephadm under /var/lib/ceph/fsid?

I am not sure how to continue.

I would download the latest cephadm and put it under /usr/sbin. Then disable the module with

ceph mgr module disable cephadm

and enable it again with

ceph mgr module enable cephadm

Best,
Malte

On 17.10.24 19:20, Eugen Block wrote:
> Oh, why didn't you mention earlier that you removed OSDs? 😄 It sounds
> like this one:
>
> https://tracker.ceph.com/issues/67329
>
> Quoting Malte Stroem <malte.stroem@xxxxxxxxx>:
>
>> Hello Redouane,
>>
>> thank you. Interesting.
>>
>> ceph config-key dump
>>
>> shows about 42000 lines.
>>
>> What can I search for? Something with OSDs.
>>
>> But there are thousands of entries.
>>
>> And if I find something, how can I fix that?
>>
>> I think there are entries for the OSDs from the broken node we removed.
>>
>> Best,
>> Malte
>>
>> On 17.10.24 17:46, Redouane Kachach wrote:
>>> So basically it's failing here:
>>>
>>>> self.to_remove_osds.load_from_store()
>>>
>>> This function is responsible for loading specs from the mon-store. The
>>> information is stored in JSON format, and it seems the stored JSON
>>> for the OSD(s) is not valid for some reason.
>>> You can see what's stored in the mon-store by running:
>>>
>>>> ceph config-key dump
>>>
>>> Don't share the information publicly here, especially if it's a
>>> production cluster, as it may contain sensitive information about your
>>> cluster.
>>>
>>> Best,
>>> Redo.
>>>
>>> On Thu, Oct 17, 2024 at 5:04 PM Malte Stroem <malte.stroem@xxxxxxxxx> wrote:
>>>
>>>> Thanks Eugen & Redouane,
>>>>
>>>> of course I tried enabling and disabling the cephadm module for the MGRs.
>>>>
>>>> Running ceph mgr module enable cephadm produces this output in the MGR log:
>>>>
>>>> -1 mgr load Failed to construct class in 'cephadm'
>>>> -1 mgr load Traceback (most recent call last):
>>>>   File "/usr/share/ceph/mgr/cephadm/module.py", line 619, in __init__
>>>>     self.to_remove_osds.load_from_store()
>>>>   File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 922, in load_from_store
>>>>     for osd in json.loads(v):
>>>>   File "/lib64/python3.9/json/__init__.py", line 346, in loads
>>>>     return _default_decoder.decode(s)
>>>>   File "/lib64/python3.9/json/decoder.py", line 337, in decode
>>>>     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
>>>>   File "/lib64/python3.9/json/decoder.py", line 355, in raw_decode
>>>>     raise JSONDecodeError("Expecting value", s, err.value) from None
>>>> json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
>>>>
>>>> -1 mgr operator() Failed to run module in active mode ('cephadm')
>>>>
>>>> This comes from inside the MGR container because it's Python 3.9. On the
>>>> hosts it's Python 3.11.
>>>>
>>>> I'm thinking of redeploying an MGR.
>>>>
>>>> Can I stop the existing MGRs?
>>>>
>>>> Redeploying with ceph orch does not work, of course, but I think this will work:
>>>>
>>>> https://docs.ceph.com/en/latest/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
>>>>
>>>> because cephadm standalone is working. Crazy as it sounds.
>>>>
>>>> What do you think?
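A side note on reading that traceback (a minimal, standalone Python sketch, nothing cluster-specific assumed): `Expecting value: line 1 column 1 (char 0)` is exactly what `json.loads` raises when the value it is given is empty, whereas merely malformed JSON fails further into the string. That distinction hints at whether the stored entry is empty or corrupted:

```python
import json

# "char 0" means the decoder found nothing parseable at the very first
# character -- which is what an empty stored value produces:
try:
    json.loads("")
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 1 (char 0)

# A value that is merely *malformed* JSON fails later in the string,
# with a different message and a non-zero character offset:
try:
    json.loads('[{"osd_id": 3,]')  # hypothetical broken fragment
except json.JSONDecodeError as e:
    print(e)
```

So a "column 1 (char 0)" error from load_from_store suggests the mon-store key exists but its value is empty (or starts with something that is not JSON at all).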
>>>>
>>>> Best,
>>>> Malte
>>>>
>>>> On 17.10.24 12:49, Eugen Block wrote:
>>>>> Hi,
>>>>>
>>>>> if you just execute cephadm commands, those are issued locally on the
>>>>> hosts; they won't confirm an orchestrator issue immediately.
>>>>> What does the active MGR log? It could show a stack trace or error
>>>>> messages which could point to a root cause.
>>>>>
>>>>>> What about the cephadm files under /var/lib/ceph/fsid? Can I replace
>>>>>> the latest?
>>>>>
>>>>> Those are the cephadm versions the orchestrator actually uses; it will
>>>>> just download them again from your registry (or upstream).
>>>>> Can you share:
>>>>>
>>>>> ceph -s
>>>>> ceph versions
>>>>> MGR logs (active MGR)
>>>>>
>>>>> Thanks,
>>>>> Eugen
>>>>>
>>>>> Quoting Malte Stroem <malte.stroem@xxxxxxxxx>:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am still struggling here and do not know the root cause of this issue.
>>>>>>
>>>>>> Searching the list I found lots of people who have had the same or a
>>>>>> similar problem over the last years.
>>>>>>
>>>>>> However, there is no solution for our cluster.
>>>>>>
>>>>>> Disabling and enabling the cephadm module does not work. There are no
>>>>>> error messages. When we run "ceph orch ..." we get the error message:
>>>>>>
>>>>>> Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
>>>>>>
>>>>>> But every single cephadm command works!
>>>>>>
>>>>>> cephadm ls, for example.
>>>>>>
>>>>>> Stopping and restarting the MGRs did not help. Removing the .asok
>>>>>> files did not help.
>>>>>>
>>>>>> I'm thinking of stopping both MGRs and trying to deploy a new MGR like this:
>>>>>>
>>>>>> https://docs.ceph.com/en/latest/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
>>>>>>
>>>>>> How could I find the root cause? Is cephadm somehow broken?
>>>>>>
>>>>>> What about the cephadm files under /var/lib/ceph/fsid? Can I replace
>>>>>> the latest?
>>>>>>
>>>>>> Best,
>>>>>> Malte
>>>>>>
>>>>>> On 16.10.24 14:54, Malte Stroem wrote:
>>>>>>> Hi Laimis,
>>>>>>>
>>>>>>> that did not work. ceph orch still does not work.
>>>>>>>
>>>>>>> Best,
>>>>>>> Malte
>>>>>>>
>>>>>>> On 16.10.24 14:12, Malte Stroem wrote:
>>>>>>>> Thank you, Laimis.
>>>>>>>>
>>>>>>>> And you got the same error message? That's strange.
>>>>>>>>
>>>>>>>> In the meantime I'll check for connected clients. No Kubernetes
>>>>>>>> and no CephFS, but RGWs.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Malte
>>>>>>>>
>>>>>>>> On 16.10.24 14:01, Laimis Juzeliūnas wrote:
>>>>>>>>> Hi Malte,
>>>>>>>>>
>>>>>>>>> We have faced this recently when upgrading to Squid from the latest Reef.
>>>>>>>>> As a temporary workaround we disabled the balancer with 'ceph
>>>>>>>>> balancer off' and restarted the mgr daemons.
>>>>>>>>> We suspect older clients (from Kubernetes RBD mounts as well
>>>>>>>>> as CephFS mounts) on servers with incompatible client versions, but
>>>>>>>>> we have yet to dig through it.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Laimis J.
>>>>>>>>>
>>>>>>>>>> On 16 Oct 2024, at 14:57, Malte Stroem <malte.stroem@xxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> Error ENOENT: No orchestrator configured (try `ceph orch set backend`)

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
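With the roughly 42000 lines of `ceph config-key dump` output mentioned in the thread, finding a broken entry by eye is impractical. A hedged sketch of narrowing it down offline, assuming the dump has been saved as JSON (the dump prints a flat object of key/value strings; the `mgr/cephadm/` prefix and the sample key names below are illustrative assumptions based on the traceback and the linked tracker, not verified against any cluster, and since not every stored value is meant to be JSON, any hit is only a candidate to inspect before changing anything):

```python
import json

def find_bad_json_values(dump: dict, prefix: str = "mgr/cephadm/") -> list:
    """Return (key, error) pairs for config-key entries under `prefix`
    whose stored string value does not parse as JSON.

    `dump` is the parsed output of `ceph config-key dump` saved to a file,
    i.e. a flat mapping of key names to string values (assumed format).
    """
    bad = []
    for key, value in dump.items():
        if not key.startswith(prefix):
            continue
        try:
            json.loads(value)
        except json.JSONDecodeError as e:
            bad.append((key, str(e)))
    return bad

# Hypothetical sample resembling a dump with one broken (empty) entry;
# the key names here are illustrative, not taken from a real cluster:
sample = {
    "mgr/cephadm/osd_remove_queue": "",        # empty -> "Expecting value"
    "mgr/cephadm/inventory": '{"host1": {}}',  # parses fine
    "config/mgr/mgr/balancer/active": "true",  # "true" is valid JSON
}
print(find_bad_json_values(sample))
# -> [('mgr/cephadm/osd_remove_queue', 'Expecting value: line 1 column 1 (char 0)')]
```

Since the tracker issue Eugen linked points at the cephadm OSD-removal queue, an unparsable entry under `mgr/cephadm/` is the natural first suspect; back up any value (`ceph config-key get <key>`) before removing or rewriting it.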