Re: "ceph orch" not working anymore

Hi Malte,

Check this solution posted here [1] by Alex.
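
If it is the same thing, the short version is that the cephadm module
fails to parse the OSD removal queue it keeps in the mon-store (see the
traceback further down in this thread), and the usual way out is to
reset that key to valid (empty) JSON before re-enabling the module. A
minimal sketch, assuming the key name is mgr/cephadm/osd_remove_queue
(please verify it against your own "ceph config-key dump" output first):

# inspect the currently stored value
ceph config-key get mgr/cephadm/osd_remove_queue
# only if it is empty or otherwise invalid JSON: reset it to an empty list
ceph config-key set mgr/cephadm/osd_remove_queue '[]'
# then cycle the module and check the orchestrator again
ceph mgr module disable cephadm
ceph mgr module enable cephadm
ceph orch status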

Cheers,
Frédéric.

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/PEJC7ANB6EHXWE2W4NIGN2VGBGIX4SD4/
________________________________
From: Malte Stroem <malte.stroem@xxxxxxxxx>
Sent: Thursday, 17 October 2024 20:24
To: Eugen Block; ceph-users@xxxxxxx
Subject: Re: "ceph orch" not working anymore

You're so cool, Eugen. Somehow you seem to find out everything.

Yes, this seems to be the issue and I suspected a bug there.

Looking here:

https://github.com/ceph/ceph/blob/main/src/pybind/mgr/cephadm/services/osd.py

The diff from the tracker is already included in the code there.

What can I do now? Get the latest cephadm and put it on the node?

What about the cephadm under /var/lib/ceph/fsid?

I am not sure how to continue.

I would download the latest cephadm and put it under /usr/sbin.

Then disable the module with

ceph mgr module disable cephadm

and then re-enable it with

ceph mgr module enable cephadm
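
Roughly, the whole sequence I have in mind would look like this (the
download URL and release path below are just an assumption on my side
and would need to be adjusted to the actual release and distribution):

# fetch a current standalone cephadm and put it on the node
curl --silent --remote-name --location https://download.ceph.com/rpm-18.2.4/el9/noarch/cephadm
chmod +x cephadm
cp cephadm /usr/sbin/cephadm
# cycle the mgr module and point the orchestrator at cephadm again
ceph mgr module disable cephadm
ceph mgr module enable cephadm
ceph orch set backend cephadm
ceph orch status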

Best,
Malte

On 17.10.24 19:20, Eugen Block wrote:
> Oh, why didn’t you mention earlier that you removed OSDs? 😄 It sounds
> like this one:
> 
> https://tracker.ceph.com/issues/67329
> 
> Zitat von Malte Stroem <malte.stroem@xxxxxxxxx>:
> 
>> Hello Redouane,
>>
>> thank you. Interesting.
>>
>> ceph config-key dump
>>
>> shows about 42000 lines.
>>
>> What can I search for? Something related to OSDs?
>>
>> But there are thousands of entries.
>>
>> And if I find something, how can I fix that?
>>
>> I think there are still entries for the OSDs from the broken node we removed.
>>
>> Best,
>> Malte
>>
>> On 17.10.24 17:46, Redouane Kachach wrote:
>>> So basically it's failing here:
>>>
>>>>  self.to_remove_osds.load_from_store()
>>>
>>> This function is responsible for loading specs from the mon-store.
>>> The information is stored in JSON format, and it seems the stored
>>> JSON for the OSD(s) is not valid for some reason. You can see what's
>>> stored in the mon-store by running:
>>>
>>>> ceph config-key dump
>>>
>>> Don't share that output publicly here, especially if it's a
>>> production cluster, as it may contain sensitive information about
>>> your cluster.
>>>
>>> Best,
>>> Redo.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Oct 17, 2024 at 5:04 PM Malte Stroem <malte.stroem@xxxxxxxxx> 
>>> wrote:
>>>
>>>> Thanks Eugen & Redouane,
>>>>
>>>> of course I tried enabling and disabling the cephadm module for the 
>>>> MGRs.
>>>>
>>>> Running ceph mgr module enable cephadm produces this output in the 
>>>> MGR log:
>>>>
>>>> -1 mgr load Failed to construct class in 'cephadm'
>>>>   -1 mgr load Traceback (most recent call last):
>>>>    File "/usr/share/ceph/mgr/cephadm/module.py", line 619, in __init__
>>>>      self.to_remove_osds.load_from_store()
>>>>    File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 922, in
>>>> load_from_store
>>>>      for osd in json.loads(v):
>>>>    File "/lib64/python3.9/json/__init__.py", line 346, in loads
>>>>      return _default_decoder.decode(s)
>>>>    File "/lib64/python3.9/json/decoder.py", line 337, in decode
>>>>      obj, end = self.raw_decode(s, idx=_w(s, 0).end())
>>>>    File "/lib64/python3.9/json/decoder.py", line 355, in raw_decode
>>>>      raise JSONDecodeError("Expecting value", s, err.value) from None
>>>> json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
>>>>
>>>>
>>>>   -1 mgr operator() Failed to run module in active mode ('cephadm')
>>>>
>>>> This traceback comes from inside the MGR container, which uses
>>>> Python 3.9. On the hosts it's Python 3.11.
>>>>
>>>> I am thinking of redeploying an MGR.
>>>>
>>>> Can I stop the existing MGRs?
>>>>
>>>> Redeploying with ceph orch does not work of course, but I think this
>>>> will work:
>>>>
>>>>
>>>> https://docs.ceph.com/en/latest/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
>>>>
>>>> because standalone cephadm is working, crazy as it sounds.
>>>>
>>>> What do you think?
>>>>
>>>> Best,
>>>> Malte
>>>>
>>>> On 17.10.24 12:49, Eugen Block wrote:
>>>>> Hi,
>>>>>
>>>>> if you just execute cephadm commands, those are issued locally on
>>>>> the hosts, so they won't confirm an orchestrator issue immediately.
>>>>> What does the active MGR log show? It could contain a stack trace
>>>>> or error messages pointing to the root cause.
>>>>>
>>>>>> What about the cephadm files under /var/lib/ceph/fsid? Can I replace
>>>>>> the latest?
>>>>>
>>>>> Those are the cephadm versions the orchestrator actually uses; it
>>>>> will just download them again from your registry (or upstream).
>>>>> Can you share:
>>>>>
>>>>> ceph -s
>>>>> ceph versions
>>>>> MGR logs (active MGR)
>>>>>
>>>>> Thanks,
>>>>> Eugen
>>>>>
>>>>> Zitat von Malte Stroem <malte.stroem@xxxxxxxxx>:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am still struggling here and do not know the root cause of this 
>>>>>> issue.
>>>>>>
>>>>>> Searching the list, I found lots of people who have had the same
>>>>>> or a similar problem over the last years.
>>>>>>
>>>>>> However, there is no solution for our cluster.
>>>>>>
>>>>>> Disabling and enabling the cephadm module does not work. There are no
>>>>>> error messages. When we run "ceph orch..." we get the error message:
>>>>>>
>>>>>> Error ENOENT: No orchestrator configured (try `ceph orch set 
>>>>>> backend`)
>>>>>>
>>>>>> But every single cephadm command works!
>>>>>>
>>>>>> cephadm ls for example.
>>>>>>
>>>>>> Stopping and restarting the MGRs did not help. Removing the .asok
>>>>>> files did not help.
>>>>>>
>>>>>> I am thinking of stopping both MGRs and trying to deploy a new
>>>>>> MGR like this:
>>>>>>
>>>>>> https://docs.ceph.com/en/latest/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
>>>>>>
>>>>>> How could I find the root cause? Is cephadm somehow broken?
>>>>>>
>>>>>> What about the cephadm files under /var/lib/ceph/fsid? Can I replace
>>>>>> the latest?
>>>>>>
>>>>>> Best,
>>>>>> Malte
>>>>>>
>>>>>> On 16.10.24 14:54, Malte Stroem wrote:
>>>>>>> Hi Laimis,
>>>>>>>
>>>>>>> that did not work. ceph orch still does not work.
>>>>>>>
>>>>>>> Best,
>>>>>>> Malte
>>>>>>>
>>>>>>> On 16.10.24 14:12, Malte Stroem wrote:
>>>>>>>> Thank you, Laimis.
>>>>>>>>
>>>>>>>> And you got the same error message? That's strange.
>>>>>>>>
>>>>>>>> In the meantime I will check for connected clients. No
>>>>>>>> Kubernetes or CephFS, but there are RGWs.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Malte
>>>>>>>>
>>>>>>>> On 16.10.24 14:01, Laimis Juzeliūnas wrote:
>>>>>>>>> Hi Malte,
>>>>>>>>>
>>>>>>>>> We have faced this recently when upgrading to Squid from latest 
>>>>>>>>> Reef.
>>>>>>>>> As a temporary workaround we disabled the balancer with ‘ceph
>>>>>>>>> balancer off’ and restarted mgr daemons.
>>>>>>>>> We suspect older clients (from Kubernetes RBD mounts as well
>>>>>>>>> as CephFS mounts) on servers with incompatible client versions,
>>>>>>>>> but we have yet to dig through it.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Laimis J.
>>>>>>>>>
>>>>>>>>>> On 16 Oct 2024, at 14:57, Malte Stroem <malte.stroem@xxxxxxxxx>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Error ENOENT: No orchestrator configured (try `ceph orch set
>>>>>>>>>> backend`)
>>>>>>>>>
>>>>>
>>>>>
>>>>
>>>>
> 
> 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



