Re: "ceph orch" not working anymore

Malte Stroem <malte.stroem@xxxxxxxxx> · Thu, 17 Oct 2024 18:58:19 +0200

Hello Redouane,

thank you. Interesting.

ceph config-key dump

shows about 42000 lines.

What can I search for? Something with OSDs.

But there are thousands of entries.

And if I find something, how can I fix that?

I think there are entries of the OSDs from the broken node we removed.

Best,
Malte

On 17.10.24 17:46, Redouane Kachach wrote:
So basically it's failing here:

  self.to_remove_osds.load_from_store()

This function is responsible for loading Specs from the mon-store. The
information is stored in JSON format, and it seems the JSON stored for
the OSD(s) is not valid for some reason. You can see what's stored in
the mon-store by running:

ceph config-key dump

Don't share that output publicly here, especially if it's a production
cluster, as it may contain sensitive information about your cluster.
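
One way to narrow the search without reading the dump by hand is sketched below. It assumes that `ceph config-key dump` prints a single JSON object mapping key names to string values, and that the OSD removal queue is stored under a key starting with `mgr/cephadm/osd_remove_queue`. That prefix is a guess based on the name `to_remove_osds` and should be checked against the real dump. The helper names are made up for illustration. Only key names are reported, never values, so the result is safer to share.

```python
import json
import subprocess

# Assumed key prefix for the cephadm OSD removal queue; verify it in your dump.
ASSUMED_PREFIX = "mgr/cephadm/osd_remove_queue"


def find_invalid_json(dump: dict, prefix: str = ASSUMED_PREFIX) -> list:
    """Return the keys under `prefix` whose stored value does not parse as JSON."""
    bad = []
    for key, value in sorted(dump.items()):
        if not key.startswith(prefix):
            continue
        try:
            json.loads(value)
        except (TypeError, ValueError):  # JSONDecodeError is a ValueError
            bad.append(key)
    return bad


def load_config_key_dump() -> dict:
    """Run `ceph config-key dump` on an admin host and parse its JSON output."""
    result = subprocess.run(
        ["ceph", "config-key", "dump"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Made-up sample data, only to show the behaviour without a cluster.
sample = {
    "mgr/cephadm/osd_remove_queue": "",
    "mgr/cephadm/host.node1": "{}",
    "config/global/fsid": "not-json",
}
print(find_invalid_json(sample))
```

On a real cluster the call would be `find_invalid_json(load_config_key_dump())`. The prefix is kept narrow on purpose: other keys under `mgr/cephadm/` may legitimately hold plain strings rather than JSON, so widening it will likely produce false positives.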

Best,
Redo.

On Thu, Oct 17, 2024 at 5:04 PM Malte Stroem <malte.stroem@xxxxxxxxx> wrote:

Thanks Eugen & Redouane,

of course I tried enabling and disabling the cephadm module for the MGRs.

Running ceph mgr module enable cephadm produces this output in the MGR log:

-1 mgr load Failed to construct class in 'cephadm'
   -1 mgr load Traceback (most recent call last):
    File "/usr/share/ceph/mgr/cephadm/module.py", line 619, in __init__
      self.to_remove_osds.load_from_store()
    File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 922, in load_from_store
      for osd in json.loads(v):
    File "/lib64/python3.9/json/__init__.py", line 346, in loads
      return _default_decoder.decode(s)
    File "/lib64/python3.9/json/decoder.py", line 337, in decode
      obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    File "/lib64/python3.9/json/decoder.py", line 355, in raw_decode
      raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

   -1 mgr operator() Failed to run module in active mode ('cephadm')
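
The last line of the traceback narrows things down: "Expecting value: line 1 column 1 (char 0)" means the decoder failed on the very first character, so the stored value is either empty or does not begin with a JSON token. A minimal sketch that reproduces the same message with the standard `json` module (the helper name and the sample strings are made up for illustration):

```python
import json


def decode_error_message(raw: str) -> str:
    """Return the JSONDecodeError text for `raw`, or "" if it parses."""
    try:
        json.loads(raw)
    except json.JSONDecodeError as exc:
        return str(exc)
    return ""


# An empty string and a bare word both fail at char 0; "[]" is valid JSON.
for candidate in ["", "osd.12", "[]"]:
    print(repr(candidate), "->", decode_error_message(candidate) or "valid JSON")
```

This does not show what is actually stored in the cluster; it only illustrates which kinds of values produce this exact error.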

This comes from inside the MGR container, because it's Python 3.9 there.
On the hosts it's Python 3.11.

I am thinking of redeploying an MGR.

Can I stop the existing MGRs?

Redeploying with ceph orch does not work of course, but I think this
will work:

https://docs.ceph.com/en/latest/cephadm/troubleshooting/#manually-deploying-a-manager-daemon

because cephadm standalone is working. Crazy as it sounds.

What do you think?

Best,
Malte

On 17.10.24 12:49, Eugen Block wrote:
Hi,

if you just execute cephadm commands, those are issued locally on the
hosts, so they won't reveal an orchestrator issue right away.
What does the active MGR log? It could show a stack trace or error
messages which could point to a root cause.

What about the cephadm files under /var/lib/ceph/fsid? Can I replace
the latest?

Those are the cephadm versions the orchestrator actually uses; it will
just download them again from your registry (or upstream).
Can you share:

ceph -s
ceph versions
MGR logs (active MGR)

Thanks,
Eugen

Quoting Malte Stroem <malte.stroem@xxxxxxxxx>:

Hello,

I am still struggling here and do not know the root cause of this issue.

Searching the list, I found lots of people who have had the same or a
similar problem over the last few years.

However, there is no solution for our cluster.

Disabling and enabling the cephadm module does not work. There are no
error messages. When we run "ceph orch..." we get the error message:

Error ENOENT: No orchestrator configured (try `ceph orch set backend`)

But every single cephadm command works!

cephadm ls for example.

Stopping and restarting the MGRs did not help. Removing the .asok
files did not help.

I am thinking of stopping both MGRs and trying to deploy a new MGR like this:

https://docs.ceph.com/en/latest/cephadm/troubleshooting/#manually-deploying-a-manager-daemon

How could I find the root cause? Is cephadm somehow broken?

What about the cephadm files under /var/lib/ceph/fsid? Can I replace
the latest?

Best,
Malte

On 16.10.24 14:54, Malte Stroem wrote:
Hi Laimis,

that did not work. Still ceph orch does not work.

Best,
Malte

On 16.10.24 14:12, Malte Stroem wrote:
Thank you, Laimis.

And you got the same error message? That's strange.

In the meantime I am trying to check for connected clients. No Kubernetes
or CephFS, but RGWs.

Best,
Malte

On 16.10.24 14:01, Laimis Juzeliūnas wrote:
Hi Malte,

We faced this recently when upgrading to Squid from the latest Reef.
As a temporary workaround we disabled the balancer with 'ceph
balancer off' and restarted the mgr daemons.
We suspect older clients (from Kubernetes RBD mounts as well
as CephFS mounts) on servers with incompatible client versions, but
have yet to dig into it.

Best,
Laimis J.

On 16 Oct 2024, at 14:57, Malte Stroem <malte.stroem@xxxxxxxxx>
wrote:

Error ENOENT: No orchestrator configured (try `ceph orch set backend`)

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx