So basically it's failing here:

> self.to_remove_osds.load_from_store()

This function is responsible for loading Specs from the mon-store. The
information is stored in JSON format, and it seems the stored JSON for the
OSD(s) is not valid for some reason. You can see what's stored in the
mon-store by running:

> ceph config-key dump

Don't share that output publicly here, especially if it's a production
cluster, as it may contain sensitive information about your cluster.
There's a rough sketch in the P.S. below for narrowing down which stored
value is the broken one.

Best,
Redo.
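P.S. A minimal sketch of one way to spot the offending entry, assuming `ceph
config-key dump` prints its usual JSON map of keys to string values, and that
the removal queue lives under a key such as mgr/cephadm/osd_remove_queue (the
exact key name may differ between releases, so treat it only as an example):

#!/usr/bin/env python3
# Rough sketch: list cephadm config-keys whose stored value is not valid JSON.
# Assumes the `ceph` CLI is available on the host; the key name
# "mgr/cephadm/osd_remove_queue" mentioned above is only an example.
import json
import subprocess

# `ceph config-key dump` prints one JSON object mapping key names to values.
out = subprocess.check_output(["ceph", "config-key", "dump"])
stored = json.loads(out)

for key, value in stored.items():
    if not key.startswith("mgr/cephadm/"):
        continue
    try:
        json.loads(value)
    except (TypeError, ValueError):
        # An empty or truncated value here is exactly what makes
        # load_from_store() raise "Expecting value: line 1 column 1".
        print(f"not valid JSON: {key} = {value!r}")

If that flags something like the OSD removal queue with an empty value, one
possible recovery is to save a copy of the dump, remove the broken key with
`ceph config-key rm <key>`, and re-enable the cephadm module. But please
verify against your own cluster before deleting anything.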
On Thu, Oct 17, 2024 at 5:04 PM Malte Stroem <malte.stroem@xxxxxxxxx> wrote:
> Thanks Eugen & Redouane,
>
> of course I tried enabling and disabling the cephadm module for the MGRs.
>
> Running ceph mgr module enable cephadm produces this output in the MGR log:
>
> -1 mgr load Failed to construct class in 'cephadm'
> -1 mgr load Traceback (most recent call last):
>   File "/usr/share/ceph/mgr/cephadm/module.py", line 619, in __init__
>     self.to_remove_osds.load_from_store()
>   File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 922, in load_from_store
>     for osd in json.loads(v):
>   File "/lib64/python3.9/json/__init__.py", line 346, in loads
>     return _default_decoder.decode(s)
>   File "/lib64/python3.9/json/decoder.py", line 337, in decode
>     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
>   File "/lib64/python3.9/json/decoder.py", line 355, in raw_decode
>     raise JSONDecodeError("Expecting value", s, err.value) from None
> json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
>
> -1 mgr operator() Failed to run module in active mode ('cephadm')
>
> This comes from inside the MGR container because it's Python 3.9. On the
> hosts it's Python 3.11.
>
> I'm thinking of redeploying an MGR.
>
> Can I stop the existing MGRs?
>
> Redeploying with ceph orch does not work, of course, but I think this
> will work:
>
> https://docs.ceph.com/en/latest/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
>
> because cephadm standalone is working. Crazy as it sounds.
>
> What do you think?
>
> Best,
> Malte
>
> On 17.10.24 12:49, Eugen Block wrote:
> > Hi,
> >
> > if you just execute cephadm commands, those are issued locally on the
> > hosts, so they won't confirm an orchestrator issue immediately.
> > What does the active MGR log show? It could contain a stack trace or
> > error messages which could point to a root cause.
> >
> >> What about the cephadm files under /var/lib/ceph/fsid? Can I replace
> >> the latest?
> >
> > Those are the cephadm versions the orchestrator actually uses; it will
> > just download them again from your registry (or upstream).
> > Can you share:
> >
> > ceph -s
> > ceph versions
> > MGR logs (active MGR)
> >
> > Thanks,
> > Eugen
> >
> > Quoting Malte Stroem <malte.stroem@xxxxxxxxx>:
> >
> >> Hello,
> >>
> >> I am still struggling here and do not know the root cause of this issue.
> >>
> >> Searching the list I found lots of people who had the same or a
> >> similar problem over the last few years.
> >>
> >> However, there is no solution for our cluster.
> >>
> >> Disabling and enabling the cephadm module does not work. There are no
> >> error messages. When we run "ceph orch..." we get the error message:
> >>
> >> Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
> >>
> >> But every single cephadm command works!
> >>
> >> cephadm ls for example.
> >>
> >> Stopping and restarting the MGRs did not help. Removing the .asok
> >> files did not help.
> >>
> >> I'm thinking of stopping both MGRs and trying to deploy a new MGR like this:
> >>
> >> https://docs.ceph.com/en/latest/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
> >>
> >> How could I find the root cause? Is cephadm somehow broken?
> >>
> >> What about the cephadm files under /var/lib/ceph/fsid? Can I replace
> >> the latest?
> >>
> >> Best,
> >> Malte
> >>
> >> On 16.10.24 14:54, Malte Stroem wrote:
> >>> Hi Laimis,
> >>>
> >>> that did not work. ceph orch still does not work.
> >>>
> >>> Best,
> >>> Malte
> >>>
> >>> On 16.10.24 14:12, Malte Stroem wrote:
> >>>> Thank you, Laimis.
> >>>>
> >>>> And you got the same error message? That's strange.
> >>>>
> >>>> In the meantime I'll check for connected clients. No Kubernetes
> >>>> and no CephFS, but RGWs.
> >>>>
> >>>> Best,
> >>>> Malte
> >>>>
> >>>> On 16.10.24 14:01, Laimis Juzeliūnas wrote:
> >>>>> Hi Malte,
> >>>>>
> >>>>> We have faced this recently when upgrading to Squid from the latest Reef.
> >>>>> As a temporary workaround we disabled the balancer with ‘ceph
> >>>>> balancer off’ and restarted the mgr daemons.
> >>>>> We suspect older clients (from Kubernetes RBD mounts as well as
> >>>>> CephFS mounts) on servers with incompatible client versions but
> >>>>> have yet to dig through it.
> >>>>>
> >>>>> Best,
> >>>>> Laimis J.
> >>>>>
> >>>>>> On 16 Oct 2024, at 14:57, Malte Stroem <malte.stroem@xxxxxxxxx> wrote:
> >>>>>>
> >>>>>> Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx