Hi Adam,

Wait... wait... now it's suddenly working without my doing anything. Very odd.

root@ceph1:~# ceph orch ls
NAME                  RUNNING  REFRESHED  AGE  PLACEMENT    IMAGE NAME                                                                                 IMAGE ID
alertmanager          1/1      5s ago     2w   count:1      quay.io/prometheus/alertmanager:v0.20.0                                                    0881eb8f169f
crash                 2/2      5s ago     2w   *            quay.io/ceph/ceph:v15                                                                      93146564743f
grafana               1/1      5s ago     2w   count:1      quay.io/ceph/ceph-grafana:6.7.4                                                            557c83e11646
mgr                   1/2      5s ago     8h   count:2      quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f
mon                   1/2      5s ago     8h   ceph1;ceph2  quay.io/ceph/ceph:v15                                                                      93146564743f
node-exporter         2/2      5s ago     2w   *            quay.io/prometheus/node-exporter:v0.18.1                                                   e5a616e4b9cf
osd.osd_spec_default  4/0      5s ago     -    <unmanaged>  quay.io/ceph/ceph:v15                                                                      93146564743f
prometheus            1/1      5s ago     2w   count:1      quay.io/prometheus/prometheus:v2.18.1

On Fri, Sep 2, 2022 at 10:13 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:

> I can see it in the output, but I'm not sure how to get rid of it.
>
> root@ceph1:~# ceph orch ps --refresh
> NAME  HOST  STATUS  REFRESHED  AGE  VERSION  IMAGE NAME  IMAGE ID  CONTAINER ID
> alertmanager.ceph1  ceph1  running (9h)  64s ago  2w  0.20.0  quay.io/prometheus/alertmanager:v0.20.0  0881eb8f169f  ba804b555378
> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d  ceph2  stopped  65s ago  -  <unknown>  <unknown>  <unknown>  <unknown>
> crash.ceph1  ceph1  running (9h)  64s ago  2w  15.2.17  quay.io/ceph/ceph:v15  93146564743f  a3a431d834fc
> crash.ceph2  ceph2  running (9h)  65s ago  13d  15.2.17  quay.io/ceph/ceph:v15  93146564743f  3c963693ff2b
> grafana.ceph1  ceph1  running (9h)  64s ago  2w  6.7.4  quay.io/ceph/ceph-grafana:6.7.4  557c83e11646  7583a8dc4c61
> mgr.ceph1.smfvfd  ceph1  running (8h)  64s ago  8h  15.2.17  quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f  1aab837306d2
> mon.ceph1  ceph1  running (9h)  64s ago  2w  15.2.17  quay.io/ceph/ceph:v15  93146564743f  c1d155d8c7ad
> node-exporter.ceph1  ceph1  running (9h)  64s ago  2w  0.18.1  quay.io/prometheus/node-exporter:v0.18.1  e5a616e4b9cf  2ff235fe0e42
> node-exporter.ceph2  ceph2  running (9h)  65s ago  13d  0.18.1  quay.io/prometheus/node-exporter:v0.18.1  e5a616e4b9cf  17678b9ba602
> osd.0  ceph1  running (9h)  64s ago  13d  15.2.17  quay.io/ceph/ceph:v15  93146564743f  d0fd73b777a3
> osd.1  ceph1  running (9h)  64s ago  13d  15.2.17  quay.io/ceph/ceph:v15  93146564743f  049120e83102
> osd.2  ceph2  running (9h)  65s ago  13d  15.2.17  quay.io/ceph/ceph:v15  93146564743f  8700e8cefd1f
> osd.3  ceph2  running (9h)  65s ago  13d  15.2.17  quay.io/ceph/ceph:v15  93146564743f  9c71bc87ed16
> prometheus.ceph1  ceph1  running (9h)  64s ago  2w  2.18.1  quay.io/prometheus/prometheus:v2.18.1  de242295e225  74a538efd61e
>
> On Fri, Sep 2, 2022 at 10:10 AM Adam King <adking@xxxxxxxxxx> wrote:
>
>> Maybe also a "ceph orch ps --refresh"? It might still have the old cached
>> daemon inventory from before you removed the files.
>>
>> On Fri, Sep 2, 2022 at 9:57 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>
>>> Hi Adam,
>>>
>>> I have deleted the file located here:
>>> rm /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>
>>> But I'm still getting the same error. Do I need to do anything else?
>>>
>>> On Fri, Sep 2, 2022 at 9:51 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>
>>>> Okay, I'm wondering if this is an issue with a version mismatch: you
>>>> previously had a 16.2.10 mgr and now have a 15.2.17 one that doesn't
>>>> expect this sort of thing to be present. Either way, I'd think just
>>>> deleting this
>>>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>> (and any others like it) file would be the way forward to get orch ls
>>>> working again.
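
A rough sketch of the cleanup Adam describes, for anyone hitting the same thing. The fsid and hash below are the ones from this thread, so substitute your own; treat this as an outline rather than verified output:

    # on the host that reports the stray entry, see what cephadm thinks is deployed
    cephadm ls | grep 'cephadm\.'

    # in this thread the stray item was a plain file directly under the cluster's directory
    ls /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ | grep '^cephadm\.'

    # remove it, then ask the orchestrator to drop its cached daemon inventory
    rm /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
    ceph orch ps --refresh
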
>>>>
>>>> On Fri, Sep 2, 2022 at 9:44 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>
>>>>> Hi Adam,
>>>>>
>>>>> In "cephadm ls" I found the following service, but I believe it was
>>>>> there before as well:
>>>>>
>>>>>     {
>>>>>         "style": "cephadm:v1",
>>>>>         "name": "cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
>>>>>         "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>         "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
>>>>>         "enabled": false,
>>>>>         "state": "stopped",
>>>>>         "container_id": null,
>>>>>         "container_image_name": null,
>>>>>         "container_image_id": null,
>>>>>         "version": null,
>>>>>         "started": null,
>>>>>         "created": null,
>>>>>         "deployed": null,
>>>>>         "configured": null
>>>>>     },
>>>>>
>>>>> It looks like the remove didn't work:
>>>>>
>>>>> root@ceph1:~# ceph orch rm cephadm
>>>>> Failed to remove service. <cephadm> was not found.
>>>>>
>>>>> root@ceph1:~# ceph orch rm cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>> Failed to remove service. <cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d> was not found.
>>>>>
>>>>> On Fri, Sep 2, 2022 at 8:27 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>>>
>>>>>> This looks like an old traceback you would get if you somehow ended up
>>>>>> with a service type that shouldn't be there. The first thing I'd check
>>>>>> is that "cephadm ls" on either host definitely doesn't report any
>>>>>> strange things that aren't actually daemons in your cluster, such as
>>>>>> "cephadm.<hash>". Another thing you could try, since I believe the
>>>>>> assertion it's raising is for an unknown service type here
>>>>>> ("AssertionError: cephadm"), is just "ceph orch rm cephadm", which might
>>>>>> cause it to remove whatever it thinks this deployed "cephadm" service
>>>>>> is. Lastly, you could try having the mgr you manually deploy be a
>>>>>> 16.2.10 one instead of 15.2.17 (I'm assuming versions here, but the line
>>>>>> numbers in that traceback suggest Octopus). The 16.2.10 one is just much
>>>>>> less likely to have a bug that causes something like this.
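
A quick way to do the "cephadm ls" check Adam mentions, as a sketch (it assumes jq is installed on the host; a python3 one-liner works as a fallback):

    # print just the names cephadm reports; anything that isn't a real daemon,
    # e.g. cephadm.<hash>, stands out immediately
    cephadm ls | jq -r '.[].name'

    # same thing without jq
    cephadm ls | python3 -c 'import json,sys; print("\n".join(d["name"] for d in json.load(sys.stdin)))'
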
>>>>>>
>>>>>> On Fri, Sep 2, 2022 at 1:41 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>
>>>>>>> Now when I run "ceph orch ps" it works, but the following command
>>>>>>> throws an error. I'm trying to bring up a second mgr with "ceph orch
>>>>>>> apply mgr", but that didn't help.
>>>>>>>
>>>>>>> root@ceph1:/ceph-disk# ceph version
>>>>>>> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
>>>>>>>
>>>>>>> root@ceph1:/ceph-disk# ceph orch ls
>>>>>>> Error EINVAL: Traceback (most recent call last):
>>>>>>>   File "/usr/share/ceph/mgr/mgr_module.py", line 1212, in _handle_command
>>>>>>>     return self.handle_command(inbuf, cmd)
>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in handle_command
>>>>>>>     return dispatch[cmd['prefix']].call(self, cmd, inbuf)
>>>>>>>   File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
>>>>>>>     return self.func(mgr, **kwargs)
>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in <lambda>
>>>>>>>     wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in wrapper
>>>>>>>     return func(*args, **kwargs)
>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/module.py", line 503, in _list_services
>>>>>>>     raise_if_exception(completion)
>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in raise_if_exception
>>>>>>>     raise e
>>>>>>> AssertionError: cephadm
>>>>>>>
>>>>>>> On Fri, Sep 2, 2022 at 1:32 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>> > Never mind, I found the doc for that and was able to get one mgr up:
>>>>>>> > https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon
>>>>>>> >
>>>>>>> > On Fri, Sep 2, 2022 at 1:21 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>> >
>>>>>>> >> Folks,
>>>>>>> >>
>>>>>>> >> I am having a rough time with cephadm; it is very annoying to deal
>>>>>>> >> with.
>>>>>>> >>
>>>>>>> >> I deployed a Ceph cluster using cephadm on two nodes. When I tried to
>>>>>>> >> upgrade, I hit a hiccup: it upgraded only a single mgr to 16.2.10 and
>>>>>>> >> not the other, so I started messing around and somehow deleted both
>>>>>>> >> mgrs, thinking cephadm would recreate them.
>>>>>>> >>
>>>>>>> >> Now I don't have a single mgr, so my "ceph orch" commands hang
>>>>>>> >> forever; it looks like a chicken-and-egg issue.
>>>>>>> >>
>>>>>>> >> How do I recover from this? If I can't run "ceph orch" commands, I
>>>>>>> >> won't be able to redeploy my mgr daemons.
>>>>>>> >>
>>>>>>> >> I am not able to find any mgr with the following command on either node:
>>>>>>> >>
>>>>>>> >> $ cephadm ls | grep mgr
>>>>>>> >>
>>>>>>> >
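
For anyone who ends up in the same no-mgr, chicken-and-egg state, the manual recovery on the troubleshooting page linked above goes roughly like this. Treat it as a sketch reconstructed from that page rather than a tested recipe: the daemon name mgr.ceph1.smfvfd and the fsid are borrowed from this thread, and exact flags can vary between releases, so follow the docs for your version:

    # stop cephadm from tearing down the manually created mgr while you recover
    ceph config-key set mgr/cephadm/pause true

    # create an auth entry and grab a minimal ceph.conf for the new daemon
    ceph auth get-or-create mgr.ceph1.smfvfd mon "profile mgr" osd "allow *" mds "allow *"
    ceph config generate-minimal-conf

    # find out which container image the cluster expects
    # (per Adam's suggestion, a 16.2.10 image such as quay.io/ceph/ceph:v16.2.10 may be the safer choice)
    ceph config get mgr.ceph1.smfvfd container_image

    # put the minimal conf and the keyring into a config-json.json file
    # ({"config": "...", "keyring": "..."}), then deploy the daemon on the chosen host
    cephadm --image <container-image> deploy \
        --fsid f270ad9e-1f6f-11ed-b6f8-a539d87379ea \
        --name mgr.ceph1.smfvfd \
        --config-json config-json.json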