It looks like I did it with the following command:

$ ceph orch daemon add mgr ceph2:10.73.0.192

Now I can see two mgr daemons with the same version, 15.x:

root@ceph1:~# ceph orch ps --daemon-type mgr
NAME              HOST   STATUS         REFRESHED  AGE  VERSION  IMAGE NAME                                                                                 IMAGE ID      CONTAINER ID
mgr.ceph1.smfvfd  ceph1  running (8h)   41s ago    8h   15.2.17  quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f  1aab837306d2
mgr.ceph2.huidoh  ceph2  running (60s)  110s ago   60s  15.2.17  quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f  294fd6ab6c97

On Fri, Sep 2, 2022 at 10:19 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:

> Let's come back to the original question: how do I bring back the second mgr?
>
> root@ceph1:~# ceph orch apply mgr 2
> Scheduled mgr update...
>
> Nothing happened with the above command, and the logs say nothing useful:
>
> 2022-09-02T14:16:20.407927+0000 mgr.ceph1.smfvfd (mgr.334626) 16939 : cephadm [INF] refreshing ceph2 facts
> 2022-09-02T14:16:40.247195+0000 mgr.ceph1.smfvfd (mgr.334626) 16952 : cephadm [INF] Saving service mgr spec with placement count:2
> 2022-09-02T14:16:53.106919+0000 mgr.ceph1.smfvfd (mgr.334626) 16961 : cephadm [INF] Saving service mgr spec with placement count:2
> 2022-09-02T14:17:19.135203+0000 mgr.ceph1.smfvfd (mgr.334626) 16975 : cephadm [INF] refreshing ceph1 facts
> 2022-09-02T14:17:20.780496+0000 mgr.ceph1.smfvfd (mgr.334626) 16977 : cephadm [INF] refreshing ceph2 facts
> 2022-09-02T14:18:19.502034+0000 mgr.ceph1.smfvfd (mgr.334626) 17008 : cephadm [INF] refreshing ceph1 facts
> 2022-09-02T14:18:21.127973+0000 mgr.ceph1.smfvfd (mgr.334626) 17010 : cephadm [INF] refreshing ceph2 facts
>
> On Fri, Sep 2, 2022 at 10:15 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>
>> Hi Adam,
>>
>> Wait.. wait.. now it's suddenly working without my doing anything. Very odd.
>>
>> root@ceph1:~# ceph orch ls
>> NAME                  RUNNING  REFRESHED  AGE  PLACEMENT    IMAGE NAME                                                                                 IMAGE ID
>> alertmanager          1/1      5s ago     2w   count:1      quay.io/prometheus/alertmanager:v0.20.0                                                    0881eb8f169f
>> crash                 2/2      5s ago     2w   *            quay.io/ceph/ceph:v15                                                                      93146564743f
>> grafana               1/1      5s ago     2w   count:1      quay.io/ceph/ceph-grafana:6.7.4                                                            557c83e11646
>> mgr                   1/2      5s ago     8h   count:2      quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f
>> mon                   1/2      5s ago     8h   ceph1;ceph2  quay.io/ceph/ceph:v15                                                                      93146564743f
>> node-exporter         2/2      5s ago     2w   *            quay.io/prometheus/node-exporter:v0.18.1                                                   e5a616e4b9cf
>> osd.osd_spec_default  4/0      5s ago     -    <unmanaged>  quay.io/ceph/ceph:v15                                                                      93146564743f
>> prometheus            1/1      5s ago     2w   count:1      quay.io/prometheus/prometheus:v2.18.1
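For reference, here is a rough sketch of the two ways this got resolved in the thread, using the hostnames above. The --placement form is documented for cephadm, but the exact syntax may vary between releases, so treat this as a sketch rather than the one true procedure.

# Pin the mgr service to explicit hosts instead of relying on a bare count
ceph orch apply mgr --placement="ceph1 ceph2"

# Or add the missing daemon directly on the target host, as was done above
ceph orch daemon add mgr ceph2:10.73.0.192

# Confirm both daemons are reported
ceph orch ps --daemon-type mgr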
>> On Fri, Sep 2, 2022 at 10:13 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>
>>> I can see that in the output, but I'm not sure how to get rid of it.
>>>
>>> root@ceph1:~# ceph orch ps --refresh
>>> NAME                                                                      HOST   STATUS        REFRESHED  AGE  VERSION    IMAGE NAME                                                                                 IMAGE ID      CONTAINER ID
>>> alertmanager.ceph1                                                        ceph1  running (9h)  64s ago    2w   0.20.0     quay.io/prometheus/alertmanager:v0.20.0                                                    0881eb8f169f  ba804b555378
>>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d  ceph2  stopped       65s ago    -    <unknown>  <unknown>                                                                                  <unknown>     <unknown>
>>> crash.ceph1                                                               ceph1  running (9h)  64s ago    2w   15.2.17    quay.io/ceph/ceph:v15                                                                      93146564743f  a3a431d834fc
>>> crash.ceph2                                                               ceph2  running (9h)  65s ago    13d  15.2.17    quay.io/ceph/ceph:v15                                                                      93146564743f  3c963693ff2b
>>> grafana.ceph1                                                             ceph1  running (9h)  64s ago    2w   6.7.4      quay.io/ceph/ceph-grafana:6.7.4                                                            557c83e11646  7583a8dc4c61
>>> mgr.ceph1.smfvfd                                                          ceph1  running (8h)  64s ago    8h   15.2.17    quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f  1aab837306d2
>>> mon.ceph1                                                                 ceph1  running (9h)  64s ago    2w   15.2.17    quay.io/ceph/ceph:v15                                                                      93146564743f  c1d155d8c7ad
>>> node-exporter.ceph1                                                       ceph1  running (9h)  64s ago    2w   0.18.1     quay.io/prometheus/node-exporter:v0.18.1                                                   e5a616e4b9cf  2ff235fe0e42
>>> node-exporter.ceph2                                                       ceph2  running (9h)  65s ago    13d  0.18.1     quay.io/prometheus/node-exporter:v0.18.1                                                   e5a616e4b9cf  17678b9ba602
>>> osd.0                                                                     ceph1  running (9h)  64s ago    13d  15.2.17    quay.io/ceph/ceph:v15                                                                      93146564743f  d0fd73b777a3
>>> osd.1                                                                     ceph1  running (9h)  64s ago    13d  15.2.17    quay.io/ceph/ceph:v15                                                                      93146564743f  049120e83102
>>> osd.2                                                                     ceph2  running (9h)  65s ago    13d  15.2.17    quay.io/ceph/ceph:v15                                                                      93146564743f  8700e8cefd1f
>>> osd.3                                                                     ceph2  running (9h)  65s ago    13d  15.2.17    quay.io/ceph/ceph:v15                                                                      93146564743f  9c71bc87ed16
>>> prometheus.ceph1                                                          ceph1  running (9h)  64s ago    2w   2.18.1     quay.io/prometheus/prometheus:v2.18.1                                                      de242295e225  74a538efd61e
>>>
>>> On Fri, Sep 2, 2022 at 10:10 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>
>>>> Maybe also a "ceph orch ps --refresh"? It might still have the old cached daemon inventory from before you removed the files.
>>>>
>>>> On Fri, Sep 2, 2022 at 9:57 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>
>>>>> Hi Adam,
>>>>>
>>>>> I have deleted the file located here:
>>>>> rm /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>>
>>>>> But I'm still getting the same error. Do I need to do anything else?
>>>>>
>>>>> On Fri, Sep 2, 2022 at 9:51 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>>>
>>>>>> Okay, I'm wondering if this is an issue with a version mismatch: having previously had a 16.2.10 mgr and now having a 15.2.17 one that doesn't expect this sort of thing to be present. Either way, I'd think deleting this cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d file (and any others like it) would be the way forward to get "ceph orch ls" working again.
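For reference, the cleanup being suggested here looks roughly like the sketch below. The <fsid> and <hash> are placeholders to be filled in from the "cephadm ls" output on the affected host; this is an outline of the idea, not an exact recipe.

# On each host, list everything cephadm thinks it manages and look for
# stray "cephadm.<hash>" entries that are not real daemons
cephadm ls | grep '"name"'

# Remove the stray entry's data under the cluster fsid
rm -rf /var/lib/ceph/<fsid>/cephadm.<hash>

# Then make the orchestrator drop its cached inventory and rescan
ceph orch ps --refresh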
>>>>>> On Fri, Sep 2, 2022 at 9:44 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>
>>>>>>> Hi Adam,
>>>>>>>
>>>>>>> In "cephadm ls" I found the following service, but I believe it was there before as well:
>>>>>>>
>>>>>>> {
>>>>>>>     "style": "cephadm:v1",
>>>>>>>     "name": "cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
>>>>>>>     "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>>     "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
>>>>>>>     "enabled": false,
>>>>>>>     "state": "stopped",
>>>>>>>     "container_id": null,
>>>>>>>     "container_image_name": null,
>>>>>>>     "container_image_id": null,
>>>>>>>     "version": null,
>>>>>>>     "started": null,
>>>>>>>     "created": null,
>>>>>>>     "deployed": null,
>>>>>>>     "configured": null
>>>>>>> },
>>>>>>>
>>>>>>> It looks like the remove didn't work:
>>>>>>>
>>>>>>> root@ceph1:~# ceph orch rm cephadm
>>>>>>> Failed to remove service. <cephadm> was not found.
>>>>>>>
>>>>>>> root@ceph1:~# ceph orch rm cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>>>> Failed to remove service. <cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d> was not found.
>>>>>>>
>>>>>>> On Fri, Sep 2, 2022 at 8:27 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> This looks like an old traceback you would get if you ended up with a service type that shouldn't be there somehow. The first thing I'd check is that "cephadm ls" on either host definitely doesn't report any strange entries that aren't actually daemons in your cluster, such as "cephadm.<hash>". Another thing you could try, since I believe the assertion it's raising is for an unknown service type ("AssertionError: cephadm"), is "ceph orch rm cephadm", which might cause it to remove whatever it thinks this "cephadm" service is that it has deployed. Lastly, you could try having the mgr you manually deploy be a 16.2.10 one instead of 15.2.17 (I'm assuming here, but the line numbers in that traceback suggest Octopus). The 16.2.10 one is just much less likely to have a bug that causes something like this.
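For reference, the "manually deploy a mgr" path mentioned above (and documented in the cephadm troubleshooting page linked further down the thread) looks roughly like the sketch below. The daemon name mgr.ceph2.test is a made-up example, and exact flags and caps can differ between releases, so the doc for the installed version should be checked.

# Create a keyring for the new mgr daemon (example name)
ceph auth get-or-create mgr.ceph2.test mon 'profile mgr' osd 'allow *' mds 'allow *'

# Put that keyring plus a minimal ceph.conf into a config-json file, e.g.
# { "config": "<ceph.conf contents>", "keyring": "<keyring from above>" }

# Deploy the daemon with cephadm on the target host, pinning the image so
# the new mgr is a 16.2.10 one as suggested above
cephadm --image quay.io/ceph/ceph:v16.2.10 deploy \
    --fsid f270ad9e-1f6f-11ed-b6f8-a539d87379ea \
    --name mgr.ceph2.test \
    --config-json config-json.json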
>>>>>>>> On Fri, Sep 2, 2022 at 1:41 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>>> Now when I run "ceph orch ps" it works, but the following command throws an error. I'm trying to bring up the second mgr using the "ceph orch apply mgr" command, but it didn't help.
>>>>>>>>>
>>>>>>>>> root@ceph1:/ceph-disk# ceph version
>>>>>>>>> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
>>>>>>>>>
>>>>>>>>> root@ceph1:/ceph-disk# ceph orch ls
>>>>>>>>> Error EINVAL: Traceback (most recent call last):
>>>>>>>>>   File "/usr/share/ceph/mgr/mgr_module.py", line 1212, in _handle_command
>>>>>>>>>     return self.handle_command(inbuf, cmd)
>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in handle_command
>>>>>>>>>     return dispatch[cmd['prefix']].call(self, cmd, inbuf)
>>>>>>>>>   File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
>>>>>>>>>     return self.func(mgr, **kwargs)
>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in <lambda>
>>>>>>>>>     wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in wrapper
>>>>>>>>>     return func(*args, **kwargs)
>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/module.py", line 503, in _list_services
>>>>>>>>>     raise_if_exception(completion)
>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in raise_if_exception
>>>>>>>>>     raise e
>>>>>>>>> AssertionError: cephadm
>>>>>>>>>
>>>>>>>>> On Fri, Sep 2, 2022 at 1:32 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> > Never mind, I found the doc related to that, and I am able to get one mgr up:
>>>>>>>>> > https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon
>>>>>>>>> >
>>>>>>>>> > On Fri, Sep 2, 2022 at 1:21 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>>>> >
>>>>>>>>> >> Folks,
>>>>>>>>> >>
>>>>>>>>> >> I am having a "fun" time with cephadm, and it's very annoying to deal with.
>>>>>>>>> >>
>>>>>>>>> >> I deployed a Ceph cluster using cephadm on two nodes. When I tried to upgrade, I hit a hiccup where it upgraded only a single mgr to 16.2.10 and not the other, so I started messing around and somehow deleted both mgr daemons, thinking cephadm would recreate them.
>>>>>>>>> >>
>>>>>>>>> >> Now I don't have a single mgr, so my "ceph orch" commands hang forever; it looks like a chicken-and-egg issue.
>>>>>>>>> >>
>>>>>>>>> >> How do I recover from this? If I can't run "ceph orch" commands, I won't be able to redeploy my mgr daemons.
>>>>>>>>> >>
>>>>>>>>> >> I am not able to find any mgr with the following command on either node:
>>>>>>>>> >>
>>>>>>>>> >> $ cephadm ls | grep mgr
>>>>>>>>> >>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
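A closing note for anyone landing here from a search: once both mgr daemons are healthy again, the half-finished upgrade that started this thread can be checked and resumed. A short sketch, assuming 16.2.10 is still the intended target:

# Check whether an upgrade is still in progress
ceph orch upgrade status

# Start (or restart) the upgrade to the intended release
ceph orch upgrade start --ceph-version 16.2.10

# Afterwards, confirm all daemons report the same version
ceph versions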