Let's come back to the original question: how do I bring back the second mgr?

root@ceph1:~# ceph orch apply mgr 2
Scheduled mgr update...

Nothing happened after the above command; the logs show nothing beyond the
spec being saved:

2022-09-02T14:16:20.407927+0000 mgr.ceph1.smfvfd (mgr.334626) 16939 : cephadm [INF] refreshing ceph2 facts
2022-09-02T14:16:40.247195+0000 mgr.ceph1.smfvfd (mgr.334626) 16952 : cephadm [INF] Saving service mgr spec with placement count:2
2022-09-02T14:16:53.106919+0000 mgr.ceph1.smfvfd (mgr.334626) 16961 : cephadm [INF] Saving service mgr spec with placement count:2
2022-09-02T14:17:19.135203+0000 mgr.ceph1.smfvfd (mgr.334626) 16975 : cephadm [INF] refreshing ceph1 facts
2022-09-02T14:17:20.780496+0000 mgr.ceph1.smfvfd (mgr.334626) 16977 : cephadm [INF] refreshing ceph2 facts
2022-09-02T14:18:19.502034+0000 mgr.ceph1.smfvfd (mgr.334626) 17008 : cephadm [INF] refreshing ceph1 facts
2022-09-02T14:18:21.127973+0000 mgr.ceph1.smfvfd (mgr.334626) 17010 : cephadm [INF] refreshing ceph2 facts

On Fri, Sep 2, 2022 at 10:15 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:

> Hi Adam,
>
> Wait... wait... now it's suddenly working without my doing anything. Very odd.
>
> root@ceph1:~# ceph orch ls
> NAME                  RUNNING  REFRESHED  AGE  PLACEMENT    IMAGE NAME                                                                                 IMAGE ID
> alertmanager          1/1      5s ago     2w   count:1      quay.io/prometheus/alertmanager:v0.20.0                                                    0881eb8f169f
> crash                 2/2      5s ago     2w   *            quay.io/ceph/ceph:v15                                                                      93146564743f
> grafana               1/1      5s ago     2w   count:1      quay.io/ceph/ceph-grafana:6.7.4                                                            557c83e11646
> mgr                   1/2      5s ago     8h   count:2      quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f
> mon                   1/2      5s ago     8h   ceph1;ceph2  quay.io/ceph/ceph:v15                                                                      93146564743f
> node-exporter         2/2      5s ago     2w   *            quay.io/prometheus/node-exporter:v0.18.1                                                   e5a616e4b9cf
> osd.osd_spec_default  4/0      5s ago     -    <unmanaged>  quay.io/ceph/ceph:v15                                                                      93146564743f
> prometheus            1/1      5s ago     2w   count:1      quay.io/prometheus/prometheus:v2.18.1
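One thing worth trying when a bare count:2 stalls like this is to pin the
mgr service to explicit hosts, so the scheduler knows exactly where the
second daemon belongs. A minimal sketch, assuming the hosts are named ceph1
and ceph2 as in the listings above:

    # Pin the mgr spec to both hosts instead of an anonymous count:2.
    ceph orch apply mgr --placement="ceph1;ceph2"

    # Confirm the spec was saved and watch for the second daemon to appear.
    ceph orch ls mgr
    ceph orch ps --daemon-type mgr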
> On Fri, Sep 2, 2022 at 10:13 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>
>> I can see it in the output, but I'm not sure how to get rid of it.
>>
>> root@ceph1:~# ceph orch ps --refresh
>> NAME                                                                      HOST   STATUS        REFRESHED  AGE  VERSION    IMAGE NAME                                 IMAGE ID      CONTAINER ID
>> alertmanager.ceph1                                                        ceph1  running (9h)  64s ago    2w   0.20.0     quay.io/prometheus/alertmanager:v0.20.0    0881eb8f169f  ba804b555378
>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d  ceph2  stopped       65s ago    -    <unknown>  <unknown>                                  <unknown>     <unknown>
>> crash.ceph1                                                               ceph1  running (9h)  64s ago    2w   15.2.17    quay.io/ceph/ceph:v15                      93146564743f  a3a431d834fc
>> crash.ceph2                                                               ceph2  running (9h)  65s ago    13d  15.2.17    quay.io/ceph/ceph:v15                      93146564743f  3c963693ff2b
>> grafana.ceph1                                                             ceph1  running (9h)  64s ago    2w   6.7.4      quay.io/ceph/ceph-grafana:6.7.4            557c83e11646  7583a8dc4c61
>> mgr.ceph1.smfvfd                                                          ceph1  running (8h)  64s ago    8h   15.2.17    quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f  1aab837306d2
>> mon.ceph1                                                                 ceph1  running (9h)  64s ago    2w   15.2.17    quay.io/ceph/ceph:v15                      93146564743f  c1d155d8c7ad
>> node-exporter.ceph1                                                       ceph1  running (9h)  64s ago    2w   0.18.1     quay.io/prometheus/node-exporter:v0.18.1   e5a616e4b9cf  2ff235fe0e42
>> node-exporter.ceph2                                                       ceph2  running (9h)  65s ago    13d  0.18.1     quay.io/prometheus/node-exporter:v0.18.1   e5a616e4b9cf  17678b9ba602
>> osd.0                                                                     ceph1  running (9h)  64s ago    13d  15.2.17    quay.io/ceph/ceph:v15                      93146564743f  d0fd73b777a3
>> osd.1                                                                     ceph1  running (9h)  64s ago    13d  15.2.17    quay.io/ceph/ceph:v15                      93146564743f  049120e83102
>> osd.2                                                                     ceph2  running (9h)  65s ago    13d  15.2.17    quay.io/ceph/ceph:v15                      93146564743f  8700e8cefd1f
>> osd.3                                                                     ceph2  running (9h)  65s ago    13d  15.2.17    quay.io/ceph/ceph:v15                      93146564743f  9c71bc87ed16
>> prometheus.ceph1                                                          ceph1  running (9h)  64s ago    2w   2.18.1     quay.io/prometheus/prometheus:v2.18.1      de242295e225  74a538efd61e
>>
>> On Fri, Sep 2, 2022 at 10:10 AM Adam King <adking@xxxxxxxxxx> wrote:
>>
>>> Maybe also a "ceph orch ps --refresh"? It might still have the old
>>> cached daemon inventory from before you removed the files.
>>>
>>> On Fri, Sep 2, 2022 at 9:57 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>
>>>> Hi Adam,
>>>>
>>>> I have deleted the file located here:
>>>>
>>>> rm /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>
>>>> But I'm still getting the same error. Do I need to do anything else?
>>>>
>>>> On Fri, Sep 2, 2022 at 9:51 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>>
>>>>> Okay, I'm wondering if this is an issue with a version mismatch:
>>>>> having previously had a 16.2.10 mgr and now having a 15.2.17 one that
>>>>> doesn't expect this sort of thing to be present. Either way, I'd think
>>>>> just deleting this
>>>>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>> file (and any others like it) would be the way forward to get "ceph
>>>>> orch ls" working again.
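If deleting the daemon directory by hand doesn't clear the phantom entry,
cephadm can also remove it itself and clean up the systemd unit at the same
time. A sketch, run on the host that reports the stray entry (ceph2 here),
with the name and fsid taken from the "cephadm ls" output quoted below:

    # Remove the phantom daemon record and its systemd unit on that host.
    cephadm rm-daemon \
        --fsid f270ad9e-1f6f-11ed-b6f8-a539d87379ea \
        --name cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d \
        --force

    # Then force the orchestrator to re-scan so the cached entry drops out.
    ceph orch ps --refresh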
>>>>> On Fri, Sep 2, 2022 at 9:44 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>
>>>>>> Hi Adam,
>>>>>>
>>>>>> In "cephadm ls" I found the following service, but I believe it was
>>>>>> there before as well:
>>>>>>
>>>>>> {
>>>>>>     "style": "cephadm:v1",
>>>>>>     "name": "cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
>>>>>>     "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>     "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
>>>>>>     "enabled": false,
>>>>>>     "state": "stopped",
>>>>>>     "container_id": null,
>>>>>>     "container_image_name": null,
>>>>>>     "container_image_id": null,
>>>>>>     "version": null,
>>>>>>     "started": null,
>>>>>>     "created": null,
>>>>>>     "deployed": null,
>>>>>>     "configured": null
>>>>>> },
>>>>>>
>>>>>> It looks like the remove didn't work:
>>>>>>
>>>>>> root@ceph1:~# ceph orch rm cephadm
>>>>>> Failed to remove service. <cephadm> was not found.
>>>>>>
>>>>>> root@ceph1:~# ceph orch rm cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>>> Failed to remove service. <cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d> was not found.
>>>>>>
>>>>>> On Fri, Sep 2, 2022 at 8:27 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>>>>
>>>>>>> This looks like an old traceback you would get if you somehow ended
>>>>>>> up with a service type that shouldn't be there. The first thing I'd
>>>>>>> check is that "cephadm ls" on either host definitely doesn't report
>>>>>>> any strange things that aren't actually daemons in your cluster,
>>>>>>> such as "cephadm.<hash>". Another thing you could try, as I believe
>>>>>>> the assertion it's giving is for an unknown service type here
>>>>>>> ("AssertionError: cephadm"), is just "ceph orch rm cephadm", which
>>>>>>> would maybe cause it to remove whatever it thinks is this "cephadm"
>>>>>>> service that it has deployed. Lastly, you could try having the mgr
>>>>>>> you manually deploy be a 16.2.10 one instead of 15.2.17 (I'm
>>>>>>> assuming here, but the line numbers in that traceback suggest
>>>>>>> Octopus). The 16.2.10 one is just much less likely to have a bug
>>>>>>> that causes something like this.
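The manual mgr deploy Adam refers to is the procedure from the cephadm
troubleshooting guide (linked further down in this thread). A condensed
sketch, assuming a hypothetical daemon name of mgr.ceph1.test and the
16.2.10 image; see the guide for the exact config-json contents:

    # Create a keyring for the new mgr and generate a minimal ceph.conf.
    ceph auth get-or-create mgr.ceph1.test mon "profile mgr" osd "allow *" mds "allow *"
    ceph config generate-minimal-conf

    # Put the minimal config and keyring into config-json.json in the form
    # {"config": "<minimal conf>", "keyring": "<mgr keyring>"}, then deploy
    # the daemon with cephadm, pinning the newer image.
    cephadm --image quay.io/ceph/ceph:v16.2.10 deploy \
        --fsid f270ad9e-1f6f-11ed-b6f8-a539d87379ea \
        --name mgr.ceph1.test \
        --config-json config-json.json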
>>>>>>> On Fri, Sep 2, 2022 at 1:41 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> Now when I run "ceph orch ps" it works, but the following command
>>>>>>>> throws an error. Trying to bring up the second mgr with the "ceph
>>>>>>>> orch apply mgr" command didn't help either.
>>>>>>>>
>>>>>>>> root@ceph1:/ceph-disk# ceph version
>>>>>>>> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
>>>>>>>>
>>>>>>>> root@ceph1:/ceph-disk# ceph orch ls
>>>>>>>> Error EINVAL: Traceback (most recent call last):
>>>>>>>>   File "/usr/share/ceph/mgr/mgr_module.py", line 1212, in _handle_command
>>>>>>>>     return self.handle_command(inbuf, cmd)
>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in handle_command
>>>>>>>>     return dispatch[cmd['prefix']].call(self, cmd, inbuf)
>>>>>>>>   File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
>>>>>>>>     return self.func(mgr, **kwargs)
>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in <lambda>
>>>>>>>>     wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in wrapper
>>>>>>>>     return func(*args, **kwargs)
>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/module.py", line 503, in _list_services
>>>>>>>>     raise_if_exception(completion)
>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in raise_if_exception
>>>>>>>>     raise e
>>>>>>>> AssertionError: cephadm
>>>>>>>>
>>>>>>>> On Fri, Sep 2, 2022 at 1:32 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> > Never mind, I found the doc about this, and I was able to get one mgr up:
>>>>>>>> > https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon
>>>>>>>> >
>>>>>>>> > On Fri, Sep 2, 2022 at 1:21 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>>> >
>>>>>>>> >> Folks,
>>>>>>>> >>
>>>>>>>> >> I am having a "fun" time with cephadm, and it's very annoying to deal
>>>>>>>> >> with.
>>>>>>>> >>
>>>>>>>> >> I have deployed a Ceph cluster using cephadm on two nodes. When I was
>>>>>>>> >> trying to upgrade, I hit a hiccup where it upgraded only a single mgr
>>>>>>>> >> to 16.2.10 but not the other, so I started messing around and somehow
>>>>>>>> >> deleted both mgr daemons, thinking cephadm would recreate them.
>>>>>>>> >>
>>>>>>>> >> Now I don't have a single mgr, so my "ceph orch" commands hang forever;
>>>>>>>> >> it looks like a chicken-and-egg issue.
>>>>>>>> >>
>>>>>>>> >> How do I recover from this? If I can't run the "ceph orch" command, I
>>>>>>>> >> won't be able to redeploy my mgr daemons.
>>>>>>>> >>
>>>>>>>> >> I am not able to find any mgr with the following command on either node:
>>>>>>>> >>
>>>>>>>> >> $ cephadm ls | grep mgr

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx