Re: [cephadm] mgr: no daemons active

Hi Adam,

I ran the following command to start the upgrade, but it looks like nothing
is happening:

$ ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.10

The status message is empty:

root@ceph1:~# ceph orch upgrade status
{
    "target_image": "quay.io/ceph/ceph:v16.2.10",
    "in_progress": true,
    "services_complete": [],
    "message": ""
}

Nothing in the logs:

root@ceph1:~# tail -f
/var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log
2022-09-02T14:31:52.597661+0000 mgr.ceph2.huidoh (mgr.344392) 174 : cephadm
[INF] refreshing ceph2 facts
2022-09-02T14:31:52.991450+0000 mgr.ceph2.huidoh (mgr.344392) 176 : cephadm
[INF] refreshing ceph1 facts
2022-09-02T14:32:52.965092+0000 mgr.ceph2.huidoh (mgr.344392) 207 : cephadm
[INF] refreshing ceph2 facts
2022-09-02T14:32:53.369789+0000 mgr.ceph2.huidoh (mgr.344392) 208 : cephadm
[INF] refreshing ceph1 facts
2022-09-02T14:33:53.367986+0000 mgr.ceph2.huidoh (mgr.344392) 239 : cephadm
[INF] refreshing ceph2 facts
2022-09-02T14:33:53.760427+0000 mgr.ceph2.huidoh (mgr.344392) 240 : cephadm
[INF] refreshing ceph1 facts
2022-09-02T14:34:53.754277+0000 mgr.ceph2.huidoh (mgr.344392) 272 : cephadm
[INF] refreshing ceph2 facts
2022-09-02T14:34:54.162503+0000 mgr.ceph2.huidoh (mgr.344392) 273 : cephadm
[INF] refreshing ceph1 facts
2022-09-02T14:35:54.133467+0000 mgr.ceph2.huidoh (mgr.344392) 305 : cephadm
[INF] refreshing ceph2 facts
2022-09-02T14:35:54.522171+0000 mgr.ceph2.huidoh (mgr.344392) 306 : cephadm
[INF] refreshing ceph1 facts
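
The default log level only shows these periodic "refreshing facts" entries.
To see what the cephadm module is actually doing with the upgrade task, its
logging can be turned up and watched live; a minimal sketch, assuming the
default mgr/cephadm settings:

$ ceph config set mgr mgr/cephadm/log_to_cluster_level debug   # verbose cephadm events
$ ceph -W cephadm --watch-debug                                # stream them from the cluster log
$ ceph orch upgrade status                                     # re-check progress afterwards
$ ceph config set mgr mgr/cephadm/log_to_cluster_level info    # turn verbosity back down when done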

The in-progress message has been stuck there for a long time:

root@ceph1:~# ceph -s
  cluster:
    id:     f270ad9e-1f6f-11ed-b6f8-a539d87379ea
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum ceph1 (age 9h)
    mgr: ceph2.huidoh(active, since 9m), standbys: ceph1.smfvfd
    osd: 4 osds: 4 up (since 9h), 4 in (since 11h)

  data:
    pools:   5 pools, 129 pgs
    objects: 20.06k objects, 83 GiB
    usage:   168 GiB used, 632 GiB / 800 GiB avail
    pgs:     129 active+clean

  io:
    client:   12 KiB/s wr, 0 op/s rd, 1 op/s wr

  progress:
    Upgrade to quay.io/ceph/ceph:v16.2.10 (0s)
      [............................]
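
For reference, when the progress bar never moves off 0s, two generic nudges
that are sometimes tried (hedged suggestions only, not a confirmed fix for
this cluster) are pausing and resuming the upgrade, or failing over to the
standby mgr so a fresh daemon re-reads the upgrade state:

$ ceph orch upgrade pause && ceph orch upgrade resume   # restart the upgrade task
$ ceph mgr fail ceph2.huidoh                            # fail the active mgr; ceph1.smfvfd takes over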

On Fri, Sep 2, 2022 at 10:25 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:

> It looks like I did it with the following command:
>
> $ ceph orch daemon add mgr ceph2:10.73.0.192
>
> Now I can see two with the same version, 15.x:
>
> root@ceph1:~# ceph orch ps --daemon-type mgr
> NAME              HOST   STATUS         REFRESHED  AGE  VERSION  IMAGE NAME                                                                                   IMAGE ID      CONTAINER ID
> mgr.ceph1.smfvfd  ceph1  running (8h)   41s ago    8h   15.2.17  quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca   93146564743f  1aab837306d2
> mgr.ceph2.huidoh  ceph2  running (60s)  110s ago   60s  15.2.17  quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca   93146564743f  294fd6ab6c97
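>
> As an aside, the same end state can also be described declaratively through
> the placement spec instead of adding the daemon by hand; a sketch, assuming
> the orchestrator is responsive:
>
> $ ceph orch apply mgr --placement "ceph1;ceph2"   # pin one mgr to each host
> $ ceph orch ps --daemon-type mgr                  # confirm both daemons come up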
>
> On Fri, Sep 2, 2022 at 10:19 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>
>> Let's come back to the original question: how to bring back the second
>> mgr?
>>
>> root@ceph1:~# ceph orch apply mgr 2
>> Scheduled mgr update...
>>
>> Nothing happened with the above command; the logs show nothing new:
>>
>> 2022-09-02T14:16:20.407927+0000 mgr.ceph1.smfvfd (mgr.334626) 16939 :
>> cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T14:16:40.247195+0000 mgr.ceph1.smfvfd (mgr.334626) 16952 :
>> cephadm [INF] Saving service mgr spec with placement count:2
>> 2022-09-02T14:16:53.106919+0000 mgr.ceph1.smfvfd (mgr.334626) 16961 :
>> cephadm [INF] Saving service mgr spec with placement count:2
>> 2022-09-02T14:17:19.135203+0000 mgr.ceph1.smfvfd (mgr.334626) 16975 :
>> cephadm [INF] refreshing ceph1 facts
>> 2022-09-02T14:17:20.780496+0000 mgr.ceph1.smfvfd (mgr.334626) 16977 :
>> cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T14:18:19.502034+0000 mgr.ceph1.smfvfd (mgr.334626) 17008 :
>> cephadm [INF] refreshing ceph1 facts
>> 2022-09-02T14:18:21.127973+0000 mgr.ceph1.smfvfd (mgr.334626) 17010 :
>> cephadm [INF] refreshing ceph2 facts
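>>
>> The "Saving service mgr spec with placement count:2" lines suggest the spec
>> was stored; a sketch of how to double-check what the orchestrator thinks it
>> should be running (assuming the --export flag is available in this release):
>>
>> $ ceph orch ls mgr --export          # dump the stored mgr service spec as YAML
>> $ ceph orch ps --daemon-type mgr     # see whether a second daemon was scheduled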
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Sep 2, 2022 at 10:15 AM Satish Patel <satish.txt@xxxxxxxxx>
>> wrote:
>>
>>> Hi Adam,
>>>
>>> Wait... now it's suddenly working without my doing anything. Very odd:
>>>
>>> root@ceph1:~# ceph orch ls
>>> NAME                  RUNNING  REFRESHED  AGE  PLACEMENT    IMAGE NAME                                                                                   IMAGE ID
>>> alertmanager              1/1  5s ago     2w   count:1      quay.io/prometheus/alertmanager:v0.20.0                                                      0881eb8f169f
>>> crash                     2/2  5s ago     2w   *            quay.io/ceph/ceph:v15                                                                        93146564743f
>>> grafana                   1/1  5s ago     2w   count:1      quay.io/ceph/ceph-grafana:6.7.4                                                              557c83e11646
>>> mgr                       1/2  5s ago     8h   count:2      quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca    93146564743f
>>> mon                       1/2  5s ago     8h   ceph1;ceph2  quay.io/ceph/ceph:v15                                                                        93146564743f
>>> node-exporter             2/2  5s ago     2w   *            quay.io/prometheus/node-exporter:v0.18.1                                                     e5a616e4b9cf
>>> osd.osd_spec_default      4/0  5s ago     -    <unmanaged>  quay.io/ceph/ceph:v15                                                                        93146564743f
>>> prometheus                1/1  5s ago     2w   count:1      quay.io/prometheus/prometheus:v2.18.1
>>>
>>> On Fri, Sep 2, 2022 at 10:13 AM Satish Patel <satish.txt@xxxxxxxxx>
>>> wrote:
>>>
>>>> I can see it in the output, but I'm not sure how to get rid of it:
>>>>
>>>> root@ceph1:~# ceph orch ps --refresh
>>>> NAME                 HOST   STATUS        REFRESHED  AGE  VERSION    IMAGE NAME                                 IMAGE ID      CONTAINER ID
>>>> alertmanager.ceph1   ceph1  running (9h)  64s ago    2w   0.20.0     quay.io/prometheus/alertmanager:v0.20.0    0881eb8f169f  ba804b555378
>>>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d  ceph2  stopped  65s ago  -  <unknown>  <unknown>  <unknown>  <unknown>
>>>> crash.ceph1          ceph1  running (9h)  64s ago    2w   15.2.17    quay.io/ceph/ceph:v15                      93146564743f  a3a431d834fc
>>>> crash.ceph2          ceph2  running (9h)  65s ago    13d  15.2.17    quay.io/ceph/ceph:v15                      93146564743f  3c963693ff2b
>>>> grafana.ceph1        ceph1  running (9h)  64s ago    2w   6.7.4      quay.io/ceph/ceph-grafana:6.7.4            557c83e11646  7583a8dc4c61
>>>> mgr.ceph1.smfvfd     ceph1  running (8h)  64s ago    8h   15.2.17    quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f  1aab837306d2
>>>> mon.ceph1            ceph1  running (9h)  64s ago    2w   15.2.17    quay.io/ceph/ceph:v15                      93146564743f  c1d155d8c7ad
>>>> node-exporter.ceph1  ceph1  running (9h)  64s ago    2w   0.18.1     quay.io/prometheus/node-exporter:v0.18.1   e5a616e4b9cf  2ff235fe0e42
>>>> node-exporter.ceph2  ceph2  running (9h)  65s ago    13d  0.18.1     quay.io/prometheus/node-exporter:v0.18.1   e5a616e4b9cf  17678b9ba602
>>>> osd.0                ceph1  running (9h)  64s ago    13d  15.2.17    quay.io/ceph/ceph:v15                      93146564743f  d0fd73b777a3
>>>> osd.1                ceph1  running (9h)  64s ago    13d  15.2.17    quay.io/ceph/ceph:v15                      93146564743f  049120e83102
>>>> osd.2                ceph2  running (9h)  65s ago    13d  15.2.17    quay.io/ceph/ceph:v15                      93146564743f  8700e8cefd1f
>>>> osd.3                ceph2  running (9h)  65s ago    13d  15.2.17    quay.io/ceph/ceph:v15                      93146564743f  9c71bc87ed16
>>>> prometheus.ceph1     ceph1  running (9h)  64s ago    2w   2.18.1     quay.io/prometheus/prometheus:v2.18.1      de242295e225  74a538efd61e
>>>>
>>>> On Fri, Sep 2, 2022 at 10:10 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>>
>>>>> maybe also a "ceph orch ps --refresh"? It might still have the old
>>>>> cached daemon inventory from before you removed the files.
>>>>>
>>>>> On Fri, Sep 2, 2022 at 9:57 AM Satish Patel <satish.txt@xxxxxxxxx>
>>>>> wrote:
>>>>>
>>>>>> Hi Adam,
>>>>>>
>>>>>> I have deleted the file located here:
>>>>>>
>>>>>> rm /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>>>
>>>>>> But I'm still getting the same error. Do I need to do anything else?
>>>>>>
>>>>>> On Fri, Sep 2, 2022 at 9:51 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>>>>
>>>>>>> Okay, I'm wondering if this is a version-mismatch issue: having
>>>>>>> previously had a 16.2.10 mgr and now having a 15.2.17 one that doesn't
>>>>>>> expect this sort of thing to be present. Either way, I'd think just
>>>>>>> deleting this
>>>>>>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>>>> file (and any others like it) would be the way forward to get orch ls
>>>>>>> working again.
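>>>>>>>
>>>>>>> A sketch of how that "any others like it" check might look on each
>>>>>>> host, assuming the stray entries sit directly under the cluster
>>>>>>> directory as in the path above:
>>>>>>>
>>>>>>> $ ls /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ | grep '^cephadm\.'
>>>>>>> $ rm /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.<hash>
>>>>>>> $ cephadm ls          # confirm the entry no longer shows up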
>>>>>>>
>>>>>>> On Fri, Sep 2, 2022 at 9:44 AM Satish Patel <satish.txt@xxxxxxxxx>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Adam,
>>>>>>>>
>>>>>>>> In "cephadm ls" I found the following service, but I believe it was
>>>>>>>> there before also:
>>>>>>>>
>>>>>>>> {
>>>>>>>>         "style": "cephadm:v1",
>>>>>>>>         "name":
>>>>>>>> "cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
>>>>>>>>         "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>>>         "systemd_unit":
>>>>>>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>>>>> ",
>>>>>>>>         "enabled": false,
>>>>>>>>         "state": "stopped",
>>>>>>>>         "container_id": null,
>>>>>>>>         "container_image_name": null,
>>>>>>>>         "container_image_id": null,
>>>>>>>>         "version": null,
>>>>>>>>         "started": null,
>>>>>>>>         "created": null,
>>>>>>>>         "deployed": null,
>>>>>>>>         "configured": null
>>>>>>>>     },
>>>>>>>>
>>>>>>>> Looks like the remove didn't work:
>>>>>>>>
>>>>>>>> root@ceph1:~# ceph orch rm cephadm
>>>>>>>> Failed to remove service. <cephadm> was not found.
>>>>>>>>
>>>>>>>> root@ceph1:~# ceph orch rm
>>>>>>>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>>>>> Failed to remove service.
>>>>>>>> <cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d>
>>>>>>>> was not found.
>>>>>>>>
>>>>>>>> On Fri, Sep 2, 2022 at 8:27 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>>> this looks like an old traceback you would get if you ended up
>>>>>>>>> with a service type that shouldn't be there somehow. The thing I'd
>>>>>>>>> probably check is that "cephadm ls" on either host definitely
>>>>>>>>> doesn't report any strange things that aren't actually daemons in
>>>>>>>>> your cluster, such as "cephadm.<hash>". Another thing you could
>>>>>>>>> maybe try, as I believe the assertion it's giving is for an unknown
>>>>>>>>> service type here ("AssertionError: cephadm"), is just "ceph orch rm
>>>>>>>>> cephadm", which would maybe cause it to remove whatever it thinks
>>>>>>>>> this "cephadm" service is that it has deployed. Lastly, you could
>>>>>>>>> try having the mgr you manually deploy be a 16.2.10 one instead of
>>>>>>>>> 15.2.17 (I'm assuming here, but the line numbers in that traceback
>>>>>>>>> suggest octopus). The 16.2.10 one is just much less likely to have a
>>>>>>>>> bug that causes something like this.
>>>>>>>>>
>>>>>>>>> On Fri, Sep 2, 2022 at 1:41 AM Satish Patel <satish.txt@xxxxxxxxx>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Now when I run "ceph orch ps" it works, but the following command
>>>>>>>>>> throws an error. I tried to bring up a second mgr using the "ceph
>>>>>>>>>> orch apply mgr" command, but it didn't help.
>>>>>>>>>>
>>>>>>>>>> root@ceph1:/ceph-disk# ceph version
>>>>>>>>>> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4)
>>>>>>>>>> octopus
>>>>>>>>>> (stable)
>>>>>>>>>>
>>>>>>>>>> root@ceph1:/ceph-disk# ceph orch ls
>>>>>>>>>> Error EINVAL: Traceback (most recent call last):
>>>>>>>>>>   File "/usr/share/ceph/mgr/mgr_module.py", line 1212, in
>>>>>>>>>> _handle_command
>>>>>>>>>>     return self.handle_command(inbuf, cmd)
>>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line
>>>>>>>>>> 140, in
>>>>>>>>>> handle_command
>>>>>>>>>>     return dispatch[cmd['prefix']].call(self, cmd, inbuf)
>>>>>>>>>>   File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
>>>>>>>>>>     return self.func(mgr, **kwargs)
>>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line
>>>>>>>>>> 102, in
>>>>>>>>>> <lambda>
>>>>>>>>>>     wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args,
>>>>>>>>>> **l_kwargs)
>>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91,
>>>>>>>>>> in wrapper
>>>>>>>>>>     return func(*args, **kwargs)
>>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/module.py", line 503, in
>>>>>>>>>> _list_services
>>>>>>>>>>     raise_if_exception(completion)
>>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line
>>>>>>>>>> 642, in
>>>>>>>>>> raise_if_exception
>>>>>>>>>>     raise e
>>>>>>>>>> AssertionError: cephadm
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 2, 2022 at 1:32 AM Satish Patel <satish.txt@xxxxxxxxx>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> > Never mind, I found the doc related to that and I am able to get
>>>>>>>>>> > one mgr up:
>>>>>>>>>> >
>>>>>>>>>> > https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > On Fri, Sep 2, 2022 at 1:21 AM Satish Patel <
>>>>>>>>>> satish.txt@xxxxxxxxx> wrote:
>>>>>>>>>> >
>>>>>>>>>> >> Folks,
>>>>>>>>>> >>
>>>>>>>>>> >> I am not having much fun with cephadm; it's very annoying to
>>>>>>>>>> >> deal with.
>>>>>>>>>> >>
>>>>>>>>>> >> I have deployed a ceph cluster using cephadm on two nodes. When
>>>>>>>>>> >> I tried to upgrade, I hit a hiccup where it upgraded a single
>>>>>>>>>> >> mgr to 16.2.10 but not the other, so I started messing around
>>>>>>>>>> >> and somehow deleted both mgrs, thinking that cephadm would
>>>>>>>>>> >> recreate them.
>>>>>>>>>> >>
>>>>>>>>>> >> Now I don't have a single mgr left, so my ceph orch commands
>>>>>>>>>> >> hang forever; it looks like a chicken-and-egg issue.
>>>>>>>>>> >>
>>>>>>>>>> >> How do I recover from this? If I can't run the ceph orch
>>>>>>>>>> >> command, I won't be able to redeploy my mgr daemons.
>>>>>>>>>> >>
>>>>>>>>>> >> I am not able to find any mgr in the output of the following
>>>>>>>>>> >> command on either node:
>>>>>>>>>> >>
>>>>>>>>>> >> $ cephadm ls | grep mgr
>>>>>>>>>> >>
>>>>>>>>>> >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


