Re: [cephadm] mgr: no daemons active

Hmm, at this point maybe we should just try manually upgrading the mgr
daemons and move on from there. First, stop the upgrade with "ceph orch
upgrade stop". Figure out which of the two mgr daemons is the standby
("ceph -s" output says which one is active), then run "ceph orch daemon
redeploy <standby-mgr-name> quay.io/ceph/ceph:v16.2.10" to redeploy that
specific mgr with the new version. You can then run "ceph mgr fail" to
swap which of the mgr daemons is active, and repeat "ceph orch daemon
redeploy <standby-mgr-name> quay.io/ceph/ceph:v16.2.10", where the standby
is now the other mgr, still on 15.2.17. Once both mgr daemons are upgraded
to the new version, run "ceph orch redeploy mgr", then "ceph orch upgrade
start --image quay.io/ceph/ceph:v16.2.10" and see if it goes better.

On Fri, Sep 2, 2022 at 10:36 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:

> Hi Adam,
>
> I ran the following command to upgrade, but it looks like nothing is
> happening:
>
> $ ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.10
>
> The status message is empty:
>
> root@ceph1:~# ceph orch upgrade status
> {
>     "target_image": "quay.io/ceph/ceph:v16.2.10",
>     "in_progress": true,
>     "services_complete": [],
>     "message": ""
> }
>
> Nothing in the logs:
>
> root@ceph1:~# tail -f /var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log
> 2022-09-02T14:31:52.597661+0000 mgr.ceph2.huidoh (mgr.344392) 174 : cephadm [INF] refreshing ceph2 facts
> 2022-09-02T14:31:52.991450+0000 mgr.ceph2.huidoh (mgr.344392) 176 : cephadm [INF] refreshing ceph1 facts
> 2022-09-02T14:32:52.965092+0000 mgr.ceph2.huidoh (mgr.344392) 207 : cephadm [INF] refreshing ceph2 facts
> 2022-09-02T14:32:53.369789+0000 mgr.ceph2.huidoh (mgr.344392) 208 : cephadm [INF] refreshing ceph1 facts
> 2022-09-02T14:33:53.367986+0000 mgr.ceph2.huidoh (mgr.344392) 239 : cephadm [INF] refreshing ceph2 facts
> 2022-09-02T14:33:53.760427+0000 mgr.ceph2.huidoh (mgr.344392) 240 : cephadm [INF] refreshing ceph1 facts
> 2022-09-02T14:34:53.754277+0000 mgr.ceph2.huidoh (mgr.344392) 272 : cephadm [INF] refreshing ceph2 facts
> 2022-09-02T14:34:54.162503+0000 mgr.ceph2.huidoh (mgr.344392) 273 : cephadm [INF] refreshing ceph1 facts
> 2022-09-02T14:35:54.133467+0000 mgr.ceph2.huidoh (mgr.344392) 305 : cephadm [INF] refreshing ceph2 facts
> 2022-09-02T14:35:54.522171+0000 mgr.ceph2.huidoh (mgr.344392) 306 : cephadm [INF] refreshing ceph1 facts
>
> The in-progress message has been stuck there for a long time:
>
> root@ceph1:~# ceph -s
>   cluster:
>     id:     f270ad9e-1f6f-11ed-b6f8-a539d87379ea
>     health: HEALTH_OK
>
>   services:
>     mon: 1 daemons, quorum ceph1 (age 9h)
>     mgr: ceph2.huidoh(active, since 9m), standbys: ceph1.smfvfd
>     osd: 4 osds: 4 up (since 9h), 4 in (since 11h)
>
>   data:
>     pools:   5 pools, 129 pgs
>     objects: 20.06k objects, 83 GiB
>     usage:   168 GiB used, 632 GiB / 800 GiB avail
>     pgs:     129 active+clean
>
>   io:
>     client:   12 KiB/s wr, 0 op/s rd, 1 op/s wr
>
>   progress:
>     Upgrade to quay.io/ceph/ceph:v16.2.10 (0s)
>       [............................]
>
> On Fri, Sep 2, 2022 at 10:25 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>
>> It looks like I did it with the following command.
>>
>> $ ceph orch daemon add mgr ceph2:10.73.0.192
>>
>> Now I can see two mgr daemons with the same version, 15.x:
>>
>> root@ceph1:~# ceph orch ps --daemon-type mgr
>> NAME              HOST   STATUS         REFRESHED  AGE  VERSION  IMAGE
>> NAME
>>           IMAGE ID      CONTAINER ID
>> mgr.ceph1.smfvfd  ceph1  running (8h)   41s ago    8h   15.2.17
>> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca
>>  93146564743f  1aab837306d2
>> mgr.ceph2.huidoh  ceph2  running (60s)  110s ago   60s  15.2.17
>> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca
>>  93146564743f  294fd6ab6c97
>>
>> On Fri, Sep 2, 2022 at 10:19 AM Satish Patel <satish.txt@xxxxxxxxx>
>> wrote:
>>
>>> Let's come back to the original question: how to bring back the second
>>> mgr?
>>>
>>> root@ceph1:~# ceph orch apply mgr 2
>>> Scheduled mgr update...
>>>
>>> Nothing happened with the above command; the logs only show the spec being saved:
>>>
>>> 2022-09-02T14:16:20.407927+0000 mgr.ceph1.smfvfd (mgr.334626) 16939 : cephadm [INF] refreshing ceph2 facts
>>> 2022-09-02T14:16:40.247195+0000 mgr.ceph1.smfvfd (mgr.334626) 16952 : cephadm [INF] Saving service mgr spec with placement count:2
>>> 2022-09-02T14:16:53.106919+0000 mgr.ceph1.smfvfd (mgr.334626) 16961 : cephadm [INF] Saving service mgr spec with placement count:2
>>> 2022-09-02T14:17:19.135203+0000 mgr.ceph1.smfvfd (mgr.334626) 16975 : cephadm [INF] refreshing ceph1 facts
>>> 2022-09-02T14:17:20.780496+0000 mgr.ceph1.smfvfd (mgr.334626) 16977 : cephadm [INF] refreshing ceph2 facts
>>> 2022-09-02T14:18:19.502034+0000 mgr.ceph1.smfvfd (mgr.334626) 17008 : cephadm [INF] refreshing ceph1 facts
>>> 2022-09-02T14:18:21.127973+0000 mgr.ceph1.smfvfd (mgr.334626) 17010 : cephadm [INF] refreshing ceph2 facts
>>>
>>> On Fri, Sep 2, 2022 at 10:15 AM Satish Patel <satish.txt@xxxxxxxxx>
>>> wrote:
>>>
>>>> Hi Adam,
>>>>
>>>> Wait... now it's suddenly working, without my doing anything. Very odd:
>>>>
>>>> root@ceph1:~# ceph orch ls
>>>> NAME                  RUNNING  REFRESHED  AGE  PLACEMENT    IMAGE NAME                                                                                  IMAGE ID
>>>> alertmanager              1/1  5s ago     2w   count:1      quay.io/prometheus/alertmanager:v0.20.0                                                     0881eb8f169f
>>>> crash                     2/2  5s ago     2w   *            quay.io/ceph/ceph:v15                                                                       93146564743f
>>>> grafana                   1/1  5s ago     2w   count:1      quay.io/ceph/ceph-grafana:6.7.4                                                             557c83e11646
>>>> mgr                       1/2  5s ago     8h   count:2      quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f
>>>> mon                       1/2  5s ago     8h   ceph1;ceph2  quay.io/ceph/ceph:v15                                                                       93146564743f
>>>> node-exporter             2/2  5s ago     2w   *            quay.io/prometheus/node-exporter:v0.18.1                                                    e5a616e4b9cf
>>>> osd.osd_spec_default      4/0  5s ago     -    <unmanaged>  quay.io/ceph/ceph:v15                                                                       93146564743f
>>>> prometheus                1/1  5s ago     2w   count:1      quay.io/prometheus/prometheus:v2.18.1
>>>>
>>>> On Fri, Sep 2, 2022 at 10:13 AM Satish Patel <satish.txt@xxxxxxxxx>
>>>> wrote:
>>>>
>>>>> I can see it in the output, but I'm not sure how to get rid of it:
>>>>>
>>>>> root@ceph1:~# ceph orch ps --refresh
>>>>> NAME                                                                      HOST   STATUS        REFRESHED  AGE  VERSION    IMAGE NAME                                                                                  IMAGE ID      CONTAINER ID
>>>>> alertmanager.ceph1                                                        ceph1  running (9h)  64s ago    2w   0.20.0     quay.io/prometheus/alertmanager:v0.20.0                                                     0881eb8f169f  ba804b555378
>>>>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d  ceph2  stopped       65s ago    -    <unknown>  <unknown>                                                                                   <unknown>     <unknown>
>>>>> crash.ceph1                                                               ceph1  running (9h)  64s ago    2w   15.2.17    quay.io/ceph/ceph:v15                                                                       93146564743f  a3a431d834fc
>>>>> crash.ceph2                                                               ceph2  running (9h)  65s ago    13d  15.2.17    quay.io/ceph/ceph:v15                                                                       93146564743f  3c963693ff2b
>>>>> grafana.ceph1                                                             ceph1  running (9h)  64s ago    2w   6.7.4      quay.io/ceph/ceph-grafana:6.7.4                                                             557c83e11646  7583a8dc4c61
>>>>> mgr.ceph1.smfvfd                                                          ceph1  running (8h)  64s ago    8h   15.2.17    quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f  1aab837306d2
>>>>> mon.ceph1                                                                 ceph1  running (9h)  64s ago    2w   15.2.17    quay.io/ceph/ceph:v15                                                                       93146564743f  c1d155d8c7ad
>>>>> node-exporter.ceph1                                                       ceph1  running (9h)  64s ago    2w   0.18.1     quay.io/prometheus/node-exporter:v0.18.1                                                    e5a616e4b9cf  2ff235fe0e42
>>>>> node-exporter.ceph2                                                       ceph2  running (9h)  65s ago    13d  0.18.1     quay.io/prometheus/node-exporter:v0.18.1                                                    e5a616e4b9cf  17678b9ba602
>>>>> osd.0                                                                     ceph1  running (9h)  64s ago    13d  15.2.17    quay.io/ceph/ceph:v15                                                                       93146564743f  d0fd73b777a3
>>>>> osd.1                                                                     ceph1  running (9h)  64s ago    13d  15.2.17    quay.io/ceph/ceph:v15                                                                       93146564743f  049120e83102
>>>>> osd.2                                                                     ceph2  running (9h)  65s ago    13d  15.2.17    quay.io/ceph/ceph:v15                                                                       93146564743f  8700e8cefd1f
>>>>> osd.3                                                                     ceph2  running (9h)  65s ago    13d  15.2.17    quay.io/ceph/ceph:v15                                                                       93146564743f  9c71bc87ed16
>>>>> prometheus.ceph1                                                          ceph1  running (9h)  64s ago    2w   2.18.1     quay.io/prometheus/prometheus:v2.18.1                                                       de242295e225  74a538efd61e
>>>>>
>>>>> On Fri, Sep 2, 2022 at 10:10 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>>>
>>>>>> Maybe also a "ceph orch ps --refresh"? It might still have the old
>>>>>> cached daemon inventory from before you removed the files.
>>>>>>
>>>>>> On Fri, Sep 2, 2022 at 9:57 AM Satish Patel <satish.txt@xxxxxxxxx>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Adam,
>>>>>>>
>>>>>>> I have deleted the file located here:
>>>>>>>
>>>>>>> rm /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>>>>
>>>>>>> But I'm still getting the same error. Do I need to do anything else?
>>>>>>>
>>>>>>> On Fri, Sep 2, 2022 at 9:51 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> Okay, I'm wondering if this is an issue with a version mismatch:
>>>>>>>> having previously had a 16.2.10 mgr, you now have a 15.2.17 one that
>>>>>>>> doesn't expect this sort of thing to be present. Either way, I'd
>>>>>>>> think just deleting this
>>>>>>>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>>>>> file (and any others like it) would be the way forward to get orch ls
>>>>>>>> working again.
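>>>>>>>>
>>>>>>>> As a hedged sketch of that cleanup on each host (the fsid is this
>>>>>>>> cluster's, and the exact filename should be whatever "cephadm ls"
>>>>>>>> reports; the rm below is an illustration, not a verified fix):
>>>>>>>>
>>>>>>>>   # leftover pseudo-daemon entries live as files under the cluster dir
>>>>>>>>   ls /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ | grep '^cephadm\.'
>>>>>>>>   rm /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d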
>>>>>>>>
>>>>>>>> On Fri, Sep 2, 2022 at 9:44 AM Satish Patel <satish.txt@xxxxxxxxx>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Adam,
>>>>>>>>>
>>>>>>>>> In "cephadm ls" I found the following service, but I believe it was
>>>>>>>>> there before as well:
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>>     "style": "cephadm:v1",
>>>>>>>>>     "name": "cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
>>>>>>>>>     "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>>>>     "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
>>>>>>>>>     "enabled": false,
>>>>>>>>>     "state": "stopped",
>>>>>>>>>     "container_id": null,
>>>>>>>>>     "container_image_name": null,
>>>>>>>>>     "container_image_id": null,
>>>>>>>>>     "version": null,
>>>>>>>>>     "started": null,
>>>>>>>>>     "created": null,
>>>>>>>>>     "deployed": null,
>>>>>>>>>     "configured": null
>>>>>>>>> },
>>>>>>>>>
>>>>>>>>> Looks like the remove didn't work:
>>>>>>>>>
>>>>>>>>> root@ceph1:~# ceph orch rm cephadm
>>>>>>>>> Failed to remove service. <cephadm> was not found.
>>>>>>>>>
>>>>>>>>> root@ceph1:~# ceph orch rm cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>>>>>> Failed to remove service. <cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d> was not found.
>>>>>>>>>
>>>>>>>>> On Fri, Sep 2, 2022 at 8:27 AM Adam King <adking@xxxxxxxxxx>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> This looks like an old traceback you would get if you somehow ended
>>>>>>>>>> up with a service type that shouldn't be there. The first thing I'd
>>>>>>>>>> check is that "cephadm ls" on either host definitely doesn't report
>>>>>>>>>> any strange entries that aren't actually daemons in your cluster,
>>>>>>>>>> such as "cephadm.<hash>". Another thing you could try, as I believe
>>>>>>>>>> the assertion it's giving is for an unknown service type here
>>>>>>>>>> ("AssertionError: cephadm"), is just "ceph orch rm cephadm", which
>>>>>>>>>> might cause it to remove whatever it thinks this deployed "cephadm"
>>>>>>>>>> service is. Lastly, you could try making the mgr you manually deploy
>>>>>>>>>> a 16.2.10 one instead of 15.2.17 (I'm assuming here, but the line
>>>>>>>>>> numbers in that traceback suggest octopus); the 16.2.10 one is much
>>>>>>>>>> less likely to have a bug that causes something like this.
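>>>>>>>>>>
>>>>>>>>>> A rough sketch of those first two checks ("cephadm ls" emits a JSON
>>>>>>>>>> array, so any JSON tool works; the jq filter is my own illustration):
>>>>>>>>>>
>>>>>>>>>>   # on each host, list every daemon name cephadm thinks exists
>>>>>>>>>>   cephadm ls | jq -r '.[].name'
>>>>>>>>>>   # if a stray "cephadm.<hash>" entry shows up, try removing the service
>>>>>>>>>>   ceph orch rm cephadm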
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 2, 2022 at 1:41 AM Satish Patel <satish.txt@xxxxxxxxx>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Now when I run "ceph orch ps" it works, but the following command
>>>>>>>>>>> throws an error. I tried to bring up a second mgr using the "ceph
>>>>>>>>>>> orch apply mgr" command, but it didn't help.
>>>>>>>>>>>
>>>>>>>>>>> root@ceph1:/ceph-disk# ceph version
>>>>>>>>>>> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
>>>>>>>>>>>
>>>>>>>>>>> root@ceph1:/ceph-disk# ceph orch ls
>>>>>>>>>>> Error EINVAL: Traceback (most recent call last):
>>>>>>>>>>>   File "/usr/share/ceph/mgr/mgr_module.py", line 1212, in
>>>>>>>>>>> _handle_command
>>>>>>>>>>>     return self.handle_command(inbuf, cmd)
>>>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line
>>>>>>>>>>> 140, in
>>>>>>>>>>> handle_command
>>>>>>>>>>>     return dispatch[cmd['prefix']].call(self, cmd, inbuf)
>>>>>>>>>>>   File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
>>>>>>>>>>>     return self.func(mgr, **kwargs)
>>>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line
>>>>>>>>>>> 102, in
>>>>>>>>>>> <lambda>
>>>>>>>>>>>     wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args,
>>>>>>>>>>> **l_kwargs)
>>>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line
>>>>>>>>>>> 91, in wrapper
>>>>>>>>>>>     return func(*args, **kwargs)
>>>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/module.py", line 503, in
>>>>>>>>>>> _list_services
>>>>>>>>>>>     raise_if_exception(completion)
>>>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line
>>>>>>>>>>> 642, in
>>>>>>>>>>> raise_if_exception
>>>>>>>>>>>     raise e
>>>>>>>>>>> AssertionError: cephadm
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 2, 2022 at 1:32 AM Satish Patel <
>>>>>>>>>>> satish.txt@xxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> > Never mind, I found the doc related to that, and I was able to
>>>>>>>>>>> > get one mgr up:
>>>>>>>>>>> > https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon
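>>>>>>>>>>> >
>>>>>>>>>>> > For the archives, the gist of what that page describes, sketched
>>>>>>>>>>> > from memory and hedged accordingly (the daemon name, image, and
>>>>>>>>>>> > config-json contents here are illustrative; follow the link for
>>>>>>>>>>> > the authoritative steps):
>>>>>>>>>>> >
>>>>>>>>>>> >   # create a keyring for the new mgr and generate a minimal conf
>>>>>>>>>>> >   ceph auth create mgr.ceph1.smfvfd mon 'profile mgr' osd 'allow *' mds 'allow *'
>>>>>>>>>>> >   ceph config generate-minimal-conf
>>>>>>>>>>> >   # place the conf and keyring into config-json.json, then deploy
>>>>>>>>>>> >   cephadm --image quay.io/ceph/ceph:v15 deploy --fsid f270ad9e-1f6f-11ed-b6f8-a539d87379ea --name mgr.ceph1.smfvfd --config-json config-json.json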
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > On Fri, Sep 2, 2022 at 1:21 AM Satish Patel <
>>>>>>>>>>> satish.txt@xxxxxxxxx> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> >> Folks,
>>>>>>>>>>> >>
>>>>>>>>>>> >> I am not having much fun with cephadm; it's very annoying to
>>>>>>>>>>> >> deal with.
>>>>>>>>>>> >>
>>>>>>>>>>> >> I deployed a ceph cluster using cephadm on two nodes. When I
>>>>>>>>>>> >> tried to upgrade, I hit a hiccup where it upgraded a single mgr
>>>>>>>>>>> >> to 16.2.10 but not the other, so I started messing around and
>>>>>>>>>>> >> somehow deleted both mgr daemons, thinking that cephadm would
>>>>>>>>>>> >> recreate them.
>>>>>>>>>>> >>
>>>>>>>>>>> >> Now I don't have a single mgr, so my ceph orch commands hang
>>>>>>>>>>> >> forever; it looks like a chicken-and-egg issue.
>>>>>>>>>>> >>
>>>>>>>>>>> >> How do I recover from this? If I can't run the ceph orch
>>>>>>>>>>> command, I won't
>>>>>>>>>>> >> be able to redeploy my mgr daemons.
>>>>>>>>>>> >>
>>>>>>>>>>> >> I am not able to find any mgr in the output of the following
>>>>>>>>>>> >> command on either node:
>>>>>>>>>>> >>
>>>>>>>>>>> >> $ cephadm ls | grep mgr
>>>>>>>>>>> >>
>>>>>>>>>>> >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


