Hi Adam,

I ran the following command to upgrade, but it looks like nothing is happening:

$ ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.10

The status message is empty:

root@ceph1:~# ceph orch upgrade status
{
    "target_image": "quay.io/ceph/ceph:v16.2.10",
    "in_progress": true,
    "services_complete": [],
    "message": ""
}

Nothing in the logs:

root@ceph1:~# tail -f /var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log
2022-09-02T14:31:52.597661+0000 mgr.ceph2.huidoh (mgr.344392) 174 : cephadm [INF] refreshing ceph2 facts
2022-09-02T14:31:52.991450+0000 mgr.ceph2.huidoh (mgr.344392) 176 : cephadm [INF] refreshing ceph1 facts
2022-09-02T14:32:52.965092+0000 mgr.ceph2.huidoh (mgr.344392) 207 : cephadm [INF] refreshing ceph2 facts
2022-09-02T14:32:53.369789+0000 mgr.ceph2.huidoh (mgr.344392) 208 : cephadm [INF] refreshing ceph1 facts
2022-09-02T14:33:53.367986+0000 mgr.ceph2.huidoh (mgr.344392) 239 : cephadm [INF] refreshing ceph2 facts
2022-09-02T14:33:53.760427+0000 mgr.ceph2.huidoh (mgr.344392) 240 : cephadm [INF] refreshing ceph1 facts
2022-09-02T14:34:53.754277+0000 mgr.ceph2.huidoh (mgr.344392) 272 : cephadm [INF] refreshing ceph2 facts
2022-09-02T14:34:54.162503+0000 mgr.ceph2.huidoh (mgr.344392) 273 : cephadm [INF] refreshing ceph1 facts
2022-09-02T14:35:54.133467+0000 mgr.ceph2.huidoh (mgr.344392) 305 : cephadm [INF] refreshing ceph2 facts
2022-09-02T14:35:54.522171+0000 mgr.ceph2.huidoh (mgr.344392) 306 : cephadm [INF] refreshing ceph1 facts

The "in progress" message has been stuck there for a long time:

root@ceph1:~# ceph -s
  cluster:
    id:     f270ad9e-1f6f-11ed-b6f8-a539d87379ea
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum ceph1 (age 9h)
    mgr: ceph2.huidoh(active, since 9m), standbys: ceph1.smfvfd
    osd: 4 osds: 4 up (since 9h), 4 in (since 11h)

  data:
    pools:   5 pools, 129 pgs
    objects: 20.06k objects, 83 GiB
    usage:   168 GiB used, 632 GiB / 800 GiB avail
    pgs:     129 active+clean

  io:
    client:   12 KiB/s wr, 0 op/s rd, 1 op/s wr

  progress:
    Upgrade to quay.io/ceph/ceph:v16.2.10 (0s)
      [............................]

On Fri, Sep 2, 2022 at 10:25 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:

> It looks like I did it with the following command.
>
> $ ceph orch daemon add mgr ceph2:10.73.0.192
>
> Now I can see two, both with the same version, 15.x:
>
> root@ceph1:~# ceph orch ps --daemon-type mgr
> NAME              HOST   STATUS         REFRESHED  AGE  VERSION  IMAGE NAME                                                                                 IMAGE ID      CONTAINER ID
> mgr.ceph1.smfvfd  ceph1  running (8h)   41s ago    8h   15.2.17  quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f  1aab837306d2
> mgr.ceph2.huidoh  ceph2  running (60s)  110s ago   60s  15.2.17  quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f  294fd6ab6c97
>
> On Fri, Sep 2, 2022 at 10:19 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>
>> Let's come back to the original question: how to bring back the second mgr?
>>
>> root@ceph1:~# ceph orch apply mgr 2
>> Scheduled mgr update...
>>
>> Nothing happened with the above command, and the logs say nothing:
>>
>> 2022-09-02T14:16:20.407927+0000 mgr.ceph1.smfvfd (mgr.334626) 16939 : cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T14:16:40.247195+0000 mgr.ceph1.smfvfd (mgr.334626) 16952 : cephadm [INF] Saving service mgr spec with placement count:2
>> 2022-09-02T14:16:53.106919+0000 mgr.ceph1.smfvfd (mgr.334626) 16961 : cephadm [INF] Saving service mgr spec with placement count:2
>> 2022-09-02T14:17:19.135203+0000 mgr.ceph1.smfvfd (mgr.334626) 16975 : cephadm [INF] refreshing ceph1 facts
>> 2022-09-02T14:17:20.780496+0000 mgr.ceph1.smfvfd (mgr.334626) 16977 : cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T14:18:19.502034+0000 mgr.ceph1.smfvfd (mgr.334626) 17008 : cephadm [INF] refreshing ceph1 facts
>> 2022-09-02T14:18:21.127973+0000 mgr.ceph1.smfvfd (mgr.334626) 17010 : cephadm [INF] refreshing ceph2 facts
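(A quick, generic checklist for this kind of stall, assuming a cephadm-managed cluster like the one above; these are standard cephadm commands, and the mgr name below is simply the active mgr from the ceph -s output, so substitute your own:)

# Watch what the cephadm module is doing in real time (Ctrl-C to stop),
# or dump its recent log entries without following them.
ceph -W cephadm
ceph log last cephadm

# Check the upgrade state machine.
ceph orch upgrade status

# The orchestrator runs inside the active mgr, so failing over to the
# standby often un-sticks a wedged cephadm module.
ceph orch ps --daemon-type mgr
ceph mgr fail ceph2.huidoh

# A stuck upgrade can also be paused and resumed to nudge it along.
ceph orch upgrade pause
ceph orch upgrade resume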
>>
>> On Fri, Sep 2, 2022 at 10:15 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>
>>> Hi Adam,
>>>
>>> Wait.. wait.. now it's suddenly working without me doing anything. Very odd.
>>>
>>> root@ceph1:~# ceph orch ls
>>> NAME                  RUNNING  REFRESHED  AGE  PLACEMENT    IMAGE NAME                                                                                 IMAGE ID
>>> alertmanager          1/1      5s ago     2w   count:1      quay.io/prometheus/alertmanager:v0.20.0                                                    0881eb8f169f
>>> crash                 2/2      5s ago     2w   *            quay.io/ceph/ceph:v15                                                                      93146564743f
>>> grafana               1/1      5s ago     2w   count:1      quay.io/ceph/ceph-grafana:6.7.4                                                            557c83e11646
>>> mgr                   1/2      5s ago     8h   count:2      quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f
>>> mon                   1/2      5s ago     8h   ceph1;ceph2  quay.io/ceph/ceph:v15                                                                      93146564743f
>>> node-exporter         2/2      5s ago     2w   *            quay.io/prometheus/node-exporter:v0.18.1                                                   e5a616e4b9cf
>>> osd.osd_spec_default  4/0      5s ago     -    <unmanaged>  quay.io/ceph/ceph:v15                                                                      93146564743f
>>> prometheus            1/1      5s ago     2w   count:1      quay.io/prometheus/prometheus:v2.18.1
>>>
>>> On Fri, Sep 2, 2022 at 10:13 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>
>>>> I can see that in the output, but I'm not sure how to get rid of it.
>>>>
>>>> root@ceph1:~# ceph orch ps --refresh
>>>> NAME  HOST  STATUS  REFRESHED  AGE  VERSION  IMAGE NAME  IMAGE ID  CONTAINER ID
>>>> alertmanager.ceph1  ceph1  running (9h)  64s ago  2w  0.20.0  quay.io/prometheus/alertmanager:v0.20.0  0881eb8f169f  ba804b555378
>>>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d  ceph2  stopped  65s ago  -  <unknown>  <unknown>  <unknown>  <unknown>
>>>> crash.ceph1  ceph1  running (9h)  64s ago  2w  15.2.17  quay.io/ceph/ceph:v15  93146564743f  a3a431d834fc
>>>> crash.ceph2  ceph2  running (9h)  65s ago  13d  15.2.17  quay.io/ceph/ceph:v15  93146564743f  3c963693ff2b
>>>> grafana.ceph1  ceph1  running (9h)  64s ago  2w  6.7.4  quay.io/ceph/ceph-grafana:6.7.4  557c83e11646  7583a8dc4c61
>>>> mgr.ceph1.smfvfd  ceph1  running (8h)  64s ago  8h  15.2.17  quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f  1aab837306d2
>>>> mon.ceph1  ceph1  running (9h)  64s ago  2w  15.2.17  quay.io/ceph/ceph:v15  93146564743f  c1d155d8c7ad
>>>> node-exporter.ceph1  ceph1  running (9h)  64s ago  2w  0.18.1  quay.io/prometheus/node-exporter:v0.18.1  e5a616e4b9cf  2ff235fe0e42
>>>> node-exporter.ceph2  ceph2  running (9h)  65s ago  13d  0.18.1  quay.io/prometheus/node-exporter:v0.18.1  e5a616e4b9cf  17678b9ba602
>>>> osd.0  ceph1  running (9h)  64s ago  13d  15.2.17  quay.io/ceph/ceph:v15  93146564743f  d0fd73b777a3
>>>> osd.1  ceph1  running (9h)  64s ago  13d  15.2.17  quay.io/ceph/ceph:v15  93146564743f  049120e83102
>>>> osd.2  ceph2  running (9h)  65s ago  13d  15.2.17  quay.io/ceph/ceph:v15  93146564743f  8700e8cefd1f
>>>> osd.3  ceph2  running (9h)  65s ago  13d  15.2.17  quay.io/ceph/ceph:v15  93146564743f  9c71bc87ed16
>>>> prometheus.ceph1  ceph1  running (9h)  64s ago  2w  2.18.1  quay.io/prometheus/prometheus:v2.18.1  de242295e225  74a538efd61e
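(What eventually cleared this, pieced together from the rest of the thread below: the stray "cephadm.<hash>" entry appears to be a leftover file under /var/lib/ceph/<fsid>/ on ceph2 rather than a real daemon, and cephadm's cached inventory keeps reporting it. A rough sketch of the cleanup, using the fsid and hash from this cluster; double-check both on your own host:)

# On the host that shows the stray entry, check what cephadm thinks is deployed.
cephadm ls

# Remove the leftover cephadm.<hash> file for this cluster's fsid.
rm /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d

# Then force the orchestrator to rebuild its cached daemon inventory.
ceph orch ps --refresh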
>>>>
>>>> On Fri, Sep 2, 2022 at 10:10 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>>
>>>>> Maybe also a "ceph orch ps --refresh"? It might still have the old
>>>>> cached daemon inventory from before you removed the files.
>>>>>
>>>>> On Fri, Sep 2, 2022 at 9:57 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>
>>>>>> Hi Adam,
>>>>>>
>>>>>> I have deleted the file located here:
>>>>>>
>>>>>> rm /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>>>
>>>>>> But I'm still getting the same error. Do I need to do anything else?
>>>>>>
>>>>>> On Fri, Sep 2, 2022 at 9:51 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>>>>
>>>>>>> Okay, I'm wondering if this is an issue with a version mismatch:
>>>>>>> having previously had a 16.2.10 mgr and now having a 15.2.17 one that
>>>>>>> doesn't expect this sort of thing to be present. Either way, I'd think
>>>>>>> just deleting this
>>>>>>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>>>> file (and any others like it) would be the way forward to get orch ls
>>>>>>> working again.
>>>>>>>
>>>>>>> On Fri, Sep 2, 2022 at 9:44 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> Hi Adam,
>>>>>>>>
>>>>>>>> In "cephadm ls" I found the following service, but I believe it was
>>>>>>>> there before as well.
>>>>>>>>
>>>>>>>> {
>>>>>>>>     "style": "cephadm:v1",
>>>>>>>>     "name": "cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
>>>>>>>>     "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>>>     "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
>>>>>>>>     "enabled": false,
>>>>>>>>     "state": "stopped",
>>>>>>>>     "container_id": null,
>>>>>>>>     "container_image_name": null,
>>>>>>>>     "container_image_id": null,
>>>>>>>>     "version": null,
>>>>>>>>     "started": null,
>>>>>>>>     "created": null,
>>>>>>>>     "deployed": null,
>>>>>>>>     "configured": null
>>>>>>>> },
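(Side note: "cephadm ls" emits a JSON array, so stray entries like the one above can be listed quickly. A small sketch, assuming jq is available on the host; jq is not shipped with cephadm itself:)

# List inventory entries whose name starts with "cephadm."; these are not
# real daemons and usually point at leftover files rather than containers.
cephadm ls | jq -r '.[] | select(.name | startswith("cephadm.")) | .name'

# The matching leftovers live under /var/lib/ceph/<fsid>/ on that host.
ls /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ | grep '^cephadm\.'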
>>>>>>>>
>>>>>>>> Looks like the remove didn't work:
>>>>>>>>
>>>>>>>> root@ceph1:~# ceph orch rm cephadm
>>>>>>>> Failed to remove service. <cephadm> was not found.
>>>>>>>>
>>>>>>>> root@ceph1:~# ceph orch rm cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>>>>> Failed to remove service. <cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d> was not found.
>>>>>>>>
>>>>>>>> On Fri, Sep 2, 2022 at 8:27 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>>> This looks like an old traceback you would get if you ended up with a
>>>>>>>>> service type that shouldn't be there somehow. The thing I'd probably
>>>>>>>>> check is that "cephadm ls" on either host definitely doesn't report any
>>>>>>>>> strange things that aren't actually daemons in your cluster, such as
>>>>>>>>> "cephadm.<hash>". Another thing you could maybe try, as I believe the
>>>>>>>>> assertion it's giving is for an unknown service type here
>>>>>>>>> ("AssertionError: cephadm"), is just "ceph orch rm cephadm", which would
>>>>>>>>> maybe cause it to remove whatever it thinks is this "cephadm" service
>>>>>>>>> that it has deployed. Lastly, you could try having the mgr you manually
>>>>>>>>> deploy be a 16.2.10 one instead of 15.2.17 (I'm assuming here, but the
>>>>>>>>> line numbers in that traceback suggest Octopus). The 16.2.10 one is just
>>>>>>>>> much less likely to have a bug that causes something like this.
>>>>>>>>>
>>>>>>>>> On Fri, Sep 2, 2022 at 1:41 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>> Now when I run "ceph orch ps" it works, but the following command throws
>>>>>>>>>> an error. Trying to bring up the second mgr using the "ceph orch apply mgr"
>>>>>>>>>> command didn't help either.
>>>>>>>>>>
>>>>>>>>>> root@ceph1:/ceph-disk# ceph version
>>>>>>>>>> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
>>>>>>>>>>
>>>>>>>>>> root@ceph1:/ceph-disk# ceph orch ls
>>>>>>>>>> Error EINVAL: Traceback (most recent call last):
>>>>>>>>>>   File "/usr/share/ceph/mgr/mgr_module.py", line 1212, in _handle_command
>>>>>>>>>>     return self.handle_command(inbuf, cmd)
>>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in handle_command
>>>>>>>>>>     return dispatch[cmd['prefix']].call(self, cmd, inbuf)
>>>>>>>>>>   File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
>>>>>>>>>>     return self.func(mgr, **kwargs)
>>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in <lambda>
>>>>>>>>>>     wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
>>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in wrapper
>>>>>>>>>>     return func(*args, **kwargs)
>>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/module.py", line 503, in _list_services
>>>>>>>>>>     raise_if_exception(completion)
>>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in raise_if_exception
>>>>>>>>>>     raise e
>>>>>>>>>> AssertionError: cephadm
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 2, 2022 at 1:32 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> > Never mind, I found the doc related to that and I am able to get 1 mgr up:
>>>>>>>>>> > https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon
>>>>>>>>>> >
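(For completeness, the procedure behind that link boils down to roughly the following. This is a paraphrase from memory rather than the authoritative text, so follow the doc itself for exact flags; the fsid and image tag are taken from this thread, while mgr.ceph2.recovery and config-json.json are placeholder names:)

# Run these from a node that still has the admin keyring ("cephadm shell" also works).
# Pause the cephadm module so it does not clean up the hand-made daemon.
ceph config-key set mgr/cephadm/pause true

# Create a keyring for the new mgr and generate a minimal ceph.conf.
ceph auth get-or-create mgr.ceph2.recovery mon "profile mgr" osd "allow *" mds "allow *"
ceph config generate-minimal-conf

# Put the conf and keyring into a JSON file, e.g. config-json.json containing
# {"config": "<minimal conf>", "keyring": "<mgr keyring>"}, then deploy the
# container directly on the target host with cephadm.
cephadm --image quay.io/ceph/ceph:v16.2.10 deploy \
    --fsid f270ad9e-1f6f-11ed-b6f8-a539d87379ea \
    --name mgr.ceph2.recovery \
    --config-json config-json.json

# Once "ceph orch" responds again, unpause the scheduler.
ceph config-key set mgr/cephadm/pause false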
>>>>>>>>>> > On Fri, Sep 2, 2022 at 1:21 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>>>>> >
>>>>>>>>>> >> Folks,
>>>>>>>>>> >>
>>>>>>>>>> >> I am having a little fun time with cephadm, and it's very annoying
>>>>>>>>>> >> to deal with.
>>>>>>>>>> >>
>>>>>>>>>> >> I have deployed a Ceph cluster using cephadm on two nodes. When I
>>>>>>>>>> >> was trying to upgrade, I noticed hiccups where it upgraded only a
>>>>>>>>>> >> single mgr to 16.2.10 but not the other, so I started messing
>>>>>>>>>> >> around and somehow deleted both mgrs, thinking that cephadm would
>>>>>>>>>> >> recreate them.
>>>>>>>>>> >>
>>>>>>>>>> >> Now I don't have a single mgr, so my ceph orch command hangs
>>>>>>>>>> >> forever, and it looks like a chicken-and-egg issue.
>>>>>>>>>> >>
>>>>>>>>>> >> How do I recover from this? If I can't run the ceph orch command,
>>>>>>>>>> >> I won't be able to redeploy my mgr daemons.
>>>>>>>>>> >>
>>>>>>>>>> >> I am not able to find any mgr in the following command on both nodes.
>>>>>>>>>> >>
>>>>>>>>>> >> $ cephadm ls | grep mgr
>>>>>>>>>> >>
>>>>>>>>>> >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx