It looks like I did it with the following command:

$ ceph orch daemon add mgr ceph2:10.73.0.192

Now I can see two mgr daemons with the same version, 15.x:

root@ceph1:~# ceph orch ps --daemon-type mgr
NAME              HOST   STATUS         REFRESHED  AGE  VERSION  IMAGE NAME                                                                                 IMAGE ID      CONTAINER ID
mgr.ceph1.smfvfd  ceph1  running (8h)   41s ago    8h   15.2.17  quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f  1aab837306d2
mgr.ceph2.huidoh  ceph2  running (60s)  110s ago   60s  15.2.17  quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f  294fd6ab6c97

On Fri, Sep 2, 2022 at 10:19 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:

> Let's come back to the original question: how do I bring back the second mgr?
>
> root@ceph1:~# ceph orch apply mgr 2
> Scheduled mgr update...
>
> Nothing happened with the above command, and the logs say nothing useful:
>
> 2022-09-02T14:16:20.407927+0000 mgr.ceph1.smfvfd (mgr.334626) 16939 : cephadm [INF] refreshing ceph2 facts
> 2022-09-02T14:16:40.247195+0000 mgr.ceph1.smfvfd (mgr.334626) 16952 : cephadm [INF] Saving service mgr spec with placement count:2
> 2022-09-02T14:16:53.106919+0000 mgr.ceph1.smfvfd (mgr.334626) 16961 : cephadm [INF] Saving service mgr spec with placement count:2
> 2022-09-02T14:17:19.135203+0000 mgr.ceph1.smfvfd (mgr.334626) 16975 : cephadm [INF] refreshing ceph1 facts
> 2022-09-02T14:17:20.780496+0000 mgr.ceph1.smfvfd (mgr.334626) 16977 : cephadm [INF] refreshing ceph2 facts
> 2022-09-02T14:18:19.502034+0000 mgr.ceph1.smfvfd (mgr.334626) 17008 : cephadm [INF] refreshing ceph1 facts
> 2022-09-02T14:18:21.127973+0000 mgr.ceph1.smfvfd (mgr.334626) 17010 : cephadm [INF] refreshing ceph2 facts
>
> On Fri, Sep 2, 2022 at 10:15 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>
>> Hi Adam,
>>
>> Wait.. wait.. now it's suddenly working without my doing anything. Very odd.
>>
>> root@ceph1:~# ceph orch ls
>> NAME                  RUNNING  REFRESHED  AGE  PLACEMENT    IMAGE NAME                                                                                 IMAGE ID
>> alertmanager          1/1      5s ago     2w   count:1      quay.io/prometheus/alertmanager:v0.20.0                                                    0881eb8f169f
>> crash                 2/2      5s ago     2w   *            quay.io/ceph/ceph:v15                                                                      93146564743f
>> grafana               1/1      5s ago     2w   count:1      quay.io/ceph/ceph-grafana:6.7.4                                                            557c83e11646
>> mgr                   1/2      5s ago     8h   count:2      quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f
>> mon                   1/2      5s ago     8h   ceph1;ceph2  quay.io/ceph/ceph:v15                                                                      93146564743f
>> node-exporter         2/2      5s ago     2w   *            quay.io/prometheus/node-exporter:v0.18.1                                                   e5a616e4b9cf
>> osd.osd_spec_default  4/0      5s ago     -    <unmanaged>  quay.io/ceph/ceph:v15                                                                      93146564743f
>> prometheus            1/1      5s ago     2w   count:1      quay.io/prometheus/prometheus:v2.18.1
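For reference, here is a rough sketch of the two ways this got resolved in the thread, using the hostnames above. The --placement form is documented for cephadm, but the exact syntax may vary between releases, so treat this as a sketch rather than the one true procedure.

# Pin the mgr service to explicit hosts instead of relying on a bare count
ceph orch apply mgr --placement="ceph1 ceph2"

# Or add the missing daemon directly on the target host, as was done above
ceph orch daemon add mgr ceph2:10.73.0.192

# Confirm both daemons are reported
ceph orch ps --daemon-type mgr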
>> On Fri, Sep 2, 2022 at 10:13 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>
>>> I can see that in the output, but I'm not sure how to get rid of it.
>>>
>>> root@ceph1:~# ceph orch ps --refresh
>>> NAME                                                                      HOST   STATUS        REFRESHED  AGE  VERSION    IMAGE NAME                                                                                 IMAGE ID      CONTAINER ID
>>> alertmanager.ceph1                                                        ceph1  running (9h)  64s ago    2w   0.20.0     quay.io/prometheus/alertmanager:v0.20.0                                                    0881eb8f169f  ba804b555378
>>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d  ceph2  stopped       65s ago    -    <unknown>  <unknown>                                                                                  <unknown>     <unknown>
>>> crash.ceph1                                                               ceph1  running (9h)  64s ago    2w   15.2.17    quay.io/ceph/ceph:v15                                                                      93146564743f  a3a431d834fc
>>> crash.ceph2                                                               ceph2  running (9h)  65s ago    13d  15.2.17    quay.io/ceph/ceph:v15                                                                      93146564743f  3c963693ff2b
>>> grafana.ceph1                                                             ceph1  running (9h)  64s ago    2w   6.7.4      quay.io/ceph/ceph-grafana:6.7.4                                                            557c83e11646  7583a8dc4c61
>>> mgr.ceph1.smfvfd                                                          ceph1  running (8h)  64s ago    8h   15.2.17    quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca  93146564743f  1aab837306d2
>>> mon.ceph1                                                                 ceph1  running (9h)  64s ago    2w   15.2.17    quay.io/ceph/ceph:v15                                                                      93146564743f  c1d155d8c7ad
>>> node-exporter.ceph1                                                       ceph1  running (9h)  64s ago    2w   0.18.1     quay.io/prometheus/node-exporter:v0.18.1                                                   e5a616e4b9cf  2ff235fe0e42
>>> node-exporter.ceph2                                                       ceph2  running (9h)  65s ago    13d  0.18.1     quay.io/prometheus/node-exporter:v0.18.1                                                   e5a616e4b9cf  17678b9ba602
>>> osd.0                                                                     ceph1  running (9h)  64s ago    13d  15.2.17    quay.io/ceph/ceph:v15                                                                      93146564743f  d0fd73b777a3
>>> osd.1                                                                     ceph1  running (9h)  64s ago    13d  15.2.17    quay.io/ceph/ceph:v15                                                                      93146564743f  049120e83102
>>> osd.2                                                                     ceph2  running (9h)  65s ago    13d  15.2.17    quay.io/ceph/ceph:v15                                                                      93146564743f  8700e8cefd1f
>>> osd.3                                                                     ceph2  running (9h)  65s ago    13d  15.2.17    quay.io/ceph/ceph:v15                                                                      93146564743f  9c71bc87ed16
>>> prometheus.ceph1                                                          ceph1  running (9h)  64s ago    2w   2.18.1     quay.io/prometheus/prometheus:v2.18.1                                                      de242295e225  74a538efd61e
>>>
>>> On Fri, Sep 2, 2022 at 10:10 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>
>>>> Maybe also a "ceph orch ps --refresh"? It might still have the old cached daemon inventory from before you removed the files.
>>>>
>>>> On Fri, Sep 2, 2022 at 9:57 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>
>>>>> Hi Adam,
>>>>>
>>>>> I have deleted the file located here:
>>>>> rm /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>>
>>>>> But I'm still getting the same error. Do I need to do anything else?
>>>>>
>>>>> On Fri, Sep 2, 2022 at 9:51 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>>>
>>>>>> Okay, I'm wondering if this is an issue with a version mismatch: having previously had a 16.2.10 mgr and now having a 15.2.17 one that doesn't expect this sort of thing to be present. Either way, I'd think deleting this cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d file (and any others like it) would be the way forward to get "ceph orch ls" working again.
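For reference, the cleanup being suggested here looks roughly like the sketch below. The <fsid> and <hash> are placeholders to be filled in from the "cephadm ls" output on the affected host; this is an outline of the idea, not an exact recipe.

# On each host, list everything cephadm thinks it manages and look for
# stray "cephadm.<hash>" entries that are not real daemons
cephadm ls | grep '"name"'

# Remove the stray entry's data under the cluster fsid
rm -rf /var/lib/ceph/<fsid>/cephadm.<hash>

# Then make the orchestrator drop its cached inventory and rescan
ceph orch ps --refresh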
>>>>>> On Fri, Sep 2, 2022 at 9:44 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>
>>>>>>> Hi Adam,
>>>>>>>
>>>>>>> In "cephadm ls" I found the following service, but I believe it was there before as well:
>>>>>>>
>>>>>>> {
>>>>>>>     "style": "cephadm:v1",
>>>>>>>     "name": "cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
>>>>>>>     "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>>     "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d",
>>>>>>>     "enabled": false,
>>>>>>>     "state": "stopped",
>>>>>>>     "container_id": null,
>>>>>>>     "container_image_name": null,
>>>>>>>     "container_image_id": null,
>>>>>>>     "version": null,
>>>>>>>     "started": null,
>>>>>>>     "created": null,
>>>>>>>     "deployed": null,
>>>>>>>     "configured": null
>>>>>>> },
>>>>>>>
>>>>>>> It looks like the remove didn't work:
>>>>>>>
>>>>>>> root@ceph1:~# ceph orch rm cephadm
>>>>>>> Failed to remove service. <cephadm> was not found.
>>>>>>>
>>>>>>> root@ceph1:~# ceph orch rm cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
>>>>>>> Failed to remove service. <cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d> was not found.
>>>>>>>
>>>>>>> On Fri, Sep 2, 2022 at 8:27 AM Adam King <adking@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> This looks like an old traceback you would get if you ended up with a service type that shouldn't be there somehow. The first thing I'd check is that "cephadm ls" on either host definitely doesn't report any strange entries that aren't actually daemons in your cluster, such as "cephadm.<hash>". Another thing you could try, since I believe the assertion it's raising is for an unknown service type ("AssertionError: cephadm"), is "ceph orch rm cephadm", which might cause it to remove whatever it thinks this "cephadm" service is that it has deployed. Lastly, you could try having the mgr you manually deploy be a 16.2.10 one instead of 15.2.17 (I'm assuming here, but the line numbers in that traceback suggest Octopus). The 16.2.10 one is just much less likely to have a bug that causes something like this.
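For reference, the "manually deploy a mgr" path mentioned above (and documented in the cephadm troubleshooting page linked further down the thread) looks roughly like the sketch below. The daemon name mgr.ceph2.test is a made-up example, and exact flags and caps can differ between releases, so the doc for the installed version should be checked.

# Create a keyring for the new mgr daemon (example name)
ceph auth get-or-create mgr.ceph2.test mon 'profile mgr' osd 'allow *' mds 'allow *'

# Put that keyring plus a minimal ceph.conf into a config-json file, e.g.
# { "config": "<ceph.conf contents>", "keyring": "<keyring from above>" }

# Deploy the daemon with cephadm on the target host, pinning the image so
# the new mgr is a 16.2.10 one as suggested above
cephadm --image quay.io/ceph/ceph:v16.2.10 deploy \
    --fsid f270ad9e-1f6f-11ed-b6f8-a539d87379ea \
    --name mgr.ceph2.test \
    --config-json config-json.json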
>>>>>>>> On Fri, Sep 2, 2022 at 1:41 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>>> Now when I run "ceph orch ps" it works, but the following command throws an error. I'm trying to bring up the second mgr using the "ceph orch apply mgr" command, but it didn't help.
>>>>>>>>>
>>>>>>>>> root@ceph1:/ceph-disk# ceph version
>>>>>>>>> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
>>>>>>>>>
>>>>>>>>> root@ceph1:/ceph-disk# ceph orch ls
>>>>>>>>> Error EINVAL: Traceback (most recent call last):
>>>>>>>>>   File "/usr/share/ceph/mgr/mgr_module.py", line 1212, in _handle_command
>>>>>>>>>     return self.handle_command(inbuf, cmd)
>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in handle_command
>>>>>>>>>     return dispatch[cmd['prefix']].call(self, cmd, inbuf)
>>>>>>>>>   File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
>>>>>>>>>     return self.func(mgr, **kwargs)
>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in <lambda>
>>>>>>>>>     wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in wrapper
>>>>>>>>>     return func(*args, **kwargs)
>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/module.py", line 503, in _list_services
>>>>>>>>>     raise_if_exception(completion)
>>>>>>>>>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in raise_if_exception
>>>>>>>>>     raise e
>>>>>>>>> AssertionError: cephadm
>>>>>>>>>
>>>>>>>>> On Fri, Sep 2, 2022 at 1:32 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> > Never mind, I found the doc related to that, and I am able to get one mgr up:
>>>>>>>>> > https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon
>>>>>>>>> >
>>>>>>>>> > On Fri, Sep 2, 2022 at 1:21 AM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>>>> >
>>>>>>>>> >> Folks,
>>>>>>>>> >>
>>>>>>>>> >> I am having a "fun" time with cephadm, and it's very annoying to deal with.
>>>>>>>>> >>
>>>>>>>>> >> I deployed a Ceph cluster using cephadm on two nodes. When I tried to upgrade, I hit a hiccup where it upgraded only a single mgr to 16.2.10 and not the other, so I started messing around and somehow deleted both mgr daemons, thinking cephadm would recreate them.
>>>>>>>>> >>
>>>>>>>>> >> Now I don't have a single mgr, so my "ceph orch" commands hang forever; it looks like a chicken-and-egg issue.
>>>>>>>>> >>
>>>>>>>>> >> How do I recover from this? If I can't run "ceph orch" commands, I won't be able to redeploy my mgr daemons.
>>>>>>>>> >>
>>>>>>>>> >> I am not able to find any mgr with the following command on either node:
>>>>>>>>> >>
>>>>>>>>> >> $ cephadm ls | grep mgr
>>>>>>>>> >>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
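A closing note for anyone landing here from a search: once both mgr daemons are healthy again, the half-finished upgrade that started this thread can be checked and resumed. A short sketch, assuming 16.2.10 is still the intended target:

# Check whether an upgrade is still in progress
ceph orch upgrade status

# Start (or restart) the upgrade to the intended release
ceph orch upgrade start --ceph-version 16.2.10

# Afterwards, confirm all daemons report the same version
ceph versions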