Adam,

Someone on Google suggested the following manual upgrade method, and it seemed to work, but now I am stuck on the MON redeploy.. haha

Go to each mgr's directory, edit the /var/lib/ceph/$fsid/mgr.$whatever/unit.run file, change the image to ceph/ceph:v16.2.10 for both mgrs, and restart the mgr service using systemctl restart <mgr>. After a few minutes I saw that Docker had downloaded the image, and both mgrs are now running version 16.2.10.

I then tried to run the upgrade again and nothing happened, so I used the same manual method on the MON node and ran "ceph orch daemon redeploy mon.ceph1", which destroyed the mon service. Now I can't do anything because I have no mon: "ceph -s" and every other command hangs.

Trying to find out how to get the mon back :)

On Fri, Sep 2, 2022 at 3:34 PM Satish Patel <satish.txt@xxxxxxxxx> wrote: > Yes, i have stopped upgrade and those log before upgrade > > On Fri, Sep 2, 2022 at 3:27 PM Adam King <adking@xxxxxxxxxx> wrote: > >> I don't think the number of mons should have any effect on this. Looking >> at your logs, the interesting thing is that all the messages are so close >> together. Was this before having stopped the upgrade? >> >> On Fri, Sep 2, 2022 at 2:53 PM Satish Patel <satish.txt@xxxxxxxxx> wrote: >> >>> Do you think this is because I have only a single MON daemon running? I >>> have only two nodes. >>> >>> On Fri, Sep 2, 2022 at 2:39 PM Satish Patel <satish.txt@xxxxxxxxx> >>> wrote: >>> >>>> Adam, >>>> >>>> I have enabled debug and my logs flood with the following. I am going >>>> to try some stuff from your provided mailing list and see.. >>>> >>>> root@ceph1:~# tail -f >>>> /var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log >>>> 2022-09-02T18:38:21.754391+0000 mgr.ceph2.huidoh (mgr.344392) 211198 : >>>> cephadm [DBG] 0 OSDs are scheduled for removal: [] >>>> 2022-09-02T18:38:21.754519+0000 mgr.ceph2.huidoh (mgr.344392) 211199 : >>>> cephadm [DBG] Saving [] to store >>>> 2022-09-02T18:38:21.757155+0000 mgr.ceph2.huidoh (mgr.344392) 211200 : >>>> cephadm [DBG] refreshing hosts and daemons >>>> 2022-09-02T18:38:21.758065+0000 mgr.ceph2.huidoh (mgr.344392) 211201 : >>>> cephadm [DBG] _check_for_strays >>>> 2022-09-02T18:38:21.758334+0000 mgr.ceph2.huidoh (mgr.344392) 211202 : >>>> cephadm [DBG] 0 OSDs are scheduled for removal: [] >>>> 2022-09-02T18:38:21.758455+0000 mgr.ceph2.huidoh (mgr.344392) 211203 : >>>> cephadm [DBG] Saving [] to store >>>> 2022-09-02T18:38:21.761001+0000 mgr.ceph2.huidoh (mgr.344392) 211204 : >>>> cephadm [DBG] refreshing hosts and daemons >>>> 2022-09-02T18:38:21.762092+0000 mgr.ceph2.huidoh (mgr.344392) 211205 : >>>> cephadm [DBG] _check_for_strays >>>> 2022-09-02T18:38:21.762357+0000 mgr.ceph2.huidoh (mgr.344392) 211206 : >>>> cephadm [DBG] 0 OSDs are scheduled for removal: [] >>>> 2022-09-02T18:38:21.762480+0000 mgr.ceph2.huidoh (mgr.344392) 211207 : >>>> cephadm [DBG] Saving [] to store >>>> >>>> On Fri, Sep 2, 2022 at 12:17 PM Adam King <adking@xxxxxxxxxx> wrote: >>>> >>>>> hmm, okay. It seems like cephadm is stuck in general rather than an >>>>> issue specific to the upgrade. I'd first make sure the orchestrator isn't >>>>> paused (just running "ceph orch resume" should be enough, it's idempotent). 
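
A minimal check along those lines, assuming a cephadm-managed cluster (the exact "ceph orch status" output differs between releases, and "ceph orch resume" is safe to run even when nothing is paused):

# confirm the cephadm backend is available (newer releases also report a paused flag here)
ceph orch status
# idempotent; clears a paused orchestrator if that is what is blocking things
ceph orch resume
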
>>>>> >>>>> Beyond that, there was someone else who had an issue with things >>>>> getting stuck that was resolved in this thread >>>>> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M >>>>> <https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M> that >>>>> might be worth a look. >>>>> >>>>> If you haven't already, it's possible stopping the upgrade is a good >>>>> idea, as maybe that's interfering with it getting to the point where it >>>>> does the redeploy. >>>>> >>>>> If none of those help, it might be worth setting the log level to >>>>> debug and seeing where things are ending up ("ceph config set mgr >>>>> mgr/cephadm/log_to_cluster_level debug; ceph orch ps --refresh" then >>>>> waiting a few minutes before running "ceph log last 100 debug cephadm" (not >>>>> 100% on format of that command, if it fails try just "ceph log last >>>>> cephadm"). We could maybe get more info on why it's not performing the >>>>> redeploy from those debug logs. Just remember to set the log level back >>>>> after 'ceph config set mgr mgr/cephadm/log_to_cluster_level info' as debug >>>>> logs are quite verbose. >>>>> >>>>> On Fri, Sep 2, 2022 at 11:39 AM Satish Patel <satish.txt@xxxxxxxxx> >>>>> wrote: >>>>> >>>>>> Hi Adam, >>>>>> >>>>>> As you said, i did following >>>>>> >>>>>> $ ceph orch daemon redeploy mgr.ceph1.smfvfd >>>>>> quay.io/ceph/ceph:v16.2.10 >>>>>> >>>>>> Noticed following line in logs but then no activity nothing, still >>>>>> standby mgr running in older version >>>>>> >>>>>> 2022-09-02T15:35:45.753093+0000 mgr.ceph2.huidoh (mgr.344392) 2226 : >>>>>> cephadm [INF] Schedule redeploy daemon mgr.ceph1.smfvfd >>>>>> 2022-09-02T15:36:17.279190+0000 mgr.ceph2.huidoh (mgr.344392) 2245 : >>>>>> cephadm [INF] refreshing ceph2 facts >>>>>> 2022-09-02T15:36:17.984478+0000 mgr.ceph2.huidoh (mgr.344392) 2246 : >>>>>> cephadm [INF] refreshing ceph1 facts >>>>>> 2022-09-02T15:37:17.663730+0000 mgr.ceph2.huidoh (mgr.344392) 2284 : >>>>>> cephadm [INF] refreshing ceph2 facts >>>>>> 2022-09-02T15:37:18.386586+0000 mgr.ceph2.huidoh (mgr.344392) 2285 : >>>>>> cephadm [INF] refreshing ceph1 facts >>>>>> >>>>>> I am not seeing any image get downloaded also >>>>>> >>>>>> root@ceph1:~# docker image ls >>>>>> REPOSITORY TAG IMAGE ID CREATED >>>>>> SIZE >>>>>> quay.io/ceph/ceph v15 93146564743f 3 weeks >>>>>> ago 1.2GB >>>>>> quay.io/ceph/ceph-grafana 8.3.5 dad864ee21e9 4 months >>>>>> ago 558MB >>>>>> quay.io/prometheus/prometheus v2.33.4 514e6a882f6e 6 months >>>>>> ago 204MB >>>>>> quay.io/prometheus/alertmanager v0.23.0 ba2b418f427c 12 >>>>>> months ago 57.5MB >>>>>> quay.io/ceph/ceph-grafana 6.7.4 557c83e11646 13 >>>>>> months ago 486MB >>>>>> quay.io/prometheus/prometheus v2.18.1 de242295e225 2 years >>>>>> ago 140MB >>>>>> quay.io/prometheus/alertmanager v0.20.0 0881eb8f169f 2 years >>>>>> ago 52.1MB >>>>>> quay.io/prometheus/node-exporter v0.18.1 e5a616e4b9cf 3 years >>>>>> ago 22.9MB >>>>>> >>>>>> >>>>>> On Fri, Sep 2, 2022 at 11:06 AM Adam King <adking@xxxxxxxxxx> wrote: >>>>>> >>>>>>> hmm, at this point, maybe we should just try manually upgrading the >>>>>>> mgr daemons and then move from there. First, just stop the upgrade "ceph >>>>>>> orch upgrade stop". 
If you figure out which of the two mgr daemons is the >>>>>>> standby (it should say which one is active in "ceph -s" output) and then do >>>>>>> a "ceph orch daemon redeploy <standby-mgr-name> >>>>>>> quay.io/ceph/ceph:v16.2.10" it should redeploy that specific mgr >>>>>>> with the new version. You could then do a "ceph mgr fail" to swap which of >>>>>>> the mgr daemons is active, then do another "ceph orch daemon redeploy >>>>>>> <standby-mgr-name> quay.io/ceph/ceph:v16.2.10" where the standby is >>>>>>> now the other mgr still on 15.2.17. Once the mgr daemons are both upgraded >>>>>>> to the new version, run a "ceph orch redeploy mgr" and then "ceph orch >>>>>>> upgrade start --image quay.io/ceph/ceph:v16.2.10" and see if it >>>>>>> goes better. >>>>>>> >>>>>>> On Fri, Sep 2, 2022 at 10:36 AM Satish Patel <satish.txt@xxxxxxxxx> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Adam, >>>>>>>> >>>>>>>> I run the following command to upgrade but it looks like nothing is >>>>>>>> happening >>>>>>>> >>>>>>>> $ ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.10 >>>>>>>> >>>>>>>> Status message is empty.. >>>>>>>> >>>>>>>> root@ceph1:~# ceph orch upgrade status >>>>>>>> { >>>>>>>> "target_image": "quay.io/ceph/ceph:v16.2.10", >>>>>>>> "in_progress": true, >>>>>>>> "services_complete": [], >>>>>>>> "message": "" >>>>>>>> } >>>>>>>> >>>>>>>> Nothing in Logs >>>>>>>> >>>>>>>> root@ceph1:~# tail -f >>>>>>>> /var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log >>>>>>>> 2022-09-02T14:31:52.597661+0000 mgr.ceph2.huidoh (mgr.344392) 174 : >>>>>>>> cephadm [INF] refreshing ceph2 facts >>>>>>>> 2022-09-02T14:31:52.991450+0000 mgr.ceph2.huidoh (mgr.344392) 176 : >>>>>>>> cephadm [INF] refreshing ceph1 facts >>>>>>>> 2022-09-02T14:32:52.965092+0000 mgr.ceph2.huidoh (mgr.344392) 207 : >>>>>>>> cephadm [INF] refreshing ceph2 facts >>>>>>>> 2022-09-02T14:32:53.369789+0000 mgr.ceph2.huidoh (mgr.344392) 208 : >>>>>>>> cephadm [INF] refreshing ceph1 facts >>>>>>>> 2022-09-02T14:33:53.367986+0000 mgr.ceph2.huidoh (mgr.344392) 239 : >>>>>>>> cephadm [INF] refreshing ceph2 facts >>>>>>>> 2022-09-02T14:33:53.760427+0000 mgr.ceph2.huidoh (mgr.344392) 240 : >>>>>>>> cephadm [INF] refreshing ceph1 facts >>>>>>>> 2022-09-02T14:34:53.754277+0000 mgr.ceph2.huidoh (mgr.344392) 272 : >>>>>>>> cephadm [INF] refreshing ceph2 facts >>>>>>>> 2022-09-02T14:34:54.162503+0000 mgr.ceph2.huidoh (mgr.344392) 273 : >>>>>>>> cephadm [INF] refreshing ceph1 facts >>>>>>>> 2022-09-02T14:35:54.133467+0000 mgr.ceph2.huidoh (mgr.344392) 305 : >>>>>>>> cephadm [INF] refreshing ceph2 facts >>>>>>>> 2022-09-02T14:35:54.522171+0000 mgr.ceph2.huidoh (mgr.344392) 306 : >>>>>>>> cephadm [INF] refreshing ceph1 facts >>>>>>>> >>>>>>>> In progress that mesg stuck there for long time >>>>>>>> >>>>>>>> root@ceph1:~# ceph -s >>>>>>>> cluster: >>>>>>>> id: f270ad9e-1f6f-11ed-b6f8-a539d87379ea >>>>>>>> health: HEALTH_OK >>>>>>>> >>>>>>>> services: >>>>>>>> mon: 1 daemons, quorum ceph1 (age 9h) >>>>>>>> mgr: ceph2.huidoh(active, since 9m), standbys: ceph1.smfvfd >>>>>>>> osd: 4 osds: 4 up (since 9h), 4 in (since 11h) >>>>>>>> >>>>>>>> data: >>>>>>>> pools: 5 pools, 129 pgs >>>>>>>> objects: 20.06k objects, 83 GiB >>>>>>>> usage: 168 GiB used, 632 GiB / 800 GiB avail >>>>>>>> pgs: 129 active+clean >>>>>>>> >>>>>>>> io: >>>>>>>> client: 12 KiB/s wr, 0 op/s rd, 1 op/s wr >>>>>>>> >>>>>>>> progress: >>>>>>>> Upgrade to quay.io/ceph/ceph:v16.2.10 (0s) >>>>>>>> [............................] 
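
Condensed into commands, the manual mgr upgrade Adam outlines above would look roughly like this sketch; the daemon names are placeholders, so substitute the real ones from "ceph orch ps --daemon-type mgr":

# stop the stuck upgrade before touching the mgr daemons
ceph orch upgrade stop
# redeploy the standby mgr on the new image
ceph orch daemon redeploy <standby-mgr-name> quay.io/ceph/ceph:v16.2.10
# fail over so the not-yet-upgraded mgr becomes the standby
ceph mgr fail
# redeploy the remaining mgr on the new image
ceph orch daemon redeploy <other-mgr-name> quay.io/ceph/ceph:v16.2.10
# once both mgrs are on 16.2.10, reconcile and retry the full upgrade
ceph orch redeploy mgr
ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.10
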
>>>>>>>> >>>>>>>> On Fri, Sep 2, 2022 at 10:25 AM Satish Patel <satish.txt@xxxxxxxxx> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> It Looks like I did it with the following command. >>>>>>>>> >>>>>>>>> $ ceph orch daemon add mgr ceph2:10.73.0.192 >>>>>>>>> >>>>>>>>> Now i can see two with same version 15.x >>>>>>>>> >>>>>>>>> root@ceph1:~# ceph orch ps --daemon-type mgr >>>>>>>>> NAME HOST STATUS REFRESHED AGE VERSION >>>>>>>>> IMAGE NAME >>>>>>>>> IMAGE ID CONTAINER ID >>>>>>>>> mgr.ceph1.smfvfd ceph1 running (8h) 41s ago 8h 15.2.17 >>>>>>>>> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca >>>>>>>>> 93146564743f 1aab837306d2 >>>>>>>>> mgr.ceph2.huidoh ceph2 running (60s) 110s ago 60s 15.2.17 >>>>>>>>> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca >>>>>>>>> 93146564743f 294fd6ab6c97 >>>>>>>>> >>>>>>>>> On Fri, Sep 2, 2022 at 10:19 AM Satish Patel <satish.txt@xxxxxxxxx> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Let's come back to the original question: how to bring back the >>>>>>>>>> second mgr? >>>>>>>>>> >>>>>>>>>> root@ceph1:~# ceph orch apply mgr 2 >>>>>>>>>> Scheduled mgr update... >>>>>>>>>> >>>>>>>>>> Nothing happened with above command, logs saying nothing >>>>>>>>>> >>>>>>>>>> 2022-09-02T14:16:20.407927+0000 mgr.ceph1.smfvfd (mgr.334626) >>>>>>>>>> 16939 : cephadm [INF] refreshing ceph2 facts >>>>>>>>>> 2022-09-02T14:16:40.247195+0000 mgr.ceph1.smfvfd (mgr.334626) >>>>>>>>>> 16952 : cephadm [INF] Saving service mgr spec with placement count:2 >>>>>>>>>> 2022-09-02T14:16:53.106919+0000 mgr.ceph1.smfvfd (mgr.334626) >>>>>>>>>> 16961 : cephadm [INF] Saving service mgr spec with placement count:2 >>>>>>>>>> 2022-09-02T14:17:19.135203+0000 mgr.ceph1.smfvfd (mgr.334626) >>>>>>>>>> 16975 : cephadm [INF] refreshing ceph1 facts >>>>>>>>>> 2022-09-02T14:17:20.780496+0000 mgr.ceph1.smfvfd (mgr.334626) >>>>>>>>>> 16977 : cephadm [INF] refreshing ceph2 facts >>>>>>>>>> 2022-09-02T14:18:19.502034+0000 mgr.ceph1.smfvfd (mgr.334626) >>>>>>>>>> 17008 : cephadm [INF] refreshing ceph1 facts >>>>>>>>>> 2022-09-02T14:18:21.127973+0000 mgr.ceph1.smfvfd (mgr.334626) >>>>>>>>>> 17010 : cephadm [INF] refreshing ceph2 facts >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Sep 2, 2022 at 10:15 AM Satish Patel < >>>>>>>>>> satish.txt@xxxxxxxxx> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Adam, >>>>>>>>>>> >>>>>>>>>>> Wait..wait.. now it's working suddenly without doing anything.. 
>>>>>>>>>>> very odd >>>>>>>>>>> >>>>>>>>>>> root@ceph1:~# ceph orch ls >>>>>>>>>>> NAME RUNNING REFRESHED AGE PLACEMENT >>>>>>>>>>> IMAGE NAME >>>>>>>>>>> IMAGE ID >>>>>>>>>>> alertmanager 1/1 5s ago 2w count:1 >>>>>>>>>>> quay.io/prometheus/alertmanager:v0.20.0 >>>>>>>>>>> 0881eb8f169f >>>>>>>>>>> crash 2/2 5s ago 2w * >>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>> 93146564743f >>>>>>>>>>> grafana 1/1 5s ago 2w count:1 >>>>>>>>>>> quay.io/ceph/ceph-grafana:6.7.4 >>>>>>>>>>> 557c83e11646 >>>>>>>>>>> mgr 1/2 5s ago 8h count:2 >>>>>>>>>>> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca >>>>>>>>>>> 93146564743f >>>>>>>>>>> mon 1/2 5s ago 8h ceph1;ceph2 >>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>> 93146564743f >>>>>>>>>>> node-exporter 2/2 5s ago 2w * >>>>>>>>>>> quay.io/prometheus/node-exporter:v0.18.1 >>>>>>>>>>> e5a616e4b9cf >>>>>>>>>>> osd.osd_spec_default 4/0 5s ago - <unmanaged> >>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>> 93146564743f >>>>>>>>>>> prometheus 1/1 5s ago 2w count:1 >>>>>>>>>>> quay.io/prometheus/prometheus:v2.18.1 >>>>>>>>>>> >>>>>>>>>>> On Fri, Sep 2, 2022 at 10:13 AM Satish Patel < >>>>>>>>>>> satish.txt@xxxxxxxxx> wrote: >>>>>>>>>>> >>>>>>>>>>>> I can see that in the output but I'm not sure how to get rid of >>>>>>>>>>>> it. >>>>>>>>>>>> >>>>>>>>>>>> root@ceph1:~# ceph orch ps --refresh >>>>>>>>>>>> NAME >>>>>>>>>>>> HOST STATUS REFRESHED AGE VERSION IMAGE NAME >>>>>>>>>>>> >>>>>>>>>>>> IMAGE ID CONTAINER ID >>>>>>>>>>>> alertmanager.ceph1 >>>>>>>>>>>> ceph1 running (9h) 64s ago 2w 0.20.0 >>>>>>>>>>>> quay.io/prometheus/alertmanager:v0.20.0 >>>>>>>>>>>> 0881eb8f169f ba804b555378 >>>>>>>>>>>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >>>>>>>>>>>> ceph2 stopped 65s ago - <unknown> <unknown> >>>>>>>>>>>> <unknown> >>>>>>>>>>>> <unknown> >>>>>>>>>>>> crash.ceph1 >>>>>>>>>>>> ceph1 running (9h) 64s ago 2w 15.2.17 >>>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>>> 93146564743f a3a431d834fc >>>>>>>>>>>> crash.ceph2 >>>>>>>>>>>> ceph2 running (9h) 65s ago 13d 15.2.17 >>>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>>> 93146564743f 3c963693ff2b >>>>>>>>>>>> grafana.ceph1 >>>>>>>>>>>> ceph1 running (9h) 64s ago 2w 6.7.4 >>>>>>>>>>>> quay.io/ceph/ceph-grafana:6.7.4 >>>>>>>>>>>> 557c83e11646 7583a8dc4c61 >>>>>>>>>>>> mgr.ceph1.smfvfd >>>>>>>>>>>> ceph1 running (8h) 64s ago 8h 15.2.17 >>>>>>>>>>>> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca >>>>>>>>>>>> 93146564743f 1aab837306d2 >>>>>>>>>>>> mon.ceph1 >>>>>>>>>>>> ceph1 running (9h) 64s ago 2w 15.2.17 >>>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>>> 93146564743f c1d155d8c7ad >>>>>>>>>>>> node-exporter.ceph1 >>>>>>>>>>>> ceph1 running (9h) 64s ago 2w 0.18.1 >>>>>>>>>>>> quay.io/prometheus/node-exporter:v0.18.1 >>>>>>>>>>>> e5a616e4b9cf 2ff235fe0e42 >>>>>>>>>>>> node-exporter.ceph2 >>>>>>>>>>>> ceph2 running (9h) 65s ago 13d 0.18.1 >>>>>>>>>>>> quay.io/prometheus/node-exporter:v0.18.1 >>>>>>>>>>>> e5a616e4b9cf 17678b9ba602 >>>>>>>>>>>> osd.0 >>>>>>>>>>>> ceph1 running (9h) 64s ago 13d 15.2.17 >>>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>>> 93146564743f d0fd73b777a3 >>>>>>>>>>>> osd.1 >>>>>>>>>>>> ceph1 running (9h) 64s ago 13d 15.2.17 >>>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>>> 93146564743f 049120e83102 >>>>>>>>>>>> osd.2 >>>>>>>>>>>> ceph2 running (9h) 65s ago 13d 15.2.17 >>>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>>> 93146564743f 8700e8cefd1f >>>>>>>>>>>> osd.3 >>>>>>>>>>>> ceph2 running (9h) 65s ago 13d 15.2.17 
>>>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>>> 93146564743f 9c71bc87ed16 >>>>>>>>>>>> prometheus.ceph1 >>>>>>>>>>>> ceph1 running (9h) 64s ago 2w 2.18.1 >>>>>>>>>>>> quay.io/prometheus/prometheus:v2.18.1 >>>>>>>>>>>> de242295e225 74a538efd61e >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Sep 2, 2022 at 10:10 AM Adam King <adking@xxxxxxxxxx> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> maybe also a "ceph orch ps --refresh"? It might still have the >>>>>>>>>>>>> old cached daemon inventory from before you remove the files. >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, Sep 2, 2022 at 9:57 AM Satish Patel < >>>>>>>>>>>>> satish.txt@xxxxxxxxx> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Adam, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I have deleted file located here - rm >>>>>>>>>>>>>> /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >>>>>>>>>>>>>> >>>>>>>>>>>>>> But still getting the same error, do i need to do anything >>>>>>>>>>>>>> else? >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Sep 2, 2022 at 9:51 AM Adam King <adking@xxxxxxxxxx> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Okay, I'm wondering if this is an issue with version >>>>>>>>>>>>>>> mismatch. Having previously had a 16.2.10 mgr and then now having a 15.2.17 >>>>>>>>>>>>>>> one that doesn't expect this sort of thing to be present. Either way, I'd >>>>>>>>>>>>>>> think just deleting this cephadm. >>>>>>>>>>>>>>> 7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >>>>>>>>>>>>>>> (and any others like it) file would be the way forward to >>>>>>>>>>>>>>> get orch ls working again. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Fri, Sep 2, 2022 at 9:44 AM Satish Patel < >>>>>>>>>>>>>>> satish.txt@xxxxxxxxx> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Adam, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> In cephadm ls i found the following service but i believe >>>>>>>>>>>>>>>> it was there before also. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> "style": "cephadm:v1", >>>>>>>>>>>>>>>> "name": >>>>>>>>>>>>>>>> "cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d", >>>>>>>>>>>>>>>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >>>>>>>>>>>>>>>> "systemd_unit": >>>>>>>>>>>>>>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >>>>>>>>>>>>>>>> ", >>>>>>>>>>>>>>>> "enabled": false, >>>>>>>>>>>>>>>> "state": "stopped", >>>>>>>>>>>>>>>> "container_id": null, >>>>>>>>>>>>>>>> "container_image_name": null, >>>>>>>>>>>>>>>> "container_image_id": null, >>>>>>>>>>>>>>>> "version": null, >>>>>>>>>>>>>>>> "started": null, >>>>>>>>>>>>>>>> "created": null, >>>>>>>>>>>>>>>> "deployed": null, >>>>>>>>>>>>>>>> "configured": null >>>>>>>>>>>>>>>> }, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Look like remove didn't work >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> root@ceph1:~# ceph orch rm cephadm >>>>>>>>>>>>>>>> Failed to remove service. <cephadm> was not found. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> root@ceph1:~# ceph orch rm >>>>>>>>>>>>>>>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >>>>>>>>>>>>>>>> Failed to remove service. >>>>>>>>>>>>>>>> <cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d> >>>>>>>>>>>>>>>> was not found. 
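
For completeness, the cleanup that eventually got "ceph orch ls" working again (per Adam's suggestions above) amounts to removing the stray cephadm.<hash> file and refreshing the cached inventory; <fsid> and <hash> here are placeholders for the values shown by "cephadm ls":

# on the host where "cephadm ls" reports the stray cephadm.<hash> entry
rm /var/lib/ceph/<fsid>/cephadm.<hash>
# force cephadm to rebuild its cached daemon inventory
ceph orch ps --refresh
ceph orch ls
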
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Sep 2, 2022 at 8:27 AM Adam King <adking@xxxxxxxxxx> >>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> this looks like an old traceback you would get if you >>>>>>>>>>>>>>>>> ended up with a service type that shouldn't be there somehow. The things >>>>>>>>>>>>>>>>> I'd probably check are that "cephadm ls" on either host definitely doesn't >>>>>>>>>>>>>>>>> report and strange things that aren't actually daemons in your cluster such >>>>>>>>>>>>>>>>> as "cephadm.<hash>". Another thing you could maybe try, as I believe the >>>>>>>>>>>>>>>>> assertion it's giving is for an unknown service type here ("AssertionError: >>>>>>>>>>>>>>>>> cephadm"), is just "ceph orch rm cephadm" which would maybe cause it to >>>>>>>>>>>>>>>>> remove whatever it thinks is this "cephadm" service that it has deployed. >>>>>>>>>>>>>>>>> Lastly, you could try having the mgr you manually deploy be a 16.2.10 one >>>>>>>>>>>>>>>>> instead of 15.2.17 (I'm assuming here, but the line numbers in that >>>>>>>>>>>>>>>>> traceback suggest octopus). The 16.2.10 one is just much less likely to >>>>>>>>>>>>>>>>> have a bug that causes something like this. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, Sep 2, 2022 at 1:41 AM Satish Patel < >>>>>>>>>>>>>>>>> satish.txt@xxxxxxxxx> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Now when I run "ceph orch ps" it works but the following >>>>>>>>>>>>>>>>>> command throws an >>>>>>>>>>>>>>>>>> error. Trying to bring up second mgr using ceph orch >>>>>>>>>>>>>>>>>> apply mgr command but >>>>>>>>>>>>>>>>>> didn't help >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> root@ceph1:/ceph-disk# ceph version >>>>>>>>>>>>>>>>>> ceph version 15.2.17 >>>>>>>>>>>>>>>>>> (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus >>>>>>>>>>>>>>>>>> (stable) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> root@ceph1:/ceph-disk# ceph orch ls >>>>>>>>>>>>>>>>>> Error EINVAL: Traceback (most recent call last): >>>>>>>>>>>>>>>>>> File "/usr/share/ceph/mgr/mgr_module.py", line 1212, in >>>>>>>>>>>>>>>>>> _handle_command >>>>>>>>>>>>>>>>>> return self.handle_command(inbuf, cmd) >>>>>>>>>>>>>>>>>> File "/usr/share/ceph/mgr/orchestrator/_interface.py", >>>>>>>>>>>>>>>>>> line 140, in >>>>>>>>>>>>>>>>>> handle_command >>>>>>>>>>>>>>>>>> return dispatch[cmd['prefix']].call(self, cmd, inbuf) >>>>>>>>>>>>>>>>>> File "/usr/share/ceph/mgr/mgr_module.py", line 320, in >>>>>>>>>>>>>>>>>> call >>>>>>>>>>>>>>>>>> return self.func(mgr, **kwargs) >>>>>>>>>>>>>>>>>> File "/usr/share/ceph/mgr/orchestrator/_interface.py", >>>>>>>>>>>>>>>>>> line 102, in >>>>>>>>>>>>>>>>>> <lambda> >>>>>>>>>>>>>>>>>> wrapper_copy = lambda *l_args, **l_kwargs: >>>>>>>>>>>>>>>>>> wrapper(*l_args, **l_kwargs) >>>>>>>>>>>>>>>>>> File "/usr/share/ceph/mgr/orchestrator/_interface.py", >>>>>>>>>>>>>>>>>> line 91, in wrapper >>>>>>>>>>>>>>>>>> return func(*args, **kwargs) >>>>>>>>>>>>>>>>>> File "/usr/share/ceph/mgr/orchestrator/module.py", line >>>>>>>>>>>>>>>>>> 503, in >>>>>>>>>>>>>>>>>> _list_services >>>>>>>>>>>>>>>>>> raise_if_exception(completion) >>>>>>>>>>>>>>>>>> File "/usr/share/ceph/mgr/orchestrator/_interface.py", >>>>>>>>>>>>>>>>>> line 642, in >>>>>>>>>>>>>>>>>> raise_if_exception >>>>>>>>>>>>>>>>>> raise e >>>>>>>>>>>>>>>>>> AssertionError: cephadm >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Sep 2, 2022 at 1:32 AM Satish Patel < >>>>>>>>>>>>>>>>>> satish.txt@xxxxxxxxx> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> > nevermind, i found doc related that and i am able to >>>>>>>>>>>>>>>>>> get 1 mgr up - >>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>> https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > On Fri, Sep 2, 2022 at 1:21 AM Satish Patel < >>>>>>>>>>>>>>>>>> satish.txt@xxxxxxxxx> wrote: >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> >> Folks, >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> I am having little fun time with cephadm and it's very >>>>>>>>>>>>>>>>>> annoying to deal >>>>>>>>>>>>>>>>>> >> with it >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> I have deployed a ceph cluster using cephadm on two >>>>>>>>>>>>>>>>>> nodes. Now when i was >>>>>>>>>>>>>>>>>> >> trying to upgrade and noticed hiccups where it just >>>>>>>>>>>>>>>>>> upgraded a single mgr >>>>>>>>>>>>>>>>>> >> with 16.2.10 but not other so i started messing around >>>>>>>>>>>>>>>>>> and somehow I >>>>>>>>>>>>>>>>>> >> deleted both mgr in the thought that cephadm will >>>>>>>>>>>>>>>>>> recreate them. >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> Now i don't have any single mgr so my ceph orch >>>>>>>>>>>>>>>>>> command hangs forever and >>>>>>>>>>>>>>>>>> >> looks like a chicken egg issue. >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> How do I recover from this? If I can't run the ceph >>>>>>>>>>>>>>>>>> orch command, I won't >>>>>>>>>>>>>>>>>> >> be able to redeploy my mgr daemons. >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> I am not able to find any mgr in the following command >>>>>>>>>>>>>>>>>> on both nodes. >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> $ cephadm ls | grep mgr >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx >>>>>>>>>>>>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx
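
As for the open question at the top of the thread (getting the single mon back): a minimal first check, assuming the mon.ceph1 daemon and its data directory survived the failed redeploy, is to see whether cephadm still lists the daemon on ceph1 and, if so, to start its systemd unit directly rather than through the (currently hanging) orchestrator:

# on ceph1: does cephadm still know about the mon?
cephadm ls | grep mon
# if it is listed, try starting the unit outside the orchestrator
systemctl start ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mon.ceph1.service
systemctl status ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mon.ceph1.service

If the monitor's store under /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/mon.ceph1 is intact, starting the unit should restore the single-mon quorum and unblock "ceph -s".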