Adam,

Someone on Google suggested the following manual upgrade method, and it seemed to work, but now I am stuck on the MON redeploy.. haha

Go to each mgr's directory, edit the /var/lib/ceph/$fsid/mgr.$whatever/unit.run file, change the image to ceph/ceph:v16.2.10 for both mgrs, and restart the mgr service using systemctl restart <mgr>. After a few minutes I saw that Docker had downloaded the image, and both mgrs are now running version 16.2.10.

I then tried to run the upgrade again and nothing happened, so I used the same manual method on the MON node and ran "ceph orch daemon redeploy mon.ceph1", which destroyed the mon service. Now I can't do anything because I have no mon: "ceph -s" and every other command hangs.

Trying to find out how to get the mon back :)

On Fri, Sep 2, 2022 at 3:34 PM Satish Patel <satish.txt@xxxxxxxxx> wrote: > Yes, i have stopped upgrade and those log before upgrade > > On Fri, Sep 2, 2022 at 3:27 PM Adam King <adking@xxxxxxxxxx> wrote: > >> I don't think the number of mons should have any effect on this. Looking >> at your logs, the interesting thing is that all the messages are so close >> together. Was this before having stopped the upgrade? >> >> On Fri, Sep 2, 2022 at 2:53 PM Satish Patel <satish.txt@xxxxxxxxx> wrote: >> >>> Do you think this is because I have only a single MON daemon running? I >>> have only two nodes. >>> >>> On Fri, Sep 2, 2022 at 2:39 PM Satish Patel <satish.txt@xxxxxxxxx> >>> wrote: >>> >>>> Adam, >>>> >>>> I have enabled debug and my logs flood with the following. I am going >>>> to try some stuff from your provided mailing list and see.. >>>> >>>> root@ceph1:~# tail -f >>>> /var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log >>>> 2022-09-02T18:38:21.754391+0000 mgr.ceph2.huidoh (mgr.344392) 211198 : >>>> cephadm [DBG] 0 OSDs are scheduled for removal: [] >>>> 2022-09-02T18:38:21.754519+0000 mgr.ceph2.huidoh (mgr.344392) 211199 : >>>> cephadm [DBG] Saving [] to store >>>> 2022-09-02T18:38:21.757155+0000 mgr.ceph2.huidoh (mgr.344392) 211200 : >>>> cephadm [DBG] refreshing hosts and daemons >>>> 2022-09-02T18:38:21.758065+0000 mgr.ceph2.huidoh (mgr.344392) 211201 : >>>> cephadm [DBG] _check_for_strays >>>> 2022-09-02T18:38:21.758334+0000 mgr.ceph2.huidoh (mgr.344392) 211202 : >>>> cephadm [DBG] 0 OSDs are scheduled for removal: [] >>>> 2022-09-02T18:38:21.758455+0000 mgr.ceph2.huidoh (mgr.344392) 211203 : >>>> cephadm [DBG] Saving [] to store >>>> 2022-09-02T18:38:21.761001+0000 mgr.ceph2.huidoh (mgr.344392) 211204 : >>>> cephadm [DBG] refreshing hosts and daemons >>>> 2022-09-02T18:38:21.762092+0000 mgr.ceph2.huidoh (mgr.344392) 211205 : >>>> cephadm [DBG] _check_for_strays >>>> 2022-09-02T18:38:21.762357+0000 mgr.ceph2.huidoh (mgr.344392) 211206 : >>>> cephadm [DBG] 0 OSDs are scheduled for removal: [] >>>> 2022-09-02T18:38:21.762480+0000 mgr.ceph2.huidoh (mgr.344392) 211207 : >>>> cephadm [DBG] Saving [] to store >>>> >>>> On Fri, Sep 2, 2022 at 12:17 PM Adam King <adking@xxxxxxxxxx> wrote: >>>> >>>>> hmm, okay. It seems like cephadm is stuck in general rather than an >>>>> issue specific to the upgrade. I'd first make sure the orchestrator isn't >>>>> paused (just running "ceph orch resume" should be enough, it's idempotent). 
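
A minimal check along those lines, assuming a cephadm-managed cluster (the exact "ceph orch status" output differs between releases, and "ceph orch resume" is safe to run even when nothing is paused):

# confirm the cephadm backend is available (newer releases also report a paused flag here)
ceph orch status
# idempotent; clears a paused orchestrator if that is what is blocking things
ceph orch resume
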
>>>>> >>>>> Beyond that, there was someone else who had an issue with things >>>>> getting stuck that was resolved in this thread >>>>> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M >>>>> <https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M> that >>>>> might be worth a look. >>>>> >>>>> If you haven't already, it's possible stopping the upgrade is a good >>>>> idea, as maybe that's interfering with it getting to the point where it >>>>> does the redeploy. >>>>> >>>>> If none of those help, it might be worth setting the log level to >>>>> debug and seeing where things are ending up ("ceph config set mgr >>>>> mgr/cephadm/log_to_cluster_level debug; ceph orch ps --refresh" then >>>>> waiting a few minutes before running "ceph log last 100 debug cephadm" (not >>>>> 100% on format of that command, if it fails try just "ceph log last >>>>> cephadm"). We could maybe get more info on why it's not performing the >>>>> redeploy from those debug logs. Just remember to set the log level back >>>>> after 'ceph config set mgr mgr/cephadm/log_to_cluster_level info' as debug >>>>> logs are quite verbose. >>>>> >>>>> On Fri, Sep 2, 2022 at 11:39 AM Satish Patel <satish.txt@xxxxxxxxx> >>>>> wrote: >>>>> >>>>>> Hi Adam, >>>>>> >>>>>> As you said, i did following >>>>>> >>>>>> $ ceph orch daemon redeploy mgr.ceph1.smfvfd >>>>>> quay.io/ceph/ceph:v16.2.10 >>>>>> >>>>>> Noticed following line in logs but then no activity nothing, still >>>>>> standby mgr running in older version >>>>>> >>>>>> 2022-09-02T15:35:45.753093+0000 mgr.ceph2.huidoh (mgr.344392) 2226 : >>>>>> cephadm [INF] Schedule redeploy daemon mgr.ceph1.smfvfd >>>>>> 2022-09-02T15:36:17.279190+0000 mgr.ceph2.huidoh (mgr.344392) 2245 : >>>>>> cephadm [INF] refreshing ceph2 facts >>>>>> 2022-09-02T15:36:17.984478+0000 mgr.ceph2.huidoh (mgr.344392) 2246 : >>>>>> cephadm [INF] refreshing ceph1 facts >>>>>> 2022-09-02T15:37:17.663730+0000 mgr.ceph2.huidoh (mgr.344392) 2284 : >>>>>> cephadm [INF] refreshing ceph2 facts >>>>>> 2022-09-02T15:37:18.386586+0000 mgr.ceph2.huidoh (mgr.344392) 2285 : >>>>>> cephadm [INF] refreshing ceph1 facts >>>>>> >>>>>> I am not seeing any image get downloaded also >>>>>> >>>>>> root@ceph1:~# docker image ls >>>>>> REPOSITORY TAG IMAGE ID CREATED >>>>>> SIZE >>>>>> quay.io/ceph/ceph v15 93146564743f 3 weeks >>>>>> ago 1.2GB >>>>>> quay.io/ceph/ceph-grafana 8.3.5 dad864ee21e9 4 months >>>>>> ago 558MB >>>>>> quay.io/prometheus/prometheus v2.33.4 514e6a882f6e 6 months >>>>>> ago 204MB >>>>>> quay.io/prometheus/alertmanager v0.23.0 ba2b418f427c 12 >>>>>> months ago 57.5MB >>>>>> quay.io/ceph/ceph-grafana 6.7.4 557c83e11646 13 >>>>>> months ago 486MB >>>>>> quay.io/prometheus/prometheus v2.18.1 de242295e225 2 years >>>>>> ago 140MB >>>>>> quay.io/prometheus/alertmanager v0.20.0 0881eb8f169f 2 years >>>>>> ago 52.1MB >>>>>> quay.io/prometheus/node-exporter v0.18.1 e5a616e4b9cf 3 years >>>>>> ago 22.9MB >>>>>> >>>>>> >>>>>> On Fri, Sep 2, 2022 at 11:06 AM Adam King <adking@xxxxxxxxxx> wrote: >>>>>> >>>>>>> hmm, at this point, maybe we should just try manually upgrading the >>>>>>> mgr daemons and then move from there. First, just stop the upgrade "ceph >>>>>>> orch upgrade stop". 
If you figure out which of the two mgr daemons is the >>>>>>> standby (it should say which one is active in "ceph -s" output) and then do >>>>>>> a "ceph orch daemon redeploy <standby-mgr-name> >>>>>>> quay.io/ceph/ceph:v16.2.10" it should redeploy that specific mgr >>>>>>> with the new version. You could then do a "ceph mgr fail" to swap which of >>>>>>> the mgr daemons is active, then do another "ceph orch daemon redeploy >>>>>>> <standby-mgr-name> quay.io/ceph/ceph:v16.2.10" where the standby is >>>>>>> now the other mgr still on 15.2.17. Once the mgr daemons are both upgraded >>>>>>> to the new version, run a "ceph orch redeploy mgr" and then "ceph orch >>>>>>> upgrade start --image quay.io/ceph/ceph:v16.2.10" and see if it >>>>>>> goes better. >>>>>>> >>>>>>> On Fri, Sep 2, 2022 at 10:36 AM Satish Patel <satish.txt@xxxxxxxxx> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Adam, >>>>>>>> >>>>>>>> I run the following command to upgrade but it looks like nothing is >>>>>>>> happening >>>>>>>> >>>>>>>> $ ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.10 >>>>>>>> >>>>>>>> Status message is empty.. >>>>>>>> >>>>>>>> root@ceph1:~# ceph orch upgrade status >>>>>>>> { >>>>>>>> "target_image": "quay.io/ceph/ceph:v16.2.10", >>>>>>>> "in_progress": true, >>>>>>>> "services_complete": [], >>>>>>>> "message": "" >>>>>>>> } >>>>>>>> >>>>>>>> Nothing in Logs >>>>>>>> >>>>>>>> root@ceph1:~# tail -f >>>>>>>> /var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log >>>>>>>> 2022-09-02T14:31:52.597661+0000 mgr.ceph2.huidoh (mgr.344392) 174 : >>>>>>>> cephadm [INF] refreshing ceph2 facts >>>>>>>> 2022-09-02T14:31:52.991450+0000 mgr.ceph2.huidoh (mgr.344392) 176 : >>>>>>>> cephadm [INF] refreshing ceph1 facts >>>>>>>> 2022-09-02T14:32:52.965092+0000 mgr.ceph2.huidoh (mgr.344392) 207 : >>>>>>>> cephadm [INF] refreshing ceph2 facts >>>>>>>> 2022-09-02T14:32:53.369789+0000 mgr.ceph2.huidoh (mgr.344392) 208 : >>>>>>>> cephadm [INF] refreshing ceph1 facts >>>>>>>> 2022-09-02T14:33:53.367986+0000 mgr.ceph2.huidoh (mgr.344392) 239 : >>>>>>>> cephadm [INF] refreshing ceph2 facts >>>>>>>> 2022-09-02T14:33:53.760427+0000 mgr.ceph2.huidoh (mgr.344392) 240 : >>>>>>>> cephadm [INF] refreshing ceph1 facts >>>>>>>> 2022-09-02T14:34:53.754277+0000 mgr.ceph2.huidoh (mgr.344392) 272 : >>>>>>>> cephadm [INF] refreshing ceph2 facts >>>>>>>> 2022-09-02T14:34:54.162503+0000 mgr.ceph2.huidoh (mgr.344392) 273 : >>>>>>>> cephadm [INF] refreshing ceph1 facts >>>>>>>> 2022-09-02T14:35:54.133467+0000 mgr.ceph2.huidoh (mgr.344392) 305 : >>>>>>>> cephadm [INF] refreshing ceph2 facts >>>>>>>> 2022-09-02T14:35:54.522171+0000 mgr.ceph2.huidoh (mgr.344392) 306 : >>>>>>>> cephadm [INF] refreshing ceph1 facts >>>>>>>> >>>>>>>> In progress that mesg stuck there for long time >>>>>>>> >>>>>>>> root@ceph1:~# ceph -s >>>>>>>> cluster: >>>>>>>> id: f270ad9e-1f6f-11ed-b6f8-a539d87379ea >>>>>>>> health: HEALTH_OK >>>>>>>> >>>>>>>> services: >>>>>>>> mon: 1 daemons, quorum ceph1 (age 9h) >>>>>>>> mgr: ceph2.huidoh(active, since 9m), standbys: ceph1.smfvfd >>>>>>>> osd: 4 osds: 4 up (since 9h), 4 in (since 11h) >>>>>>>> >>>>>>>> data: >>>>>>>> pools: 5 pools, 129 pgs >>>>>>>> objects: 20.06k objects, 83 GiB >>>>>>>> usage: 168 GiB used, 632 GiB / 800 GiB avail >>>>>>>> pgs: 129 active+clean >>>>>>>> >>>>>>>> io: >>>>>>>> client: 12 KiB/s wr, 0 op/s rd, 1 op/s wr >>>>>>>> >>>>>>>> progress: >>>>>>>> Upgrade to quay.io/ceph/ceph:v16.2.10 (0s) >>>>>>>> [............................] 
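
Condensed into commands, the manual mgr upgrade Adam outlines above would look roughly like this sketch; the daemon names are placeholders, so substitute the real ones from "ceph orch ps --daemon-type mgr":

# stop the stuck upgrade before touching the mgr daemons
ceph orch upgrade stop
# redeploy the standby mgr on the new image
ceph orch daemon redeploy <standby-mgr-name> quay.io/ceph/ceph:v16.2.10
# fail over so the not-yet-upgraded mgr becomes the standby
ceph mgr fail
# redeploy the remaining mgr on the new image
ceph orch daemon redeploy <other-mgr-name> quay.io/ceph/ceph:v16.2.10
# once both mgrs are on 16.2.10, reconcile and retry the full upgrade
ceph orch redeploy mgr
ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.10
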
>>>>>>>> >>>>>>>> On Fri, Sep 2, 2022 at 10:25 AM Satish Patel <satish.txt@xxxxxxxxx> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> It Looks like I did it with the following command. >>>>>>>>> >>>>>>>>> $ ceph orch daemon add mgr ceph2:10.73.0.192 >>>>>>>>> >>>>>>>>> Now i can see two with same version 15.x >>>>>>>>> >>>>>>>>> root@ceph1:~# ceph orch ps --daemon-type mgr >>>>>>>>> NAME HOST STATUS REFRESHED AGE VERSION >>>>>>>>> IMAGE NAME >>>>>>>>> IMAGE ID CONTAINER ID >>>>>>>>> mgr.ceph1.smfvfd ceph1 running (8h) 41s ago 8h 15.2.17 >>>>>>>>> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca >>>>>>>>> 93146564743f 1aab837306d2 >>>>>>>>> mgr.ceph2.huidoh ceph2 running (60s) 110s ago 60s 15.2.17 >>>>>>>>> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca >>>>>>>>> 93146564743f 294fd6ab6c97 >>>>>>>>> >>>>>>>>> On Fri, Sep 2, 2022 at 10:19 AM Satish Patel <satish.txt@xxxxxxxxx> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Let's come back to the original question: how to bring back the >>>>>>>>>> second mgr? >>>>>>>>>> >>>>>>>>>> root@ceph1:~# ceph orch apply mgr 2 >>>>>>>>>> Scheduled mgr update... >>>>>>>>>> >>>>>>>>>> Nothing happened with above command, logs saying nothing >>>>>>>>>> >>>>>>>>>> 2022-09-02T14:16:20.407927+0000 mgr.ceph1.smfvfd (mgr.334626) >>>>>>>>>> 16939 : cephadm [INF] refreshing ceph2 facts >>>>>>>>>> 2022-09-02T14:16:40.247195+0000 mgr.ceph1.smfvfd (mgr.334626) >>>>>>>>>> 16952 : cephadm [INF] Saving service mgr spec with placement count:2 >>>>>>>>>> 2022-09-02T14:16:53.106919+0000 mgr.ceph1.smfvfd (mgr.334626) >>>>>>>>>> 16961 : cephadm [INF] Saving service mgr spec with placement count:2 >>>>>>>>>> 2022-09-02T14:17:19.135203+0000 mgr.ceph1.smfvfd (mgr.334626) >>>>>>>>>> 16975 : cephadm [INF] refreshing ceph1 facts >>>>>>>>>> 2022-09-02T14:17:20.780496+0000 mgr.ceph1.smfvfd (mgr.334626) >>>>>>>>>> 16977 : cephadm [INF] refreshing ceph2 facts >>>>>>>>>> 2022-09-02T14:18:19.502034+0000 mgr.ceph1.smfvfd (mgr.334626) >>>>>>>>>> 17008 : cephadm [INF] refreshing ceph1 facts >>>>>>>>>> 2022-09-02T14:18:21.127973+0000 mgr.ceph1.smfvfd (mgr.334626) >>>>>>>>>> 17010 : cephadm [INF] refreshing ceph2 facts >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Sep 2, 2022 at 10:15 AM Satish Patel < >>>>>>>>>> satish.txt@xxxxxxxxx> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Adam, >>>>>>>>>>> >>>>>>>>>>> Wait..wait.. now it's working suddenly without doing anything.. 
>>>>>>>>>>> very odd >>>>>>>>>>> >>>>>>>>>>> root@ceph1:~# ceph orch ls >>>>>>>>>>> NAME RUNNING REFRESHED AGE PLACEMENT >>>>>>>>>>> IMAGE NAME >>>>>>>>>>> IMAGE ID >>>>>>>>>>> alertmanager 1/1 5s ago 2w count:1 >>>>>>>>>>> quay.io/prometheus/alertmanager:v0.20.0 >>>>>>>>>>> 0881eb8f169f >>>>>>>>>>> crash 2/2 5s ago 2w * >>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>> 93146564743f >>>>>>>>>>> grafana 1/1 5s ago 2w count:1 >>>>>>>>>>> quay.io/ceph/ceph-grafana:6.7.4 >>>>>>>>>>> 557c83e11646 >>>>>>>>>>> mgr 1/2 5s ago 8h count:2 >>>>>>>>>>> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca >>>>>>>>>>> 93146564743f >>>>>>>>>>> mon 1/2 5s ago 8h ceph1;ceph2 >>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>> 93146564743f >>>>>>>>>>> node-exporter 2/2 5s ago 2w * >>>>>>>>>>> quay.io/prometheus/node-exporter:v0.18.1 >>>>>>>>>>> e5a616e4b9cf >>>>>>>>>>> osd.osd_spec_default 4/0 5s ago - <unmanaged> >>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>> 93146564743f >>>>>>>>>>> prometheus 1/1 5s ago 2w count:1 >>>>>>>>>>> quay.io/prometheus/prometheus:v2.18.1 >>>>>>>>>>> >>>>>>>>>>> On Fri, Sep 2, 2022 at 10:13 AM Satish Patel < >>>>>>>>>>> satish.txt@xxxxxxxxx> wrote: >>>>>>>>>>> >>>>>>>>>>>> I can see that in the output but I'm not sure how to get rid of >>>>>>>>>>>> it. >>>>>>>>>>>> >>>>>>>>>>>> root@ceph1:~# ceph orch ps --refresh >>>>>>>>>>>> NAME >>>>>>>>>>>> HOST STATUS REFRESHED AGE VERSION IMAGE NAME >>>>>>>>>>>> >>>>>>>>>>>> IMAGE ID CONTAINER ID >>>>>>>>>>>> alertmanager.ceph1 >>>>>>>>>>>> ceph1 running (9h) 64s ago 2w 0.20.0 >>>>>>>>>>>> quay.io/prometheus/alertmanager:v0.20.0 >>>>>>>>>>>> 0881eb8f169f ba804b555378 >>>>>>>>>>>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >>>>>>>>>>>> ceph2 stopped 65s ago - <unknown> <unknown> >>>>>>>>>>>> <unknown> >>>>>>>>>>>> <unknown> >>>>>>>>>>>> crash.ceph1 >>>>>>>>>>>> ceph1 running (9h) 64s ago 2w 15.2.17 >>>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>>> 93146564743f a3a431d834fc >>>>>>>>>>>> crash.ceph2 >>>>>>>>>>>> ceph2 running (9h) 65s ago 13d 15.2.17 >>>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>>> 93146564743f 3c963693ff2b >>>>>>>>>>>> grafana.ceph1 >>>>>>>>>>>> ceph1 running (9h) 64s ago 2w 6.7.4 >>>>>>>>>>>> quay.io/ceph/ceph-grafana:6.7.4 >>>>>>>>>>>> 557c83e11646 7583a8dc4c61 >>>>>>>>>>>> mgr.ceph1.smfvfd >>>>>>>>>>>> ceph1 running (8h) 64s ago 8h 15.2.17 >>>>>>>>>>>> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca >>>>>>>>>>>> 93146564743f 1aab837306d2 >>>>>>>>>>>> mon.ceph1 >>>>>>>>>>>> ceph1 running (9h) 64s ago 2w 15.2.17 >>>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>>> 93146564743f c1d155d8c7ad >>>>>>>>>>>> node-exporter.ceph1 >>>>>>>>>>>> ceph1 running (9h) 64s ago 2w 0.18.1 >>>>>>>>>>>> quay.io/prometheus/node-exporter:v0.18.1 >>>>>>>>>>>> e5a616e4b9cf 2ff235fe0e42 >>>>>>>>>>>> node-exporter.ceph2 >>>>>>>>>>>> ceph2 running (9h) 65s ago 13d 0.18.1 >>>>>>>>>>>> quay.io/prometheus/node-exporter:v0.18.1 >>>>>>>>>>>> e5a616e4b9cf 17678b9ba602 >>>>>>>>>>>> osd.0 >>>>>>>>>>>> ceph1 running (9h) 64s ago 13d 15.2.17 >>>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>>> 93146564743f d0fd73b777a3 >>>>>>>>>>>> osd.1 >>>>>>>>>>>> ceph1 running (9h) 64s ago 13d 15.2.17 >>>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>>> 93146564743f 049120e83102 >>>>>>>>>>>> osd.2 >>>>>>>>>>>> ceph2 running (9h) 65s ago 13d 15.2.17 >>>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>>> 93146564743f 8700e8cefd1f >>>>>>>>>>>> osd.3 >>>>>>>>>>>> ceph2 running (9h) 65s ago 13d 15.2.17 
>>>>>>>>>>>> quay.io/ceph/ceph:v15 >>>>>>>>>>>> 93146564743f 9c71bc87ed16 >>>>>>>>>>>> prometheus.ceph1 >>>>>>>>>>>> ceph1 running (9h) 64s ago 2w 2.18.1 >>>>>>>>>>>> quay.io/prometheus/prometheus:v2.18.1 >>>>>>>>>>>> de242295e225 74a538efd61e >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Sep 2, 2022 at 10:10 AM Adam King <adking@xxxxxxxxxx> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> maybe also a "ceph orch ps --refresh"? It might still have the >>>>>>>>>>>>> old cached daemon inventory from before you remove the files. >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, Sep 2, 2022 at 9:57 AM Satish Patel < >>>>>>>>>>>>> satish.txt@xxxxxxxxx> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Adam, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I have deleted file located here - rm >>>>>>>>>>>>>> /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >>>>>>>>>>>>>> >>>>>>>>>>>>>> But still getting the same error, do i need to do anything >>>>>>>>>>>>>> else? >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Sep 2, 2022 at 9:51 AM Adam King <adking@xxxxxxxxxx> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Okay, I'm wondering if this is an issue with version >>>>>>>>>>>>>>> mismatch. Having previously had a 16.2.10 mgr and then now having a 15.2.17 >>>>>>>>>>>>>>> one that doesn't expect this sort of thing to be present. Either way, I'd >>>>>>>>>>>>>>> think just deleting this cephadm. >>>>>>>>>>>>>>> 7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >>>>>>>>>>>>>>> (and any others like it) file would be the way forward to >>>>>>>>>>>>>>> get orch ls working again. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Fri, Sep 2, 2022 at 9:44 AM Satish Patel < >>>>>>>>>>>>>>> satish.txt@xxxxxxxxx> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Adam, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> In cephadm ls i found the following service but i believe >>>>>>>>>>>>>>>> it was there before also. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> "style": "cephadm:v1", >>>>>>>>>>>>>>>> "name": >>>>>>>>>>>>>>>> "cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d", >>>>>>>>>>>>>>>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >>>>>>>>>>>>>>>> "systemd_unit": >>>>>>>>>>>>>>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >>>>>>>>>>>>>>>> ", >>>>>>>>>>>>>>>> "enabled": false, >>>>>>>>>>>>>>>> "state": "stopped", >>>>>>>>>>>>>>>> "container_id": null, >>>>>>>>>>>>>>>> "container_image_name": null, >>>>>>>>>>>>>>>> "container_image_id": null, >>>>>>>>>>>>>>>> "version": null, >>>>>>>>>>>>>>>> "started": null, >>>>>>>>>>>>>>>> "created": null, >>>>>>>>>>>>>>>> "deployed": null, >>>>>>>>>>>>>>>> "configured": null >>>>>>>>>>>>>>>> }, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Look like remove didn't work >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> root@ceph1:~# ceph orch rm cephadm >>>>>>>>>>>>>>>> Failed to remove service. <cephadm> was not found. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> root@ceph1:~# ceph orch rm >>>>>>>>>>>>>>>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >>>>>>>>>>>>>>>> Failed to remove service. >>>>>>>>>>>>>>>> <cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d> >>>>>>>>>>>>>>>> was not found. 
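
For completeness, the cleanup that eventually got "ceph orch ls" working again (per Adam's suggestions above) amounts to removing the stray cephadm.<hash> file and refreshing the cached inventory; <fsid> and <hash> here are placeholders for the values shown by "cephadm ls":

# on the host where "cephadm ls" reports the stray cephadm.<hash> entry
rm /var/lib/ceph/<fsid>/cephadm.<hash>
# force cephadm to rebuild its cached daemon inventory
ceph orch ps --refresh
ceph orch ls
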
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Sep 2, 2022 at 8:27 AM Adam King <adking@xxxxxxxxxx> >>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> this looks like an old traceback you would get if you >>>>>>>>>>>>>>>>> ended up with a service type that shouldn't be there somehow. The things >>>>>>>>>>>>>>>>> I'd probably check are that "cephadm ls" on either host definitely doesn't >>>>>>>>>>>>>>>>> report and strange things that aren't actually daemons in your cluster such >>>>>>>>>>>>>>>>> as "cephadm.<hash>". Another thing you could maybe try, as I believe the >>>>>>>>>>>>>>>>> assertion it's giving is for an unknown service type here ("AssertionError: >>>>>>>>>>>>>>>>> cephadm"), is just "ceph orch rm cephadm" which would maybe cause it to >>>>>>>>>>>>>>>>> remove whatever it thinks is this "cephadm" service that it has deployed. >>>>>>>>>>>>>>>>> Lastly, you could try having the mgr you manually deploy be a 16.2.10 one >>>>>>>>>>>>>>>>> instead of 15.2.17 (I'm assuming here, but the line numbers in that >>>>>>>>>>>>>>>>> traceback suggest octopus). The 16.2.10 one is just much less likely to >>>>>>>>>>>>>>>>> have a bug that causes something like this. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, Sep 2, 2022 at 1:41 AM Satish Patel < >>>>>>>>>>>>>>>>> satish.txt@xxxxxxxxx> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Now when I run "ceph orch ps" it works but the following >>>>>>>>>>>>>>>>>> command throws an >>>>>>>>>>>>>>>>>> error. Trying to bring up second mgr using ceph orch >>>>>>>>>>>>>>>>>> apply mgr command but >>>>>>>>>>>>>>>>>> didn't help >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> root@ceph1:/ceph-disk# ceph version >>>>>>>>>>>>>>>>>> ceph version 15.2.17 >>>>>>>>>>>>>>>>>> (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus >>>>>>>>>>>>>>>>>> (stable) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> root@ceph1:/ceph-disk# ceph orch ls >>>>>>>>>>>>>>>>>> Error EINVAL: Traceback (most recent call last): >>>>>>>>>>>>>>>>>> File "/usr/share/ceph/mgr/mgr_module.py", line 1212, in >>>>>>>>>>>>>>>>>> _handle_command >>>>>>>>>>>>>>>>>> return self.handle_command(inbuf, cmd) >>>>>>>>>>>>>>>>>> File "/usr/share/ceph/mgr/orchestrator/_interface.py", >>>>>>>>>>>>>>>>>> line 140, in >>>>>>>>>>>>>>>>>> handle_command >>>>>>>>>>>>>>>>>> return dispatch[cmd['prefix']].call(self, cmd, inbuf) >>>>>>>>>>>>>>>>>> File "/usr/share/ceph/mgr/mgr_module.py", line 320, in >>>>>>>>>>>>>>>>>> call >>>>>>>>>>>>>>>>>> return self.func(mgr, **kwargs) >>>>>>>>>>>>>>>>>> File "/usr/share/ceph/mgr/orchestrator/_interface.py", >>>>>>>>>>>>>>>>>> line 102, in >>>>>>>>>>>>>>>>>> <lambda> >>>>>>>>>>>>>>>>>> wrapper_copy = lambda *l_args, **l_kwargs: >>>>>>>>>>>>>>>>>> wrapper(*l_args, **l_kwargs) >>>>>>>>>>>>>>>>>> File "/usr/share/ceph/mgr/orchestrator/_interface.py", >>>>>>>>>>>>>>>>>> line 91, in wrapper >>>>>>>>>>>>>>>>>> return func(*args, **kwargs) >>>>>>>>>>>>>>>>>> File "/usr/share/ceph/mgr/orchestrator/module.py", line >>>>>>>>>>>>>>>>>> 503, in >>>>>>>>>>>>>>>>>> _list_services >>>>>>>>>>>>>>>>>> raise_if_exception(completion) >>>>>>>>>>>>>>>>>> File "/usr/share/ceph/mgr/orchestrator/_interface.py", >>>>>>>>>>>>>>>>>> line 642, in >>>>>>>>>>>>>>>>>> raise_if_exception >>>>>>>>>>>>>>>>>> raise e >>>>>>>>>>>>>>>>>> AssertionError: cephadm >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Sep 2, 2022 at 1:32 AM Satish Patel < >>>>>>>>>>>>>>>>>> satish.txt@xxxxxxxxx> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> > nevermind, i found doc related that and i am able to >>>>>>>>>>>>>>>>>> get 1 mgr up - >>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>> https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > On Fri, Sep 2, 2022 at 1:21 AM Satish Patel < >>>>>>>>>>>>>>>>>> satish.txt@xxxxxxxxx> wrote: >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> >> Folks, >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> I am having little fun time with cephadm and it's very >>>>>>>>>>>>>>>>>> annoying to deal >>>>>>>>>>>>>>>>>> >> with it >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> I have deployed a ceph cluster using cephadm on two >>>>>>>>>>>>>>>>>> nodes. Now when i was >>>>>>>>>>>>>>>>>> >> trying to upgrade and noticed hiccups where it just >>>>>>>>>>>>>>>>>> upgraded a single mgr >>>>>>>>>>>>>>>>>> >> with 16.2.10 but not other so i started messing around >>>>>>>>>>>>>>>>>> and somehow I >>>>>>>>>>>>>>>>>> >> deleted both mgr in the thought that cephadm will >>>>>>>>>>>>>>>>>> recreate them. >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> Now i don't have any single mgr so my ceph orch >>>>>>>>>>>>>>>>>> command hangs forever and >>>>>>>>>>>>>>>>>> >> looks like a chicken egg issue. >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> How do I recover from this? If I can't run the ceph >>>>>>>>>>>>>>>>>> orch command, I won't >>>>>>>>>>>>>>>>>> >> be able to redeploy my mgr daemons. >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> I am not able to find any mgr in the following command >>>>>>>>>>>>>>>>>> on both nodes. >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> $ cephadm ls | grep mgr >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx >>>>>>>>>>>>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx
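
As for the open question at the top of the thread (getting the single mon back): a minimal first check, assuming the mon.ceph1 daemon and its data directory survived the failed redeploy, is to see whether cephadm still lists the daemon on ceph1 and, if so, to start its systemd unit directly rather than through the (currently hanging) orchestrator:

# on ceph1: does cephadm still know about the mon?
cephadm ls | grep mon
# if it is listed, try starting the unit outside the orchestrator
systemctl start ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mon.ceph1.service
systemctl status ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mon.ceph1.service

If the monitor's store under /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/mon.ceph1 is intact, starting the unit should restore the single-mon quorum and unblock "ceph -s".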