Great, thanks! Don't ask me how many commands I typed to fix this issue. I finally did it. Basically I fixed /etc/hosts and then removed the broken mgr daemon with the following command:

ceph orch daemon rm mgr.ceph1.xmbvsb

cephadm then automatically deployed a new, working mgr. I also found that "ceph orch ps" was hanging; the solution was to restart all Ceph daemons with "systemctl restart ceph.target".

root@ceph1:/ceph-disk# ceph orch ps
NAME                 HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
alertmanager.ceph1   ceph1               running (12m)  9m ago     2w   16.0M    -        0.20.0   0881eb8f169f  d064a0177439
crash.ceph1          ceph1               running (49m)  9m ago     2w   7963k    -        15.2.17  93146564743f  550b088467e4
crash.ceph2          ceph2               running (35m)  9m ago     13d  7287k    -        15.2.17  93146564743f  c4b5b3327fa5
grafana.ceph1        ceph1               running (14m)  9m ago     2w   34.9M    -        6.7.4    557c83e11646  46048ebff031
mgr.ceph1.hxsfrs     ceph1  *:8443,9283  running (13m)  9m ago     13m  327M     -        15.2.17  93146564743f  4c5169890e9d
mgr.ceph2.hmbdla     ceph2               running (35m)  9m ago     13d  435M     -        16.2.10  0d668911f040  361d58a423cd
mon.ceph1            ceph1               running (49m)  9m ago     2w   85.5M    2048M    15.2.17  93146564743f  a5f055953256
node-exporter.ceph1  ceph1               running (14m)  9m ago     2w   32.9M    -        0.18.1   e5a616e4b9cf  833cc2e6c9ed
node-exporter.ceph2  ceph2               running (13m)  9m ago     13d  33.9M    -        0.18.1   e5a616e4b9cf  30d15dde3860
osd.0                ceph1               running (49m)  9m ago     13d  355M     4096M    15.2.17  93146564743f  6e9bee5c211e
osd.1                ceph1               running (49m)  9m ago     13d  372M     4096M    15.2.17  93146564743f  09b8616bc096
osd.2                ceph2               running (35m)  9m ago     13d  287M     4096M    15.2.17  93146564743f  20f75a1b5221
osd.3                ceph2               running (35m)  9m ago     13d  300M     4096M    15.2.17  93146564743f  c57154355b03
prometheus.ceph1     ceph1               running (12m)  9m ago     2w   89.5M    -        2.18.1   de242295e225  b5ff35307ac0

Now I am going to start the upgrade process next. I will keep you posted on how it goes.
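For anyone hitting the same symptoms, the recovery steps above can be sketched as a short script. This is a hedged sketch, not an exact transcript of the fix: the daemon name mgr.ceph1.xmbvsb is specific to this cluster (take yours from "cephadm ls"), the cluster-modifying commands are only echoed so the sketch is safe to run anywhere, and the /etc/hosts check operates on a sample file reproducing the broken entries from this thread.

```shell
# Report short hostnames that appear in more than one /etc/hosts entry --
# the duplicate-alias mistake that broke cephadm's ssh connections here.
find_dup_aliases() {
    awk '!/^#/ && NF >= 2 { for (i = 2; i <= NF; i++) count[$i]++ }
         END { for (h in count) if (count[h] > 1) print "duplicate alias: " h }' "$1"
}

# Sample input reproducing the broken /etc/hosts from this thread
# (both lines map the short name "ceph1").
cat > /tmp/hosts.sample <<'EOF'
10.73.0.191 ceph1.example.com ceph1
10.73.0.192 ceph2.example.com ceph1
EOF
find_dup_aliases /tmp/hosts.sample    # prints: duplicate alias: ceph1

# Recovery once /etc/hosts is fixed (echoed here; drop the echo to run).
# Removing the broken mgr lets cephadm redeploy a fresh one automatically,
# as long as the mgr service spec still says count:2.
echo "+ ceph orch daemon rm mgr.ceph1.xmbvsb"
# If 'ceph orch ps' still hangs afterwards, restart all Ceph daemons:
echo "+ systemctl restart ceph.target"
```

Run it against the real /etc/hosts on every host in the cluster; each short hostname should resolve from exactly one line.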
On Thu, Sep 1, 2022 at 10:06 PM Adam King <adking@xxxxxxxxxx> wrote:

> I'm not sure exactly what needs to be done to fix that, but I'd imagine
> just editing the /etc/hosts file on all your hosts to be correct would be
> the start (the cephadm shell would have taken its /etc/hosts off of
> whatever host you ran the shell from). Unfortunately I'm not much of a
> networking expert, and if you have some sort of DNS setup going on for your
> local network I'm not too sure what to do there, but if it's possible, just
> fixing the /etc/hosts entries will resolve things. Either way, once you've
> got the networking fixed so that ssh-ing to the hosts works as expected with
> the IPs, you might need to re-add one or both of the hosts to the cluster
> with the correct IP as well ("ceph orch host add <hostname> <ip>"). I believe
> if you just run the orch host add command again with a different IP but the
> same hostname, it will just change the IP cephadm has stored for the host.
> If that isn't working, running "ceph orch host rm <hostname> --force"
> beforehand should make it work (if you just remove the host with --force, it
> shouldn't touch the host's daemons and should therefore be a relatively
> safe operation). In the end, the IP cephadm lists for each host in "ceph
> orch host ls" must be an IP that allows correctly ssh-ing to the host.
>
> On Thu, Sep 1, 2022 at 9:17 PM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>
>> Hi Adam,
>>
>> You are correct, it looks like it was a naming issue in my /etc/hosts
>> file. Is there a way to correct it?
>>
>> If you see, I have ceph1 two times. :(
>>
>> 10.73.0.191 ceph1.example.com ceph1
>> 10.73.0.192 ceph2.example.com ceph1
>>
>> On Thu, Sep 1, 2022 at 8:06 PM Adam King <adking@xxxxxxxxxx> wrote:
>>
>>> The naming for daemons is a bit different for each daemon type, but for
>>> mgr daemons it's always "mgr.<hostname>.<random-6-chars>".
>>> The daemons cephadm will be able to find for something like a daemon
>>> redeploy are pretty much always whatever is reported in "ceph orch ps".
>>> Given that "mgr.ceph1.xmbvsb" isn't listed there, it's not surprising it
>>> said it couldn't find it.
>>>
>>> There is definitely something very odd going on here. It looks like the
>>> crash daemons as well are reporting a duplicate "crash.ceph2" on both
>>> ceph1 and ceph2. Going back to your original orch ps output from the first
>>> email, it seems that every daemon has a duplicate and none of the actual
>>> daemons listed in the "cephadm ls" on ceph1 are actually being reported in
>>> the orch ps output. I think something may have gone wrong with the host and
>>> networking setup here, and it seems to be reporting ceph2's daemons as the
>>> daemons for both ceph1 and ceph2, as if trying to connect to ceph1 ends up
>>> connecting to ceph2. The only time I've seen anything like this was when I
>>> made a mistake and set up a virtual IP on one host that was the same as the
>>> actual IP for another host in the cluster, and cephadm basically ended up
>>> ssh-ing to the same host via both IPs (the one that was supposed to be for
>>> host A and host B, where the virtual IP matching host B was set up on host
>>> A). I doubt you're in that exact situation, but I think we need to look
>>> very closely at the networking setup here. I would try opening up a cephadm
>>> shell and ssh-ing to each of the two hosts by the IP listed in "ceph orch
>>> host ls", and make sure you actually get to the correct host and it has the
>>> correct hostname. Given the output, I wouldn't be surprised if trying to
>>> connect to ceph1's IP landed you on ceph2, or vice versa. I will say I
>>> found it a bit odd originally when I saw the two IPs were 10.73.0.192 and
>>> 10.73.3.192. There's nothing necessarily wrong with that, but typically IPs
>>> on the host are more likely to differ at the end than in the middle (e.g.
>>> 192.168.122.1 and 192.168.122.2 rather than 192.168.1.122 and
>>> 192.168.2.122), and it did make me wonder if a mistake had occurred in the
>>> networking. Either way, there's clearly something making it think ceph2's
>>> daemons are on both ceph1 and ceph2, and some sort of networking issue is
>>> the only thing I'm aware of currently that causes something like that.
>>>
>>> On Thu, Sep 1, 2022 at 6:30 PM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>
>>>> Hi Adam,
>>>>
>>>> I have also noticed a very strange thing: duplicate names in the
>>>> following output. Is this normal? I don't know how it got here. Is there
>>>> a way I can rename them?
>>>>
>>>> root@ceph1:~# ceph orch ps
>>>> NAME                 HOST   PORTS        STATUS          REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID      CONTAINER ID
>>>> alertmanager.ceph1   ceph1  *:9093,9094  starting        -          -    -        -        <unknown>  <unknown>     <unknown>
>>>> crash.ceph2          ceph1               running (13d)   10s ago    13d  10.0M    -        15.2.17    93146564743f  0a009254afb0
>>>> crash.ceph2          ceph2               running (13d)   10s ago    13d  10.0M    -        15.2.17    93146564743f  0a009254afb0
>>>> grafana.ceph1        ceph1  *:3000       starting        -          -    -        -        <unknown>  <unknown>     <unknown>
>>>> mgr.ceph2.hmbdla     ceph1               running (103m)  10s ago    13d  518M     -        16.2.10    0d668911f040  745245c18d5e
>>>> mgr.ceph2.hmbdla     ceph2               running (103m)  10s ago    13d  518M     -        16.2.10    0d668911f040  745245c18d5e
>>>> node-exporter.ceph2  ceph1               running (7h)    10s ago    13d  70.2M    -        0.18.1     e5a616e4b9cf  d0ba04bb977c
>>>> node-exporter.ceph2  ceph2               running (7h)    10s ago    13d  70.2M    -        0.18.1     e5a616e4b9cf  d0ba04bb977c
>>>> osd.2                ceph1               running (19h)   10s ago    13d  901M     4096M    15.2.17    93146564743f  e286fb1c6302
>>>> osd.2                ceph2               running (19h)   10s ago    13d  901M     4096M    15.2.17    93146564743f  e286fb1c6302
>>>> osd.3                ceph1               running (19h)   10s ago    13d  1006M    4096M    15.2.17    93146564743f  d3ae5d9f694f
>>>> osd.3                ceph2               running (19h)   10s ago    13d  1006M    4096M    15.2.17    93146564743f  d3ae5d9f694f
>>>> osd.5                ceph1               running (19h)   10s ago    9d   222M     4096M    15.2.17    93146564743f  405068fb474e
>>>> osd.5                ceph2               running (19h)   10s ago    9d   222M     4096M    15.2.17    93146564743f  405068fb474e
>>>> prometheus.ceph1     ceph1  *:9095       running (15s)   10s ago    15s  30.6M    -                   514e6a882f6e  65a0acfed605
>>>> prometheus.ceph1     ceph2  *:9095       running (15s)   10s ago    15s  30.6M    -                   514e6a882f6e  65a0acfed605
>>>>
>>>> I found the following example link, which has all different names; how
>>>> does cephadm decide on naming?
>>>>
>>>> https://achchusnulchikam.medium.com/deploy-ceph-cluster-with-cephadm-on-centos-8-257b300e7b42
>>>>
>>>> On Thu, Sep 1, 2022 at 6:20 PM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>
>>>>> Hi Adam,
>>>>>
>>>>> I am getting the following error; not sure why it's not able to find it.
>>>>>
>>>>> root@ceph1:~# ceph orch daemon redeploy mgr.ceph1.xmbvsb
>>>>> Error EINVAL: Unable to find mgr.ceph1.xmbvsb daemon(s)
>>>>>
>>>>> On Thu, Sep 1, 2022 at 5:57 PM Adam King <adking@xxxxxxxxxx> wrote:
>>>>>
>>>>>> What happens if you run `ceph orch daemon redeploy mgr.ceph1.xmbvsb`?
>>>>>>
>>>>>> On Thu, Sep 1, 2022 at 5:12 PM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>
>>>>>>> Hi Adam,
>>>>>>>
>>>>>>> Here is the requested output:
>>>>>>>
>>>>>>> root@ceph1:~# ceph health detail
>>>>>>> HEALTH_WARN 4 stray daemon(s) not managed by cephadm
>>>>>>> [WRN] CEPHADM_STRAY_DAEMON: 4 stray daemon(s) not managed by cephadm
>>>>>>>     stray daemon mon.ceph1 on host ceph1 not managed by cephadm
>>>>>>>     stray daemon osd.0 on host ceph1 not managed by cephadm
>>>>>>>     stray daemon osd.1 on host ceph1 not managed by cephadm
>>>>>>>     stray daemon osd.4 on host ceph1 not managed by cephadm
>>>>>>>
>>>>>>> root@ceph1:~# ceph orch host ls
>>>>>>> HOST   ADDR         LABELS  STATUS
>>>>>>> ceph1  10.73.0.192
>>>>>>> ceph2  10.73.3.192  _admin
>>>>>>> 2 hosts in cluster
>>>>>>>
>>>>>>> My cephadm ls says the mgr is in an error state:
>>>>>>>
>>>>>>>     {
>>>>>>>         "style": "cephadm:v1",
>>>>>>>         "name": "mgr.ceph1.xmbvsb",
>>>>>>>         "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>>         "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb",
>>>>>>>         "enabled": true,
>>>>>>>         "state": "error",
>>>>>>>         "container_id": null,
>>>>>>>         "container_image_name": "quay.io/ceph/ceph:v15",
>>>>>>>         "container_image_id": null,
>>>>>>>         "version": null,
>>>>>>>         "started": null,
>>>>>>>         "created": "2022-09-01T20:59:49.314347Z",
>>>>>>>         "deployed": "2022-09-01T20:59:48.718347Z",
>>>>>>>         "configured": "2022-09-01T20:59:49.314347Z"
>>>>>>>     },
>>>>>>>
>>>>>>> And I am getting this error:
>>>>>>>
>>>>>>> root@ceph1:~# cephadm unit --fsid f270ad9e-1f6f-11ed-b6f8-a539d87379ea --name mgr.ceph1.xmbvsb start
>>>>>>> stderr Job for ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb.service failed because the control process exited with error code.
>>>>>>> stderr See "systemctl status ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb.service" and "journalctl -xe" for details.
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/usr/sbin/cephadm", line 6250, in <module>
>>>>>>>     r = args.func()
>>>>>>>   File "/usr/sbin/cephadm", line 1357, in _infer_fsid
>>>>>>>     return func()
>>>>>>>   File "/usr/sbin/cephadm", line 3727, in command_unit
>>>>>>>     call_throws([
>>>>>>>   File "/usr/sbin/cephadm", line 1119, in call_throws
>>>>>>>     raise RuntimeError('Failed command: %s' % ' '.join(command))
>>>>>>> RuntimeError: Failed command: systemctl start ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb
>>>>>>>
>>>>>>> How do I remove and re-deploy the mgr?
>>>>>>>
>>>>>>> On Thu, Sep 1, 2022 at 4:54 PM Adam King <adking@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> cephadm deploys the containers with --rm, so they will get removed
>>>>>>>> if you stop them. As for getting the 2nd mgr back, if it is still listed
>>>>>>>> in `ceph orch ps` you should be able to do a `ceph orch daemon redeploy
>>>>>>>> <mgr-daemon-name>`, where <mgr-daemon-name> should match the name given
>>>>>>>> in the orch ps output for the one that isn't actually up. If it isn't
>>>>>>>> listed there, given you have a count of 2, cephadm should deploy another
>>>>>>>> one. I do see in the orch ls output you posted that it says the mgr
>>>>>>>> service has "2/2" running, which implies it believes a 2nd mgr is present
>>>>>>>> (and you would therefore be able to try the daemon redeploy if that
>>>>>>>> daemon isn't actually there).
>>>>>>>>
>>>>>>>> Is it still reporting the duplicate OSDs in orch ps? I see in the
>>>>>>>> cephadm ls output on ceph1 that osd.2 isn't being reported, which was
>>>>>>>> reported as being on ceph1 in the orch ps output in your original message
>>>>>>>> in this thread. I'm interested in what `ceph health detail` is reporting
>>>>>>>> now as well, as it says there are 4 stray daemons. Also, the `ceph orch
>>>>>>>> host ls` output, just to get a better grasp of the topology of this cluster.
>>>>>>>>
>>>>>>>> On Thu, Sep 1, 2022 at 3:50 PM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>>> Adam,
>>>>>>>>>
>>>>>>>>> I posted a question related to upgrading earlier, and this thread is
>>>>>>>>> related to that; I opened a new one because I found that error in the
>>>>>>>>> logs and thought the upgrade might be stuck because of duplicate OSDs.
>>>>>>>>>
>>>>>>>>> root@ceph1:~# ls -l /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/
>>>>>>>>> total 44
>>>>>>>>> drwx------ 3 nobody nogroup 4096 Aug 19 05:37 alertmanager.ceph1
>>>>>>>>> drwx------ 3 167    167     4096 Aug 19 05:36 crash
>>>>>>>>> drwx------ 2 167    167     4096 Aug 19 05:37 crash.ceph1
>>>>>>>>> drwx------ 4 998    996     4096 Aug 19 05:37 grafana.ceph1
>>>>>>>>> drwx------ 2 167    167     4096 Aug 19 05:36 mgr.ceph1.xmbvsb
>>>>>>>>> drwx------ 3 167    167     4096 Aug 19 05:36 mon.ceph1
>>>>>>>>> drwx------ 2 nobody nogroup 4096 Aug 19 05:37 node-exporter.ceph1
>>>>>>>>> drwx------ 2 167    167     4096 Aug 19 17:55 osd.0
>>>>>>>>> drwx------ 2 167    167     4096 Aug 19 18:03 osd.1
>>>>>>>>> drwx------ 2 167    167     4096 Aug 31 05:20 osd.4
>>>>>>>>> drwx------ 4 nobody nogroup 4096 Aug 19 05:38 prometheus.ceph1
>>>>>>>>>
>>>>>>>>> Here is the output of cephadm ls:
>>>>>>>>>
>>>>>>>>> root@ceph1:~# cephadm ls
>>>>>>>>> [
>>>>>>>>>     {
>>>>>>>>>         "style": "cephadm:v1",
>>>>>>>>>         "name": "alertmanager.ceph1",
>>>>>>>>>         "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>>>>         "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@alertmanager.ceph1",
>>>>>>>>>         "enabled": true,
>>>>>>>>>         "state": "running",
>>>>>>>>>         "container_id": "97403cf9799711461216b7f83e88c574da2b631c7c65233ebd82d8a216a48924",
>>>>>>>>>         "container_image_name": "quay.io/prometheus/alertmanager:v0.20.0",
>>>>>>>>>         "container_image_id": "0881eb8f169f5556a292b4e2c01d683172b12830a62a9225a98a8e206bb734f0",
>>>>>>>>>         "version": "0.20.0",
>>>>>>>>>         "started": "2022-08-19T16:59:02.461978Z",
>>>>>>>>>         "created": "2022-08-19T03:37:16.403605Z",
>>>>>>>>>         "deployed": "2022-08-19T03:37:15.815605Z",
>>>>>>>>>         "configured": "2022-08-19T16:59:02.117607Z"
>>>>>>>>>     },
>>>>>>>>>     {
>>>>>>>>>         "style": "cephadm:v1",
>>>>>>>>>         "name": "grafana.ceph1",
>>>>>>>>>         "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>>>>         "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@grafana.ceph1",
>>>>>>>>>         "enabled": true,
>>>>>>>>>         "state": "running",
>>>>>>>>>         "container_id": "c7136aea8349a37dd9b320acd926c4bcbed95bc4549779e9580ed4290edc2117",
>>>>>>>>>         "container_image_name": "quay.io/ceph/ceph-grafana:6.7.4",
>>>>>>>>>         "container_image_id": "557c83e11646f123a27b5e4b62ac6c45e7bb8b2e90d6044034d0db5b7019415c",
>>>>>>>>>         "version": "6.7.4",
>>>>>>>>>         "started": "2022-08-19T03:38:05.481992Z",
>>>>>>>>>         "created": "2022-08-19T03:37:46.823604Z",
>>>>>>>>>         "deployed": "2022-08-19T03:37:46.239604Z",
>>>>>>>>>         "configured": "2022-08-19T03:38:05.163603Z"
>>>>>>>>>     },
>>>>>>>>>     {
>>>>>>>>>         "style": "cephadm:v1",
>>>>>>>>>         "name": "osd.1",
>>>>>>>>>         "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>>>>         "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@osd.1",
>>>>>>>>>         "enabled": true,
>>>>>>>>>         "state": "running",
>>>>>>>>>         "container_id": "51586b775bda0485c8b27b8401ac2430570e6f42cb7e12bae3eea05064f1fd20",
>>>>>>>>>         "container_image_name": "quay.io/ceph/ceph:v15",
>>>>>>>>>         "container_image_id": "93146564743febec815d6a764dad93fc07ce971e88315403ac508cb5da6d35f4",
>>>>>>>>>         "version": "15.2.17",
>>>>>>>>>         "started": "2022-08-19T16:03:10.612432Z",
>>>>>>>>>         "created": "2022-08-19T16:03:09.765746Z",
>>>>>>>>>         "deployed": "2022-08-19T16:03:09.141746Z",
>>>>>>>>>         "configured": "2022-08-31T02:53:34.224643Z"
>>>>>>>>>     },
>>>>>>>>>     {
>>>>>>>>>         "style": "cephadm:v1",
>>>>>>>>>         "name": "prometheus.ceph1",
>>>>>>>>>         "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>>>>         "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@prometheus.ceph1",
>>>>>>>>>         "enabled": true,
>>>>>>>>>         "state": "running",
>>>>>>>>>         "container_id": "ba305236e5db9f2095b23b86a2340924909e9e8e54e5cdbe1d51c14dc4c8587a",
>>>>>>>>>         "container_image_name": "quay.io/prometheus/prometheus:v2.18.1",
>>>>>>>>>         "container_image_id": "de242295e2257c37c8cadfd962369228f8f10b2d48a44259b65fef44ad4f6490",
>>>>>>>>>         "version": "2.18.1",
>>>>>>>>>         "started": "2022-08-19T16:59:03.538981Z",
>>>>>>>>>         "created": "2022-08-19T03:38:01.567604Z",
>>>>>>>>>         "deployed": "2022-08-19T03:38:00.983603Z",
>>>>>>>>>         "configured": "2022-08-19T16:59:03.193607Z"
>>>>>>>>>     },
>>>>>>>>>     {
>>>>>>>>>         "style": "cephadm:v1",
>>>>>>>>>         "name": "node-exporter.ceph1",
>>>>>>>>>         "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>>>>         "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@node-exporter.ceph1",
>>>>>>>>>         "enabled": true,
>>>>>>>>>         "state": "running",
>>>>>>>>>         "container_id": "00bf3ad29cce79e905e8533648ef38cbd232990fa9616aff1c0020b7b66d0cc0",
>>>>>>>>>         "container_image_name": "quay.io/prometheus/node-exporter:v0.18.1",
>>>>>>>>>         "container_image_id": "e5a616e4b9cf68dfcad7782b78e118be4310022e874d52da85c55923fb615f87",
>>>>>>>>>         "version": "0.18.1",
>>>>>>>>>         "started": "2022-08-19T03:37:55.232032Z",
>>>>>>>>>         "created": "2022-08-19T03:37:47.711604Z",
>>>>>>>>>         "deployed": "2022-08-19T03:37:47.155604Z",
>>>>>>>>>         "configured": "2022-08-19T03:37:47.711604Z"
>>>>>>>>>     },
>>>>>>>>>     {
>>>>>>>>>         "style": "cephadm:v1",
>>>>>>>>>         "name": "osd.0",
>>>>>>>>>         "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>>>>         "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@osd.0",
>>>>>>>>>         "enabled": true,
>>>>>>>>>         "state": "running",
>>>>>>>>>         "container_id": "6b69046972dfbdb53665228258a15b13bc13a462ca4e066a4eca0cd593442d2d",
>>>>>>>>>         "container_image_name": "quay.io/ceph/ceph:v15",
>>>>>>>>>         "container_image_id": "93146564743febec815d6a764dad93fc07ce971e88315403ac508cb5da6d35f4",
>>>>>>>>>         "version": "15.2.17",
>>>>>>>>>         "started": "2022-08-19T15:55:20.580157Z",
>>>>>>>>>         "created": "2022-08-19T15:55:19.725766Z",
>>>>>>>>>         "deployed": "2022-08-19T15:55:19.125766Z",
>>>>>>>>>         "configured": "2022-08-31T02:53:34.760643Z"
>>>>>>>>>     },
>>>>>>>>>     {
>>>>>>>>>         "style": "cephadm:v1",
>>>>>>>>>         "name": "crash.ceph1",
>>>>>>>>>         "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>>>>         "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@crash.ceph1",
>>>>>>>>>         "enabled": true,
>>>>>>>>>         "state": "running",
>>>>>>>>>         "container_id": "6bc56f478ccb96841fe86a540e284c175300b83dad9e906ae3230f22341c8293",
>>>>>>>>>         "container_image_name": "quay.io/ceph/ceph:v15",
>>>>>>>>>         "container_image_id": "93146564743febec815d6a764dad93fc07ce971e88315403ac508cb5da6d35f4",
>>>>>>>>>         "version": "15.2.17",
>>>>>>>>>         "started": "2022-08-19T03:37:17.660080Z",
>>>>>>>>>         "created": "2022-08-19T03:37:17.559605Z",
>>>>>>>>>         "deployed": "2022-08-19T03:37:16.991605Z",
>>>>>>>>>         "configured": "2022-08-19T03:37:17.559605Z"
>>>>>>>>>     },
>>>>>>>>>     {
>>>>>>>>>         "style": "cephadm:v1",
>>>>>>>>>         "name": "mon.ceph1",
>>>>>>>>>         "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>>>>         "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mon.ceph1",
>>>>>>>>>         "enabled": true,
>>>>>>>>>         "state": "running",
>>>>>>>>>         "container_id": "d0f03130491daebbe783c4990c6a4383d49e7a0e2bdf8c5d1eed012865e5d875",
>>>>>>>>>         "container_image_name": "quay.io/ceph/ceph:v15",
>>>>>>>>>         "container_image_id": "93146564743febec815d6a764dad93fc07ce971e88315403ac508cb5da6d35f4",
>>>>>>>>>         "version": "15.2.17",
>>>>>>>>>         "started": "2022-08-19T03:36:21.804129Z",
>>>>>>>>>         "created": "2022-08-19T03:36:19.743608Z",
>>>>>>>>>         "deployed": "2022-08-19T03:36:18.439608Z",
>>>>>>>>>         "configured": "2022-08-19T03:38:05.931603Z"
>>>>>>>>>     },
>>>>>>>>>     {
>>>>>>>>>         "style": "cephadm:v1",
>>>>>>>>>         "name": "mgr.ceph1.xmbvsb",
>>>>>>>>>         "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>>>>         "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb",
>>>>>>>>>         "enabled": true,
>>>>>>>>>         "state": "stopped",
>>>>>>>>>         "container_id": null,
>>>>>>>>>         "container_image_name": "quay.io/ceph/ceph:v15",
>>>>>>>>>         "container_image_id": null,
>>>>>>>>>         "version": null,
>>>>>>>>>         "started": null,
>>>>>>>>>         "created": "2022-08-19T03:36:22.815608Z",
>>>>>>>>>         "deployed": "2022-08-19T03:36:22.239608Z",
>>>>>>>>>         "configured": "2022-08-19T03:38:06.487603Z"
>>>>>>>>>     },
>>>>>>>>>     {
>>>>>>>>>         "style": "cephadm:v1",
>>>>>>>>>         "name": "osd.4",
>>>>>>>>>         "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea",
>>>>>>>>>         "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@osd.4",
>>>>>>>>>         "enabled": true,
>>>>>>>>>         "state": "running",
>>>>>>>>>         "container_id": "938840fe7fd0cb45cc26d077837c9847d7c7a7a68c7e1588d4bb4343c695a071",
>>>>>>>>>         "container_image_name": "quay.io/ceph/ceph:v15",
>>>>>>>>>         "container_image_id": "93146564743febec815d6a764dad93fc07ce971e88315403ac508cb5da6d35f4",
>>>>>>>>>         "version": "15.2.17",
>>>>>>>>>         "started": "2022-08-31T03:20:55.416219Z",
>>>>>>>>>         "created": "2022-08-23T21:46:49.458533Z",
>>>>>>>>>         "deployed": "2022-08-23T21:46:48.818533Z",
>>>>>>>>>         "configured": "2022-08-31T02:53:41.196643Z"
>>>>>>>>>     }
>>>>>>>>> ]
>>>>>>>>>
>>>>>>>>> I have noticed one more thing: I did docker stop
>>>>>>>>> <container_id_of_mgr> on the ceph1 node, and now my mgr container has
>>>>>>>>> disappeared. I can't see it anywhere, and I'm not sure how to bring the
>>>>>>>>> mgr back, because the upgrade won't let me do anything if I don't have
>>>>>>>>> two mgr instances.
>>>>>>>>> root@ceph1:~# ceph -s
>>>>>>>>>   cluster:
>>>>>>>>>     id:     f270ad9e-1f6f-11ed-b6f8-a539d87379ea
>>>>>>>>>     health: HEALTH_WARN
>>>>>>>>>             4 stray daemon(s) not managed by cephadm
>>>>>>>>>
>>>>>>>>>   services:
>>>>>>>>>     mon: 1 daemons, quorum ceph1 (age 17h)
>>>>>>>>>     mgr: ceph2.hmbdla(active, since 5h)
>>>>>>>>>     osd: 6 osds: 6 up (since 40h), 6 in (since 8d)
>>>>>>>>>
>>>>>>>>>   data:
>>>>>>>>>     pools:   6 pools, 161 pgs
>>>>>>>>>     objects: 20.59k objects, 85 GiB
>>>>>>>>>     usage:   174 GiB used, 826 GiB / 1000 GiB avail
>>>>>>>>>     pgs:     161 active+clean
>>>>>>>>>
>>>>>>>>>   io:
>>>>>>>>>     client: 0 B/s rd, 12 KiB/s wr, 0 op/s rd, 2 op/s wr
>>>>>>>>>
>>>>>>>>>   progress:
>>>>>>>>>     Upgrade to quay.io/ceph/ceph:16.2.10 (0s)
>>>>>>>>>       [............................]
>>>>>>>>>
>>>>>>>>> I can see mgr count:2, but I'm not sure how to bring it back.
>>>>>>>>>
>>>>>>>>> root@ceph1:~# ceph orch ls
>>>>>>>>> NAME                       PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
>>>>>>>>> alertmanager               ?:9093,9094      1/1  20s ago    13d  count:1
>>>>>>>>> crash                                       2/2  20s ago    13d  *
>>>>>>>>> grafana                    ?:3000           1/1  20s ago    13d  count:1
>>>>>>>>> mgr                                         2/2  20s ago    13d  count:2
>>>>>>>>> mon                                         0/5  -          13d  <unmanaged>
>>>>>>>>> node-exporter              ?:9100           2/2  20s ago    13d  *
>>>>>>>>> osd                                           6  20s ago    -    <unmanaged>
>>>>>>>>> osd.all-available-devices                     0  -          13d  *
>>>>>>>>> osd.osd_spec_default                          0  -          8d   *
>>>>>>>>> prometheus                 ?:9095           1/1  20s ago    13d  count:1
>>>>>>>>>
>>>>>>>>> On Thu, Sep 1, 2022 at 12:28 PM Adam King <adking@xxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>> Are there any extra directories in /var/lib/ceph or
>>>>>>>>>> /var/lib/ceph/<fsid> that appear to be for those OSDs on that host?
>>>>>>>>>> When cephadm builds the info it uses for "ceph orch ps", it's actually
>>>>>>>>>> scraping those directories. The output of "cephadm ls" on the host with
>>>>>>>>>> the duplicates could also potentially have some insights.
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 1, 2022 at 12:15 PM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>>> Folks,
>>>>>>>>>>>
>>>>>>>>>>> I am playing with cephadm, and life was good until I started upgrading
>>>>>>>>>>> from Octopus to Pacific. My upgrade process got stuck after upgrading the
>>>>>>>>>>> mgr, and in the logs I can now see the following errors:
>>>>>>>>>>>
>>>>>>>>>>> root@ceph1:~# ceph log last cephadm
>>>>>>>>>>> 2022-09-01T14:40:45.739804+0000 mgr.ceph2.hmbdla (mgr.265806) 8 : cephadm [INF] Deploying daemon grafana.ceph1 on ceph1
>>>>>>>>>>> 2022-09-01T14:40:56.115693+0000 mgr.ceph2.hmbdla (mgr.265806) 14 : cephadm [INF] Deploying daemon prometheus.ceph1 on ceph1
>>>>>>>>>>> 2022-09-01T14:41:11.856725+0000 mgr.ceph2.hmbdla (mgr.265806) 25 : cephadm [INF] Reconfiguring alertmanager.ceph1 (dependencies changed)...
>>>>>>>>>>> 2022-09-01T14:41:11.861535+0000 mgr.ceph2.hmbdla (mgr.265806) 26 : cephadm [INF] Reconfiguring daemon alertmanager.ceph1 on ceph1
>>>>>>>>>>> 2022-09-01T14:41:12.927852+0000 mgr.ceph2.hmbdla (mgr.265806) 27 : cephadm [INF] Reconfiguring grafana.ceph1 (dependencies changed)...
>>>>>>>>>>> 2022-09-01T14:41:12.940615+0000 mgr.ceph2.hmbdla (mgr.265806) 28 : cephadm [INF] Reconfiguring daemon grafana.ceph1 on ceph1
>>>>>>>>>>> 2022-09-01T14:41:14.056113+0000 mgr.ceph2.hmbdla (mgr.265806) 33 : cephadm [INF] Found duplicate OSDs: osd.2 in status running on ceph1, osd.2 in status running on ceph2
>>>>>>>>>>> 2022-09-01T14:41:14.056437+0000 mgr.ceph2.hmbdla (mgr.265806) 34 : cephadm [INF] Found duplicate OSDs: osd.5 in status running on ceph1, osd.5 in status running on ceph2
>>>>>>>>>>> 2022-09-01T14:41:14.056630+0000 mgr.ceph2.hmbdla (mgr.265806) 35 : cephadm [INF] Found duplicate OSDs: osd.3 in status running on ceph1, osd.3 in status running on ceph2
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure where the duplicate names came from or how that happened.
>>>>>>>>>>> In the following output I can't see any duplication:
>>>>>>>>>>>
>>>>>>>>>>> root@ceph1:~# ceph osd tree
>>>>>>>>>>> ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
>>>>>>>>>>> -1         0.97656  root default
>>>>>>>>>>> -3         0.48828      host ceph1
>>>>>>>>>>>  4    hdd  0.09769          osd.4      up   1.00000  1.00000
>>>>>>>>>>>  0    ssd  0.19530          osd.0      up   1.00000  1.00000
>>>>>>>>>>>  1    ssd  0.19530          osd.1      up   1.00000  1.00000
>>>>>>>>>>> -5         0.48828      host ceph2
>>>>>>>>>>>  5    hdd  0.09769          osd.5      up   1.00000  1.00000
>>>>>>>>>>>  2    ssd  0.19530          osd.2      up   1.00000  1.00000
>>>>>>>>>>>  3    ssd  0.19530          osd.3      up   1.00000  1.00000
>>>>>>>>>>>
>>>>>>>>>>> But at the same time I can see duplicate OSD numbers on ceph1 and ceph2:
>>>>>>>>>>>
>>>>>>>>>>> root@ceph1:~# ceph orch ps
>>>>>>>>>>> NAME                 HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
>>>>>>>>>>> alertmanager.ceph1   ceph1  *:9093,9094  running (20s)  2s ago     20s  17.1M    -                 ba2b418f427c  856a4fe641f1
>>>>>>>>>>> alertmanager.ceph1   ceph2  *:9093,9094  running (20s)  3s ago     20s  17.1M    -                 ba2b418f427c  856a4fe641f1
>>>>>>>>>>> crash.ceph2          ceph1               running (12d)  2s ago     12d  10.0M    -        15.2.17  93146564743f  0a009254afb0
>>>>>>>>>>> crash.ceph2          ceph2               running (12d)  3s ago     12d  10.0M    -        15.2.17  93146564743f  0a009254afb0
>>>>>>>>>>> grafana.ceph1        ceph1  *:3000       running (18s)  2s ago     19s  47.9M    -        8.3.5    dad864ee21e9  7d7a70b8ab7f
>>>>>>>>>>> grafana.ceph1        ceph2  *:3000       running (18s)  3s ago     19s  47.9M    -        8.3.5    dad864ee21e9  7d7a70b8ab7f
>>>>>>>>>>> mgr.ceph2.hmbdla     ceph1               running (13h)  2s ago     12d  506M     -        16.2.10  0d668911f040  6274723c35f7
>>>>>>>>>>> mgr.ceph2.hmbdla     ceph2               running (13h)  3s ago     12d  506M     -        16.2.10  0d668911f040  6274723c35f7
>>>>>>>>>>> node-exporter.ceph2  ceph1               running (91m)  2s ago     12d  60.7M    -        0.18.1   e5a616e4b9cf  d0ba04bb977c
>>>>>>>>>>> node-exporter.ceph2  ceph2               running (91m)  3s ago     12d  60.7M    -        0.18.1   e5a616e4b9cf  d0ba04bb977c
>>>>>>>>>>> osd.2                ceph1               running (12h)  2s ago     12d  867M     4096M    15.2.17  93146564743f  e286fb1c6302
>>>>>>>>>>> osd.2                ceph2               running (12h)  3s ago     12d  867M     4096M    15.2.17  93146564743f  e286fb1c6302
>>>>>>>>>>> osd.3                ceph1               running (12h)  2s ago     12d  978M     4096M    15.2.17  93146564743f  d3ae5d9f694f
>>>>>>>>>>> osd.3                ceph2               running (12h)  3s ago     12d  978M     4096M    15.2.17  93146564743f  d3ae5d9f694f
>>>>>>>>>>> osd.5                ceph1               running (12h)  2s ago     8d   225M     4096M    15.2.17  93146564743f  405068fb474e
>>>>>>>>>>> osd.5                ceph2               running (12h)  3s ago     8d   225M     4096M    15.2.17  93146564743f  405068fb474e
>>>>>>>>>>> prometheus.ceph1     ceph1  *:9095       running (8s)   2s ago     8s   30.4M    -                 514e6a882f6e  9031dbe30cae
>>>>>>>>>>> prometheus.ceph1     ceph2  *:9095       running (8s)   3s ago     8s   30.4M    -                 514e6a882f6e  9031dbe30cae
>>>>>>>>>>>
>>>>>>>>>>> Is this a bug, or did I do something wrong? Any workaround to get out
>>>>>>>>>>> of this condition?
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>>>>>>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx