Hi Adam,

You are correct, it looks like it was a naming issue in my /etc/hosts file.
Is there a way to correct it? As you can see, I have ceph1 listed twice. :(

10.73.0.191 ceph1.example.com ceph1
10.73.0.192 ceph2.example.com ceph1

On Thu, Sep 1, 2022 at 8:06 PM Adam King <adking@xxxxxxxxxx> wrote:

> The naming for daemons is a bit different for each daemon type, but for
> mgr daemons it's always "mgr.<hostname>.<random-6-chars>". The daemons
> cephadm will be able to find for something like a daemon redeploy are
> pretty much always whatever is reported in "ceph orch ps". Given that
> "mgr.ceph1.xmbvsb" isn't listed there, it's not surprising it said it
> couldn't find it.
>
> There is definitely something very odd going on here. It looks like the
> crash daemons are also reporting a duplicate "crash.ceph2" on both ceph1
> and ceph2. Going back to your original orch ps output from the first
> email, every daemon seems to have a duplicate, and none of the actual
> daemons listed in the "cephadm ls" on ceph1 are being reported in the
> orch ps output. I think something may have gone wrong with the host and
> networking setup here: it seems to be reporting ceph2's daemons as the
> daemons for both ceph1 and ceph2, as if trying to connect to ceph1 ends
> up connecting to ceph2. The only time I've seen anything like this was
> when I made a mistake and set up a virtual IP on one host that was the
> same as the actual IP of another host in the cluster, and cephadm
> basically ended up ssh-ing to the same host via both IPs (the one that
> was supposed to be for host A and the one for host B, where the virtual
> IP matching host B was set up on host A). I doubt you're in that exact
> situation, but I think we need to look very closely at the networking
> setup here. I would try opening up a cephadm shell and ssh-ing to each of
> the two hosts by the IP listed in "ceph orch host ls" and make sure you
> actually get to the correct host and that it has the correct hostname.
> Given the output, I wouldn't be surprised if trying to connect to ceph1's
> IP landed you on ceph2 or vice versa. I will say I found it a bit odd
> originally when I saw the two IPs were 10.73.0.192 and 10.73.3.192.
> There's nothing necessarily wrong with that, but typically IPs on the
> hosts are more likely to differ at the end than in the middle (e.g.
> 192.168.122.1 and 192.168.122.2 rather than 192.168.1.122 and
> 192.168.2.122), and it did make me wonder if a mistake had occurred in
> the networking. Either way, there's clearly something making it think
> ceph2's daemons are on both ceph1 and ceph2, and some sort of networking
> issue is the only thing I'm aware of currently that causes something
> like that.
>
> On Thu, Sep 1, 2022 at 6:30 PM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>
>> Hi Adam,
>>
>> I have also noticed a very strange thing: duplicate names in the
>> following output. Is this normal? I don't know how it got here. Is there
>> a way I can rename them?
>>
>> root@ceph1:~# ceph orch ps
>> NAME                 HOST   PORTS        STATUS          REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID      CONTAINER ID
>> alertmanager.ceph1   ceph1  *:9093,9094  starting        -          -    -        -        <unknown>  <unknown>     <unknown>
>> crash.ceph2          ceph1               running (13d)   10s ago    13d  10.0M    -        15.2.17    93146564743f  0a009254afb0
>> crash.ceph2          ceph2               running (13d)   10s ago    13d  10.0M    -        15.2.17    93146564743f  0a009254afb0
>> grafana.ceph1        ceph1  *:3000       starting        -          -    -        -        <unknown>  <unknown>     <unknown>
>> mgr.ceph2.hmbdla     ceph1               running (103m)  10s ago    13d  518M     -        16.2.10    0d668911f040  745245c18d5e
>> mgr.ceph2.hmbdla     ceph2               running (103m)  10s ago    13d  518M     -        16.2.10    0d668911f040  745245c18d5e
>> node-exporter.ceph2  ceph1               running (7h)    10s ago    13d  70.2M    -        0.18.1     e5a616e4b9cf  d0ba04bb977c
>> node-exporter.ceph2  ceph2               running (7h)    10s ago    13d  70.2M    -        0.18.1     e5a616e4b9cf  d0ba04bb977c
>> osd.2                ceph1               running (19h)   10s ago    13d  901M     4096M    15.2.17    93146564743f  e286fb1c6302
>> osd.2                ceph2               running (19h)   10s ago    13d  901M     4096M    15.2.17    93146564743f  e286fb1c6302
>> osd.3                ceph1               running (19h)   10s ago    13d  1006M    4096M    15.2.17    93146564743f  d3ae5d9f694f
>> osd.3                ceph2               running (19h)   10s ago    13d  1006M    4096M    15.2.17    93146564743f  d3ae5d9f694f
>> osd.5                ceph1               running (19h)   10s ago    9d   222M     4096M    15.2.17    93146564743f  405068fb474e
>> osd.5                ceph2               running (19h)   10s ago    9d   222M     4096M    15.2.17    93146564743f  405068fb474e
>> prometheus.ceph1     ceph1  *:9095       running (15s)   10s ago    15s  30.6M    -                   514e6a882f6e  65a0acfed605
>> prometheus.ceph1     ceph2  *:9095       running (15s)   10s ago    15s  30.6M    -                   514e6a882f6e  65a0acfed605
>>
>> I found the following example link which has all different names, how
>> does cephadm decide naming?
>>
>> https://achchusnulchikam.medium.com/deploy-ceph-cluster-with-cephadm-on-centos-8-257b300e7b42
>>
>> On Thu, Sep 1, 2022 at 6:20 PM Satish Patel <satish.txt@xxxxxxxxx> wrote:
>>
>>> Hi Adam,
>>>
>>> Getting the following error, not sure why it's not able to find it.
>>>
>>> root@ceph1:~# ceph orch daemon redeploy mgr.ceph1.xmbvsb
>>> Error EINVAL: Unable to find mgr.ceph1.xmbvsb daemon(s)
>>>
>>> On Thu, Sep 1, 2022 at 5:57 PM Adam King <adking@xxxxxxxxxx> wrote:
>>>
>>>> what happens if you run `ceph orch daemon redeploy mgr.ceph1.xmbvsb`?
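
Coming back to the /etc/hosts issue at the top of this mail, here is
roughly what I am planning to run to verify the networking the way Adam
suggested and then correct the naming. This is only a sketch, and the
assumption that 10.73.0.191 is really ceph1 and 10.73.0.192 is really
ceph2 still needs to be confirmed against "ceph orch host ls" and the
interfaces on each box:

  # from the admin node: check what each name resolves to and where each IP lands
  getent hosts ceph1 ceph2
  ssh root@10.73.0.191 hostname -f
  ssh root@10.73.0.192 hostname -f

  # fix the second alias in /etc/hosts on both nodes (ceph1 -> ceph2):
  #   10.73.0.191 ceph1.example.com ceph1
  #   10.73.0.192 ceph2.example.com ceph2

  # if cephadm has the wrong address recorded for a host, point it at the
  # correct one, e.g. (assuming ceph2 really is 10.73.0.192):
  ceph orch host ls
  ceph orch host set-addr ceph2 10.73.0.192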
>>>> >>>> On Thu, Sep 1, 2022 at 5:12 PM Satish Patel <satish.txt@xxxxxxxxx> >>>> wrote: >>>> >>>>> Hi Adam, >>>>> >>>>> Here is requested output >>>>> >>>>> root@ceph1:~# ceph health detail >>>>> HEALTH_WARN 4 stray daemon(s) not managed by cephadm >>>>> [WRN] CEPHADM_STRAY_DAEMON: 4 stray daemon(s) not managed by cephadm >>>>> stray daemon mon.ceph1 on host ceph1 not managed by cephadm >>>>> stray daemon osd.0 on host ceph1 not managed by cephadm >>>>> stray daemon osd.1 on host ceph1 not managed by cephadm >>>>> stray daemon osd.4 on host ceph1 not managed by cephadm >>>>> >>>>> >>>>> root@ceph1:~# ceph orch host ls >>>>> HOST ADDR LABELS STATUS >>>>> ceph1 10.73.0.192 >>>>> ceph2 10.73.3.192 _admin >>>>> 2 hosts in cluster >>>>> >>>>> >>>>> My cephadm ls saying mgr is in error state >>>>> >>>>> { >>>>> "style": "cephadm:v1", >>>>> "name": "mgr.ceph1.xmbvsb", >>>>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >>>>> "systemd_unit": >>>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb", >>>>> "enabled": true, >>>>> "state": "error", >>>>> "container_id": null, >>>>> "container_image_name": "quay.io/ceph/ceph:v15", >>>>> "container_image_id": null, >>>>> "version": null, >>>>> "started": null, >>>>> "created": "2022-09-01T20:59:49.314347Z", >>>>> "deployed": "2022-09-01T20:59:48.718347Z", >>>>> "configured": "2022-09-01T20:59:49.314347Z" >>>>> }, >>>>> >>>>> >>>>> Getting error >>>>> >>>>> root@ceph1:~# cephadm unit --fsid >>>>> f270ad9e-1f6f-11ed-b6f8-a539d87379ea --name mgr.ceph1.xmbvsb start >>>>> stderr Job for >>>>> ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb.service >>>>> failed because the control process exited with error code. >>>>> stderr See "systemctl status >>>>> ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb.service" >>>>> and "journalctl -xe" for details. >>>>> Traceback (most recent call last): >>>>> File "/usr/sbin/cephadm", line 6250, in <module> >>>>> r = args.func() >>>>> File "/usr/sbin/cephadm", line 1357, in _infer_fsid >>>>> return func() >>>>> File "/usr/sbin/cephadm", line 3727, in command_unit >>>>> call_throws([ >>>>> File "/usr/sbin/cephadm", line 1119, in call_throws >>>>> raise RuntimeError('Failed command: %s' % ' '.join(command)) >>>>> RuntimeError: Failed command: systemctl start >>>>> ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb >>>>> >>>>> >>>>> How do I remove and re-deploy mgr? >>>>> >>>>> On Thu, Sep 1, 2022 at 4:54 PM Adam King <adking@xxxxxxxxxx> wrote: >>>>> >>>>>> cephadm deploys the containers with --rm so they will get removed if >>>>>> you stop them. As for getting the 2nd mgr back, if it still lists the 2nd >>>>>> one in `ceph orch ps` you should be able to do a `ceph orch daemon redeploy >>>>>> <mgr-daemon-name>` where <mgr-daemon-name> should match the name given in >>>>>> the orch ps output for the one that isn't actually up. If it isn't listed >>>>>> there, given you have a count of 2, cephadm should deploy another one. I do >>>>>> see in the orch ls output you posted that it says the mgr service has "2/2" >>>>>> running which implies it believes a 2nd mgr is present (and you would >>>>>> therefore be able to try the daemon redeploy if that daemon isn't actually >>>>>> there). >>>>>> >>>>>> Is it still reporting the duplicate osds in orch ps? I see in the >>>>>> cephadm ls output on ceph1 that osd.2 isn't being reported, which was >>>>>> reported as being on ceph1 in the orch ps output in your original message >>>>>> in this thread. 
I'm interested in what `ceph health detail` is reporting >>>>>> now as well, as it says there are 4 stray daemons. Also, the `ceph orch >>>>>> host ls` output just to get a better grasp of the topology of this cluster. >>>>>> >>>>>> On Thu, Sep 1, 2022 at 3:50 PM Satish Patel <satish.txt@xxxxxxxxx> >>>>>> wrote: >>>>>> >>>>>>> Adam, >>>>>>> >>>>>>> I have posted a question related to upgrading earlier and this >>>>>>> thread is related to that, I have opened a new one because I found that >>>>>>> error in logs and thought the upgrade may be stuck because of duplicate >>>>>>> OSDs. >>>>>>> >>>>>>> root@ceph1:~# ls -l >>>>>>> /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ >>>>>>> total 44 >>>>>>> drwx------ 3 nobody nogroup 4096 Aug 19 05:37 alertmanager.ceph1 >>>>>>> drwx------ 3 167 167 4096 Aug 19 05:36 crash >>>>>>> drwx------ 2 167 167 4096 Aug 19 05:37 crash.ceph1 >>>>>>> drwx------ 4 998 996 4096 Aug 19 05:37 grafana.ceph1 >>>>>>> drwx------ 2 167 167 4096 Aug 19 05:36 mgr.ceph1.xmbvsb >>>>>>> drwx------ 3 167 167 4096 Aug 19 05:36 mon.ceph1 >>>>>>> drwx------ 2 nobody nogroup 4096 Aug 19 05:37 node-exporter.ceph1 >>>>>>> drwx------ 2 167 167 4096 Aug 19 17:55 osd.0 >>>>>>> drwx------ 2 167 167 4096 Aug 19 18:03 osd.1 >>>>>>> drwx------ 2 167 167 4096 Aug 31 05:20 osd.4 >>>>>>> drwx------ 4 nobody nogroup 4096 Aug 19 05:38 prometheus.ceph1 >>>>>>> >>>>>>> Here is the output of cephadm ls >>>>>>> >>>>>>> root@ceph1:~# cephadm ls >>>>>>> [ >>>>>>> { >>>>>>> "style": "cephadm:v1", >>>>>>> "name": "alertmanager.ceph1", >>>>>>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >>>>>>> "systemd_unit": >>>>>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@alertmanager.ceph1", >>>>>>> "enabled": true, >>>>>>> "state": "running", >>>>>>> "container_id": >>>>>>> "97403cf9799711461216b7f83e88c574da2b631c7c65233ebd82d8a216a48924", >>>>>>> "container_image_name": " >>>>>>> quay.io/prometheus/alertmanager:v0.20.0", >>>>>>> "container_image_id": >>>>>>> "0881eb8f169f5556a292b4e2c01d683172b12830a62a9225a98a8e206bb734f0", >>>>>>> "version": "0.20.0", >>>>>>> "started": "2022-08-19T16:59:02.461978Z", >>>>>>> "created": "2022-08-19T03:37:16.403605Z", >>>>>>> "deployed": "2022-08-19T03:37:15.815605Z", >>>>>>> "configured": "2022-08-19T16:59:02.117607Z" >>>>>>> }, >>>>>>> { >>>>>>> "style": "cephadm:v1", >>>>>>> "name": "grafana.ceph1", >>>>>>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >>>>>>> "systemd_unit": >>>>>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@grafana.ceph1", >>>>>>> "enabled": true, >>>>>>> "state": "running", >>>>>>> "container_id": >>>>>>> "c7136aea8349a37dd9b320acd926c4bcbed95bc4549779e9580ed4290edc2117", >>>>>>> "container_image_name": "quay.io/ceph/ceph-grafana:6.7.4", >>>>>>> "container_image_id": >>>>>>> "557c83e11646f123a27b5e4b62ac6c45e7bb8b2e90d6044034d0db5b7019415c", >>>>>>> "version": "6.7.4", >>>>>>> "started": "2022-08-19T03:38:05.481992Z", >>>>>>> "created": "2022-08-19T03:37:46.823604Z", >>>>>>> "deployed": "2022-08-19T03:37:46.239604Z", >>>>>>> "configured": "2022-08-19T03:38:05.163603Z" >>>>>>> }, >>>>>>> { >>>>>>> "style": "cephadm:v1", >>>>>>> "name": "osd.1", >>>>>>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >>>>>>> "systemd_unit": >>>>>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@osd.1", >>>>>>> "enabled": true, >>>>>>> "state": "running", >>>>>>> "container_id": >>>>>>> "51586b775bda0485c8b27b8401ac2430570e6f42cb7e12bae3eea05064f1fd20", >>>>>>> "container_image_name": "quay.io/ceph/ceph:v15", >>>>>>> "container_image_id": >>>>>>> 
"93146564743febec815d6a764dad93fc07ce971e88315403ac508cb5da6d35f4", >>>>>>> "version": "15.2.17", >>>>>>> "started": "2022-08-19T16:03:10.612432Z", >>>>>>> "created": "2022-08-19T16:03:09.765746Z", >>>>>>> "deployed": "2022-08-19T16:03:09.141746Z", >>>>>>> "configured": "2022-08-31T02:53:34.224643Z" >>>>>>> }, >>>>>>> { >>>>>>> "style": "cephadm:v1", >>>>>>> "name": "prometheus.ceph1", >>>>>>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >>>>>>> "systemd_unit": >>>>>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@prometheus.ceph1", >>>>>>> "enabled": true, >>>>>>> "state": "running", >>>>>>> "container_id": >>>>>>> "ba305236e5db9f2095b23b86a2340924909e9e8e54e5cdbe1d51c14dc4c8587a", >>>>>>> "container_image_name": " >>>>>>> quay.io/prometheus/prometheus:v2.18.1", >>>>>>> "container_image_id": >>>>>>> "de242295e2257c37c8cadfd962369228f8f10b2d48a44259b65fef44ad4f6490", >>>>>>> "version": "2.18.1", >>>>>>> "started": "2022-08-19T16:59:03.538981Z", >>>>>>> "created": "2022-08-19T03:38:01.567604Z", >>>>>>> "deployed": "2022-08-19T03:38:00.983603Z", >>>>>>> "configured": "2022-08-19T16:59:03.193607Z" >>>>>>> }, >>>>>>> { >>>>>>> "style": "cephadm:v1", >>>>>>> "name": "node-exporter.ceph1", >>>>>>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >>>>>>> "systemd_unit": >>>>>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@node-exporter.ceph1", >>>>>>> "enabled": true, >>>>>>> "state": "running", >>>>>>> "container_id": >>>>>>> "00bf3ad29cce79e905e8533648ef38cbd232990fa9616aff1c0020b7b66d0cc0", >>>>>>> "container_image_name": " >>>>>>> quay.io/prometheus/node-exporter:v0.18.1", >>>>>>> "container_image_id": >>>>>>> "e5a616e4b9cf68dfcad7782b78e118be4310022e874d52da85c55923fb615f87", >>>>>>> "version": "0.18.1", >>>>>>> "started": "2022-08-19T03:37:55.232032Z", >>>>>>> "created": "2022-08-19T03:37:47.711604Z", >>>>>>> "deployed": "2022-08-19T03:37:47.155604Z", >>>>>>> "configured": "2022-08-19T03:37:47.711604Z" >>>>>>> }, >>>>>>> { >>>>>>> "style": "cephadm:v1", >>>>>>> "name": "osd.0", >>>>>>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >>>>>>> "systemd_unit": >>>>>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@osd.0", >>>>>>> "enabled": true, >>>>>>> "state": "running", >>>>>>> "container_id": >>>>>>> "6b69046972dfbdb53665228258a15b13bc13a462ca4e066a4eca0cd593442d2d", >>>>>>> "container_image_name": "quay.io/ceph/ceph:v15", >>>>>>> "container_image_id": >>>>>>> "93146564743febec815d6a764dad93fc07ce971e88315403ac508cb5da6d35f4", >>>>>>> "version": "15.2.17", >>>>>>> "started": "2022-08-19T15:55:20.580157Z", >>>>>>> "created": "2022-08-19T15:55:19.725766Z", >>>>>>> "deployed": "2022-08-19T15:55:19.125766Z", >>>>>>> "configured": "2022-08-31T02:53:34.760643Z" >>>>>>> }, >>>>>>> { >>>>>>> "style": "cephadm:v1", >>>>>>> "name": "crash.ceph1", >>>>>>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >>>>>>> "systemd_unit": >>>>>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@crash.ceph1", >>>>>>> "enabled": true, >>>>>>> "state": "running", >>>>>>> "container_id": >>>>>>> "6bc56f478ccb96841fe86a540e284c175300b83dad9e906ae3230f22341c8293", >>>>>>> "container_image_name": "quay.io/ceph/ceph:v15", >>>>>>> "container_image_id": >>>>>>> "93146564743febec815d6a764dad93fc07ce971e88315403ac508cb5da6d35f4", >>>>>>> "version": "15.2.17", >>>>>>> "started": "2022-08-19T03:37:17.660080Z", >>>>>>> "created": "2022-08-19T03:37:17.559605Z", >>>>>>> "deployed": "2022-08-19T03:37:16.991605Z", >>>>>>> "configured": "2022-08-19T03:37:17.559605Z" >>>>>>> }, >>>>>>> { >>>>>>> "style": "cephadm:v1", >>>>>>> "name": "mon.ceph1", 
>>>>>>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >>>>>>> "systemd_unit": >>>>>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mon.ceph1", >>>>>>> "enabled": true, >>>>>>> "state": "running", >>>>>>> "container_id": >>>>>>> "d0f03130491daebbe783c4990c6a4383d49e7a0e2bdf8c5d1eed012865e5d875", >>>>>>> "container_image_name": "quay.io/ceph/ceph:v15", >>>>>>> "container_image_id": >>>>>>> "93146564743febec815d6a764dad93fc07ce971e88315403ac508cb5da6d35f4", >>>>>>> "version": "15.2.17", >>>>>>> "started": "2022-08-19T03:36:21.804129Z", >>>>>>> "created": "2022-08-19T03:36:19.743608Z", >>>>>>> "deployed": "2022-08-19T03:36:18.439608Z", >>>>>>> "configured": "2022-08-19T03:38:05.931603Z" >>>>>>> }, >>>>>>> { >>>>>>> "style": "cephadm:v1", >>>>>>> "name": "mgr.ceph1.xmbvsb", >>>>>>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >>>>>>> "systemd_unit": >>>>>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb", >>>>>>> "enabled": true, >>>>>>> "state": "stopped", >>>>>>> "container_id": null, >>>>>>> "container_image_name": "quay.io/ceph/ceph:v15", >>>>>>> "container_image_id": null, >>>>>>> "version": null, >>>>>>> "started": null, >>>>>>> "created": "2022-08-19T03:36:22.815608Z", >>>>>>> "deployed": "2022-08-19T03:36:22.239608Z", >>>>>>> "configured": "2022-08-19T03:38:06.487603Z" >>>>>>> }, >>>>>>> { >>>>>>> "style": "cephadm:v1", >>>>>>> "name": "osd.4", >>>>>>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >>>>>>> "systemd_unit": >>>>>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@osd.4", >>>>>>> "enabled": true, >>>>>>> "state": "running", >>>>>>> "container_id": >>>>>>> "938840fe7fd0cb45cc26d077837c9847d7c7a7a68c7e1588d4bb4343c695a071", >>>>>>> "container_image_name": "quay.io/ceph/ceph:v15", >>>>>>> "container_image_id": >>>>>>> "93146564743febec815d6a764dad93fc07ce971e88315403ac508cb5da6d35f4", >>>>>>> "version": "15.2.17", >>>>>>> "started": "2022-08-31T03:20:55.416219Z", >>>>>>> "created": "2022-08-23T21:46:49.458533Z", >>>>>>> "deployed": "2022-08-23T21:46:48.818533Z", >>>>>>> "configured": "2022-08-31T02:53:41.196643Z" >>>>>>> } >>>>>>> ] >>>>>>> >>>>>>> >>>>>>> I have noticed one more thing, I did docker stop >>>>>>> <container_id_of_mgr> on ceph1 node and now my mgr container disappeared, I >>>>>>> can't see it anywhere and not sure how do i bring back mgr because upgrade >>>>>>> won't let me do anything if i don't have two mgr instance. >>>>>>> >>>>>>> root@ceph1:~# ceph -s >>>>>>> cluster: >>>>>>> id: f270ad9e-1f6f-11ed-b6f8-a539d87379ea >>>>>>> health: HEALTH_WARN >>>>>>> 4 stray daemon(s) not managed by cephadm >>>>>>> >>>>>>> services: >>>>>>> mon: 1 daemons, quorum ceph1 (age 17h) >>>>>>> mgr: ceph2.hmbdla(active, since 5h) >>>>>>> osd: 6 osds: 6 up (since 40h), 6 in (since 8d) >>>>>>> >>>>>>> data: >>>>>>> pools: 6 pools, 161 pgs >>>>>>> objects: 20.59k objects, 85 GiB >>>>>>> usage: 174 GiB used, 826 GiB / 1000 GiB avail >>>>>>> pgs: 161 active+clean >>>>>>> >>>>>>> io: >>>>>>> client: 0 B/s rd, 12 KiB/s wr, 0 op/s rd, 2 op/s wr >>>>>>> >>>>>>> progress: >>>>>>> Upgrade to quay.io/ceph/ceph:16.2.10 (0s) >>>>>>> [............................] 
>>>>>>> >>>>>>> I can see mgr count:2 but not sure how do i bring it back >>>>>>> >>>>>>> root@ceph1:~# ceph orch ls >>>>>>> NAME PORTS RUNNING REFRESHED AGE >>>>>>> PLACEMENT >>>>>>> alertmanager ?:9093,9094 1/1 20s ago 13d >>>>>>> count:1 >>>>>>> crash 2/2 20s ago 13d * >>>>>>> grafana ?:3000 1/1 20s ago 13d >>>>>>> count:1 >>>>>>> mgr 2/2 20s ago 13d >>>>>>> count:2 >>>>>>> mon 0/5 - 13d >>>>>>> <unmanaged> >>>>>>> node-exporter ?:9100 2/2 20s ago 13d * >>>>>>> osd 6 20s ago - >>>>>>> <unmanaged> >>>>>>> osd.all-available-devices 0 - 13d * >>>>>>> osd.osd_spec_default 0 - 8d * >>>>>>> prometheus ?:9095 1/1 20s ago 13d >>>>>>> count:1 >>>>>>> >>>>>>> On Thu, Sep 1, 2022 at 12:28 PM Adam King <adking@xxxxxxxxxx> wrote: >>>>>>> >>>>>>>> Are there any extra directories in /var/lib/ceph or >>>>>>>> /var/lib/ceph/<fsid> that appear to be for those OSDs on that host? When >>>>>>>> cephadm builds the info it uses for "ceph orch ps" it's actually scraping >>>>>>>> those directories. The output of "cephadm ls" on the host with the >>>>>>>> duplicates could also potentially have some insights. >>>>>>>> >>>>>>>> On Thu, Sep 1, 2022 at 12:15 PM Satish Patel <satish.txt@xxxxxxxxx> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Folks, >>>>>>>>> >>>>>>>>> I am playing with cephadm and life was good until I started >>>>>>>>> upgrading from >>>>>>>>> octopus to pacific. My upgrade process stuck after upgrading mgr >>>>>>>>> and in >>>>>>>>> logs now i can see following error >>>>>>>>> >>>>>>>>> root@ceph1:~# ceph log last cephadm >>>>>>>>> 2022-09-01T14:40:45.739804+0000 mgr.ceph2.hmbdla (mgr.265806) 8 : >>>>>>>>> cephadm [INF] Deploying daemon grafana.ceph1 on ceph1 >>>>>>>>> 2022-09-01T14:40:56.115693+0000 mgr.ceph2.hmbdla (mgr.265806) 14 : >>>>>>>>> cephadm [INF] Deploying daemon prometheus.ceph1 on ceph1 >>>>>>>>> 2022-09-01T14:41:11.856725+0000 mgr.ceph2.hmbdla (mgr.265806) 25 : >>>>>>>>> cephadm [INF] Reconfiguring alertmanager.ceph1 (dependencies >>>>>>>>> changed)... >>>>>>>>> 2022-09-01T14:41:11.861535+0000 mgr.ceph2.hmbdla (mgr.265806) 26 : >>>>>>>>> cephadm [INF] Reconfiguring daemon alertmanager.ceph1 on ceph1 >>>>>>>>> 2022-09-01T14:41:12.927852+0000 mgr.ceph2.hmbdla (mgr.265806) 27 : >>>>>>>>> cephadm [INF] Reconfiguring grafana.ceph1 (dependencies changed)... >>>>>>>>> 2022-09-01T14:41:12.940615+0000 mgr.ceph2.hmbdla (mgr.265806) 28 : >>>>>>>>> cephadm [INF] Reconfiguring daemon grafana.ceph1 on ceph1 >>>>>>>>> 2022-09-01T14:41:14.056113+0000 mgr.ceph2.hmbdla (mgr.265806) 33 : >>>>>>>>> cephadm [INF] Found duplicate OSDs: osd.2 in status running on >>>>>>>>> ceph1, >>>>>>>>> osd.2 in status running on ceph2 >>>>>>>>> 2022-09-01T14:41:14.056437+0000 mgr.ceph2.hmbdla (mgr.265806) 34 : >>>>>>>>> cephadm [INF] Found duplicate OSDs: osd.5 in status running on >>>>>>>>> ceph1, >>>>>>>>> osd.5 in status running on ceph2 >>>>>>>>> 2022-09-01T14:41:14.056630+0000 mgr.ceph2.hmbdla (mgr.265806) 35 : >>>>>>>>> cephadm [INF] Found duplicate OSDs: osd.3 in status running on >>>>>>>>> ceph1, >>>>>>>>> osd.3 in status running on ceph2 >>>>>>>>> >>>>>>>>> >>>>>>>>> Not sure from where duplicate names came and how that happened. 
>>>>>>>>> In the following output I can't see any duplication:
>>>>>>>>>
>>>>>>>>> root@ceph1:~# ceph osd tree
>>>>>>>>> ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
>>>>>>>>> -1         0.97656  root default
>>>>>>>>> -3         0.48828      host ceph1
>>>>>>>>>  4    hdd  0.09769          osd.4       up   1.00000  1.00000
>>>>>>>>>  0    ssd  0.19530          osd.0       up   1.00000  1.00000
>>>>>>>>>  1    ssd  0.19530          osd.1       up   1.00000  1.00000
>>>>>>>>> -5         0.48828      host ceph2
>>>>>>>>>  5    hdd  0.09769          osd.5       up   1.00000  1.00000
>>>>>>>>>  2    ssd  0.19530          osd.2       up   1.00000  1.00000
>>>>>>>>>  3    ssd  0.19530          osd.3       up   1.00000  1.00000
>>>>>>>>>
>>>>>>>>> But at the same time I can see duplicate OSD numbers on ceph1 and ceph2:
>>>>>>>>>
>>>>>>>>> root@ceph1:~# ceph orch ps
>>>>>>>>> NAME                 HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
>>>>>>>>> alertmanager.ceph1   ceph1  *:9093,9094  running (20s)  2s ago     20s  17.1M    -                 ba2b418f427c  856a4fe641f1
>>>>>>>>> alertmanager.ceph1   ceph2  *:9093,9094  running (20s)  3s ago     20s  17.1M    -                 ba2b418f427c  856a4fe641f1
>>>>>>>>> crash.ceph2          ceph1               running (12d)  2s ago     12d  10.0M    -        15.2.17  93146564743f  0a009254afb0
>>>>>>>>> crash.ceph2          ceph2               running (12d)  3s ago     12d  10.0M    -        15.2.17  93146564743f  0a009254afb0
>>>>>>>>> grafana.ceph1        ceph1  *:3000       running (18s)  2s ago     19s  47.9M    -        8.3.5    dad864ee21e9  7d7a70b8ab7f
>>>>>>>>> grafana.ceph1        ceph2  *:3000       running (18s)  3s ago     19s  47.9M    -        8.3.5    dad864ee21e9  7d7a70b8ab7f
>>>>>>>>> mgr.ceph2.hmbdla     ceph1               running (13h)  2s ago     12d  506M     -        16.2.10  0d668911f040  6274723c35f7
>>>>>>>>> mgr.ceph2.hmbdla     ceph2               running (13h)  3s ago     12d  506M     -        16.2.10  0d668911f040  6274723c35f7
>>>>>>>>> node-exporter.ceph2  ceph1               running (91m)  2s ago     12d  60.7M    -        0.18.1   e5a616e4b9cf  d0ba04bb977c
>>>>>>>>> node-exporter.ceph2  ceph2               running (91m)  3s ago     12d  60.7M    -        0.18.1   e5a616e4b9cf  d0ba04bb977c
>>>>>>>>> osd.2                ceph1               running (12h)  2s ago     12d  867M     4096M    15.2.17  93146564743f  e286fb1c6302
>>>>>>>>> osd.2                ceph2               running (12h)  3s ago     12d  867M     4096M    15.2.17  93146564743f  e286fb1c6302
>>>>>>>>> osd.3                ceph1               running (12h)  2s ago     12d  978M     4096M    15.2.17  93146564743f  d3ae5d9f694f
>>>>>>>>> osd.3                ceph2               running (12h)  3s ago     12d  978M     4096M    15.2.17  93146564743f  d3ae5d9f694f
>>>>>>>>> osd.5                ceph1               running (12h)  2s ago     8d   225M     4096M    15.2.17  93146564743f  405068fb474e
>>>>>>>>> osd.5                ceph2               running (12h)  3s ago     8d   225M     4096M    15.2.17  93146564743f  405068fb474e
>>>>>>>>> prometheus.ceph1     ceph1  *:9095       running (8s)   2s ago     8s   30.4M    -                 514e6a882f6e  9031dbe30cae
>>>>>>>>> prometheus.ceph1     ceph2  *:9095       running (8s)   3s ago     8s   30.4M    -                 514e6a882f6e  9031dbe30cae
>>>>>>>>>
>>>>>>>>> Is this a bug, or did I do something wrong? Any workaround to get
>>>>>>>>> out of this condition?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx