Hi Sebastian,

thanks, that seems to have worked, at least on one of the two nodes. But now I have another problem: it seems that all mgr daemons are gone and the ceph command hangs.

[root@gedasvl02 ~]# cephadm ls | grep mgr

I tried to deploy a new mgr, but this doesn't seem to work either:

[root@gedasvl02 ~]# cephadm ls | grep mgr
[root@gedasvl02 ~]# cephadm deploy --fsid d0920c36-2368-11eb-a5de-005056b703af --name mgr.gedaopl03
INFO:cephadm:Deploy daemon mgr.gedaopl03 ...

At least I can't see a mgr container on node gedaopl03:

[root@gedaopl03 ~]# podman ps
CONTAINER ID  IMAGE                                 COMMAND               CREATED     STATUS         PORTS  NAMES
63518d95201b  docker.io/prom/node-exporter:v0.18.1  --no-collector.ti...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-node-exporter.gedaopl03
aa9b57fd77b8  docker.io/ceph/ceph:v15               -n client.crash.g...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-crash.gedaopl03
8b02715f9cb4  docker.io/ceph/ceph:v15               -n osd.2 -f --set...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.2
40f15a6357fe  docker.io/ceph/ceph:v15               -n osd.7 -f --set...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.7
bda260378239  docker.io/ceph/ceph:v15               -n mds.cephfs.ged...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-mds.cephfs.gedaopl03.kybzgy

[root@gedaopl03 ~]# systemctl --failed
  UNIT                                                                        LOAD   ACTIVE SUB    DESCRIPTION
● ceph-d0920c36-2368-11eb-a5de-005056b703af@crash.gedaopl03.service           loaded failed failed Ceph crash.gedaopl03 for d0920c36-2368-11eb-a5de-005056b703af
● ceph-d0920c36-2368-11eb-a5de-005056b703af@mon.gedaopl03.service             loaded failed failed Ceph mon.gedaopl03 for d0920c36-2368-11eb-a5de-005056b703af
● ceph-d0920c36-2368-11eb-a5de-005056b703af@node-exporter.gedaopl03.service   loaded failed failed Ceph node-exporter.gedaopl03 for d0920c36-2368-11eb-a5de-005056b703af
● ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.3.service                     loaded failed failed Ceph osd.3 for d0920c36-2368-11eb-a5de-005056b703af

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

4 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

Maybe it's best to just scrap the whole cluster. It's only for testing, but I guess it would also be good recovery practice. :)

On 12 March 2021 at 12:35, Sebastian Wagner <swagner@xxxxxxxx> wrote:

Hi Oliver,

# ssh gedaopl02
# cephadm rm-daemon osd.0

should do the trick. Be careful to remove the broken OSD :-)

Best,
Sebastian

On 11.03.21 at 22:10, Oliver Weinmann wrote:

Hi,

On my 3-node Octopus 15.2.5 test cluster, which I haven't used for quite a while, I noticed that it shows some errors:

[root@gedasvl02 ~]# ceph health detail
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
HEALTH_WARN 2 failed cephadm daemon(s)
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon osd.0 on gedaopl02 is in error state
    daemon node-exporter.gedaopl01 on gedaopl01 is in error state

The error about osd.0 is strange, since osd.0 is actually up and running, but on a different node.
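Something like this should confirm which host the cluster itself maps osd.0 to (just a sketch using standard Ceph commands, not output from this cluster):

    # CRUSH location of osd.0, including the host bucket it sits under
    ceph osd find 0
    # the hostname the daemon registered from, as recorded in its metadata
    ceph osd metadata 0 | grep hostname
    # tree view: every OSD listed under its host
    ceph osd tree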
I guess I failed to remove it correctly from node gedaopl02 and then added a new OSD on a different node (gedaopl01), so now there are duplicate OSD IDs for osd.0 and osd.2 (marked with asterisks in the output below).

[root@gedasvl02 ~]# ceph orch ps
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
NAME                         HOST       STATUS        REFRESHED  AGE  VERSION    IMAGE NAME                            IMAGE ID      CONTAINER ID
alertmanager.gedasvl02       gedasvl02  running (6h)  7m ago     4M   0.20.0     docker.io/prom/alertmanager:v0.20.0   0881eb8f169f  5b80fb977a5f
crash.gedaopl01              gedaopl01  stopped       7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  810cf432b6d6
crash.gedaopl02              gedaopl02  running (5h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  34ab264fd5ed
crash.gedaopl03              gedaopl03  running (2d)  7m ago     2d   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  233f30086d2d
crash.gedasvl02              gedasvl02  running (6h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  ea3d3e7c4f58
grafana.gedasvl02            gedasvl02  running (6h)  7m ago     4M   6.6.2      docker.io/ceph/ceph-grafana:6.6.2     a0dce381714a  5a94f3e41c32
mds.cephfs.gedaopl01.zjuhem  gedaopl01  stopped       7m ago     3M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>
mds.cephfs.gedasvl02.xsjtpi  gedasvl02  running (6h)  7m ago     3M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  26e7c8759d89
mgr.gedaopl03.zilwbl         gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  e18b6f40871c
mon.gedaopl03                gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  5afdf40e41ba
mon.gedasvl02                gedasvl02  running (6h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  e83dfcd864aa
node-exporter.gedaopl01      gedaopl01  error         7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  0fefcfcc9639
node-exporter.gedaopl02      gedaopl02  running (5h)  7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  f459045b7e41
node-exporter.gedaopl03      gedaopl03  running (2d)  7m ago     2d   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  3bd9f8dd6d5b
node-exporter.gedasvl02      gedasvl02  running (6h)  7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  72e96963261e
*osd.0                       gedaopl01  running (5h)  7m ago     5h   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  ed76fafb1988*
*osd.0                       gedaopl02  error         7m ago     4M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>*
osd.1                        gedaopl01  running (4h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  41a43733e601
*osd.2                       gedaopl01  stopped       7m ago     4M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>*
*osd.2                       gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  ac9e660db2fb*
osd.3                        gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  bde17b5bb2fb
osd.4                        gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  7cc3ef7c4469
osd.5                        gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  761b96d235e4
osd.6                        gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  d047b28fe2bd
osd.7                        gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  3b54b01841f4
osd.8                        gedaopl01  running (5h)  7m ago     5h   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  cdd308cdc82b
prometheus.gedasvl02         gedasvl02  running (5h)  7m ago     4M   2.18.1     docker.io/prom/prometheus:v2.18.1     de242295e225  591cef3bbaa4

Is there a way to clean up / purge the stopped and error ones? I don't know what is wrong with the node-exporter, because the podman ps -a output on gedaopl01 looks OK.
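Since the orchestrator reports node-exporter.gedaopl01 in an error state while the container itself looks fine, the systemd unit state and journal on that host are probably worth a look. A sketch, assuming the unit on gedaopl01 follows the same naming pattern as the ones in the systemctl --failed output above:

    # state of the cephadm-managed node-exporter unit on gedaopl01
    systemctl status ceph-d0920c36-2368-11eb-a5de-005056b703af@node-exporter.gedaopl01.service
    # recent log lines from that unit
    journalctl -u ceph-d0920c36-2368-11eb-a5de-005056b703af@node-exporter.gedaopl01.service --since "1 hour ago"
    # once the container is confirmed healthy, clear a stale failed state
    systemctl reset-failed ceph-d0920c36-2368-11eb-a5de-005056b703af@node-exporter.gedaopl01.service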
Maybe it's also a zombie daemon?

[root@gedaopl01 ~]# podman ps -a
CONTAINER ID  IMAGE                                 COMMAND               CREATED         STATUS             PORTS  NAMES
e71898f7d038  docker.io/prom/node-exporter:v0.18.1  --no-collector.ti...  54 seconds ago  Up 54 seconds ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-node-exporter.gedaopl01
41a43733e601  docker.io/ceph/ceph:v15               -n osd.1 -f --set...  5 hours ago     Up 5 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.1
810cf432b6d6  docker.io/ceph/ceph:v15               -n client.crash.g...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-crash.gedaopl01
cdd308cdc82b  docker.io/ceph/ceph:v15               -n osd.8 -f --set...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.8
ed76fafb1988  docker.io/ceph/ceph:v15               -n osd.0 -f --set...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.0

I replaced the very old disks with some brand-new Samsung PM883s and would like to upgrade to 15.2.9, but the upgrade guide recommends doing this only on a healthy cluster. :)

Cheers,
Oliver

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg). Geschäftsführer: Felix Imendörffer

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx