Hi Oliver,

I don't know how you managed to remove all MGRs from the cluster, but
here is the documentation on how to manually recover from this:

https://docs.ceph.com/en/latest/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon
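From memory, the steps on that page boil down to something like the
sketch below. Please double-check every command against the page before
running anything; "mgr.gedaopl03.foo" is only a placeholder daemon name,
the fsid and image are taken from your output below, and all of it
assumes the mons still answer to the ceph CLI:

# ceph auth get-or-create mgr.gedaopl03.foo mon 'allow profile mgr' osd 'allow *' mds 'allow *'
# ceph config generate-minimal-conf

Put the minimal config and the new keyring into a small JSON file, say
config-json.json, with the two keys "config" and "keyring", then run the
deploy on gedaopl03 itself:

# cephadm --image docker.io/ceph/ceph:v15 deploy \
      --fsid d0920c36-2368-11eb-a5de-005056b703af \
      --name mgr.gedaopl03.foo --config-json config-json.json

As far as I know cephadm deploy only acts on the host it is run on, and
without a config and keyring the new mgr has nothing to authenticate
with, which would explain why your deploy attempt on gedasvl02 did not
leave a mgr container behind on gedaopl03.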
Hope that helps,

Sebastian

On 15.03.21 at 18:24, Oliver Weinmann wrote:
> Hi Sebastian,
>
> thanks, that seems to have worked, at least on one of the two nodes. But
> now I have another problem. It seems that all mgr daemons are gone and
> the ceph command is stuck.
>
> [root@gedasvl02 ~]# cephadm ls | grep mgr
>
> I tried to deploy a new mgr, but this doesn't seem to work either:
>
> [root@gedasvl02 ~]# cephadm ls | grep mgr
> [root@gedasvl02 ~]# cephadm deploy --fsid d0920c36-2368-11eb-a5de-005056b703af --name mgr.gedaopl03
> INFO:cephadm:Deploy daemon mgr.gedaopl03 ...
>
> At least I can't see a mgr container on node gedaopl03:
>
> [root@gedaopl03 ~]# podman ps
> CONTAINER ID  IMAGE                                 COMMAND               CREATED     STATUS         PORTS  NAMES
> 63518d95201b  docker.io/prom/node-exporter:v0.18.1  --no-collector.ti...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-node-exporter.gedaopl03
> aa9b57fd77b8  docker.io/ceph/ceph:v15               -n client.crash.g...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-crash.gedaopl03
> 8b02715f9cb4  docker.io/ceph/ceph:v15               -n osd.2 -f --set...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.2
> 40f15a6357fe  docker.io/ceph/ceph:v15               -n osd.7 -f --set...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.7
> bda260378239  docker.io/ceph/ceph:v15               -n mds.cephfs.ged...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-mds.cephfs.gedaopl03.kybzgy
>
> [root@gedaopl03 ~]# systemctl --failed
>   UNIT                                                                        LOAD    ACTIVE  SUB     DESCRIPTION
> ● ceph-d0920c36-2368-11eb-a5de-005056b703af@crash.gedaopl03.service          loaded  failed  failed  Ceph crash.gedaopl03 for d0920c36-2368-11eb-a5de-005056b703af
> ● ceph-d0920c36-2368-11eb-a5de-005056b703af@mon.gedaopl03.service            loaded  failed  failed  Ceph mon.gedaopl03 for d0920c36-2368-11eb-a5de-005056b703af
> ● ceph-d0920c36-2368-11eb-a5de-005056b703af@node-exporter.gedaopl03.service  loaded  failed  failed  Ceph node-exporter.gedaopl03 for d0920c36-2368-11eb-a5de-005056b703af
> ● ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.3.service                    loaded  failed  failed  Ceph osd.3 for d0920c36-2368-11eb-a5de-005056b703af
>
> LOAD   = Reflects whether the unit definition was properly loaded.
> ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
> SUB    = The low-level unit activation state, values depend on unit type.
>
> 4 loaded units listed. Pass --all to see loaded but inactive units, too.
> To show all installed unit files use 'systemctl list-unit-files'.
>
> Maybe it's best to just scrap the whole cluster. It is only for testing,
> but I guess it is also good practice for recovery. :)
>
> On 12 March 2021 at 12:35, Sebastian Wagner <swagner@xxxxxxxx> wrote:
>
>> Hi Oliver,
>>
>> # ssh gedaopl02
>> # cephadm rm-daemon osd.0
>>
>> should do the trick.
>>
>> Be careful to remove the broken OSD :-)
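>>
>> If plain "cephadm rm-daemon osd.0" complains about missing arguments,
>> the long form is roughly the following (from memory, so please check
>> "cephadm rm-daemon --help" for the exact flags):
>>
>> # cephadm rm-daemon --name osd.0 --fsid d0920c36-2368-11eb-a5de-005056b703af
>>
>> The same command should also let you clean up the other stale "stopped"
>> entries that ceph orch ps still lists, as long as you run it on the
>> host the stale entry points to.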
>>
>> Best,
>>
>> Sebastian
>>
>> On 11.03.21 at 22:10, Oliver Weinmann wrote:
>>> Hi,
>>>
>>> On my 3-node Octopus 15.2.5 test cluster, which I haven't used for quite
>>> a while, I noticed that it shows some errors:
>>>
>>> [root@gedasvl02 ~]# ceph health detail
>>> INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
>>> INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
>>> INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
>>> HEALTH_WARN 2 failed cephadm daemon(s)
>>> [WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
>>>     daemon osd.0 on gedaopl02 is in error state
>>>     daemon node-exporter.gedaopl01 on gedaopl01 is in error state
>>>
>>> The error about osd.0 is strange, since osd.0 is actually up and
>>> running, but on a different node. I guess I failed to remove it correctly
>>> from node gedaopl02 and then added a new OSD on a different node,
>>> gedaopl01, and now there are duplicate OSD ids for osd.0 and osd.2.
>>>
>>> [root@gedasvl02 ~]# ceph orch ps
>>> INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
>>> INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
>>> INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
>>> NAME                         HOST       STATUS        REFRESHED  AGE  VERSION    IMAGE NAME                            IMAGE ID      CONTAINER ID
>>> alertmanager.gedasvl02       gedasvl02  running (6h)  7m ago     4M   0.20.0     docker.io/prom/alertmanager:v0.20.0   0881eb8f169f  5b80fb977a5f
>>> crash.gedaopl01              gedaopl01  stopped       7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  810cf432b6d6
>>> crash.gedaopl02              gedaopl02  running (5h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  34ab264fd5ed
>>> crash.gedaopl03              gedaopl03  running (2d)  7m ago     2d   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  233f30086d2d
>>> crash.gedasvl02              gedasvl02  running (6h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  ea3d3e7c4f58
>>> grafana.gedasvl02            gedasvl02  running (6h)  7m ago     4M   6.6.2      docker.io/ceph/ceph-grafana:6.6.2     a0dce381714a  5a94f3e41c32
>>> mds.cephfs.gedaopl01.zjuhem  gedaopl01  stopped       7m ago     3M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>
>>> mds.cephfs.gedasvl02.xsjtpi  gedasvl02  running (6h)  7m ago     3M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  26e7c8759d89
>>> mgr.gedaopl03.zilwbl         gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  e18b6f40871c
>>> mon.gedaopl03                gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  5afdf40e41ba
>>> mon.gedasvl02                gedasvl02  running (6h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  e83dfcd864aa
>>> node-exporter.gedaopl01      gedaopl01  error         7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  0fefcfcc9639
>>> node-exporter.gedaopl02      gedaopl02  running (5h)  7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  f459045b7e41
>>> node-exporter.gedaopl03      gedaopl03  running (2d)  7m ago     2d   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  3bd9f8dd6d5b
>>> node-exporter.gedasvl02      gedasvl02  running (6h)  7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  72e96963261e
>>> *osd.0                       gedaopl01  running (5h)  7m ago     5h   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  ed76fafb1988*
>>> *osd.0                       gedaopl02  error         7m ago     4M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>*
>>> osd.1                        gedaopl01  running (4h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  41a43733e601
>>> *osd.2                       gedaopl01  stopped       7m ago     4M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>*
>>> *osd.2                       gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  ac9e660db2fb*
>>> osd.3                        gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  bde17b5bb2fb
>>> osd.4                        gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  7cc3ef7c4469
>>> osd.5                        gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  761b96d235e4
>>> osd.6                        gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  d047b28fe2bd
>>> osd.7                        gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  3b54b01841f4
>>> osd.8                        gedaopl01  running (5h)  7m ago     5h   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  cdd308cdc82b
>>> prometheus.gedasvl02         gedasvl02  running (5h)  7m ago     4M   2.18.1     docker.io/prom/prometheus:v2.18.1     de242295e225  591cef3bbaa4
>>>
>>> Is there a way to clean / purge the stopped and error ones?
>>>
>>> I don't know what is wrong with the node-exporter, because podman ps -a
>>> on gedaopl01 looks OK. Maybe that is also a zombie daemon?
>>>
>>> [root@gedaopl01 ~]# podman ps -a
>>> CONTAINER ID  IMAGE                                 COMMAND               CREATED         STATUS             PORTS  NAMES
>>> e71898f7d038  docker.io/prom/node-exporter:v0.18.1  --no-collector.ti...  54 seconds ago  Up 54 seconds ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-node-exporter.gedaopl01
>>> 41a43733e601  docker.io/ceph/ceph:v15               -n osd.1 -f --set...  5 hours ago     Up 5 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.1
>>> 810cf432b6d6  docker.io/ceph/ceph:v15               -n client.crash.g...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-crash.gedaopl01
>>> cdd308cdc82b  docker.io/ceph/ceph:v15               -n osd.8 -f --set...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.8
>>> ed76fafb1988  docker.io/ceph/ceph:v15               -n osd.0 -f --set...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.0
>>>
>>> I replaced the very old disks with brand new SAMSUNG PM883 drives and
>>> would like to upgrade to 15.2.9, but the upgrade guide recommends doing
>>> this only on a healthy cluster. :)
>>>
>>> Cheers,
>>>
>>> Oliver
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>> --
>> SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
>> (HRB 36809, AG Nürnberg). Geschäftsführer: Felix Imendörffer

--
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg). Geschäftsführer: Felix Imendörffer
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx