Unhealthy Cluster | Remove / Purge duplicate osds | Fix daemon

Hi,

My 3-node Octopus 15.2.5 test cluster, which I haven't used for quite a while, is showing some errors:

[root@gedasvl02 ~]# ceph health detail
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
HEALTH_WARN 2 failed cephadm daemon(s)
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon osd.0 on gedaopl02 is in error state
    daemon node-exporter.gedaopl01 on gedaopl01 is in error state

The error about osd.0 is strange, since osd.0 is actually up and running, just on a different node. I guess I failed to remove it properly from node gedaopl02 before adding a new OSD on node gedaopl01, so now there are duplicate OSD IDs for osd.0 and osd.2 (see the orch ps output and the checks I'd try below).

[root@gedasvl02 ~]# ceph orch ps
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
NAME                          HOST       STATUS        REFRESHED  AGE  VERSION    IMAGE NAME                             IMAGE ID      CONTAINER ID
alertmanager.gedasvl02        gedasvl02  running (6h)  7m ago     4M   0.20.0     docker.io/prom/alertmanager:v0.20.0    0881eb8f169f  5b80fb977a5f
crash.gedaopl01               gedaopl01  stopped       7m ago     4M   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  810cf432b6d6
crash.gedaopl02               gedaopl02  running (5h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  34ab264fd5ed
crash.gedaopl03               gedaopl03  running (2d)  7m ago     2d   15.2.9     docker.io/ceph/ceph:v15                dfc483079636  233f30086d2d
crash.gedasvl02               gedasvl02  running (6h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  ea3d3e7c4f58
grafana.gedasvl02             gedasvl02  running (6h)  7m ago     4M   6.6.2      docker.io/ceph/ceph-grafana:6.6.2      a0dce381714a  5a94f3e41c32
mds.cephfs.gedaopl01.zjuhem   gedaopl01  stopped       7m ago     3M   <unknown>  docker.io/ceph/ceph:v15                <unknown>     <unknown>
mds.cephfs.gedasvl02.xsjtpi   gedasvl02  running (6h)  7m ago     3M   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  26e7c8759d89
mgr.gedaopl03.zilwbl          gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15                dfc483079636  e18b6f40871c
mon.gedaopl03                 gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15                dfc483079636  5afdf40e41ba
mon.gedasvl02                 gedasvl02  running (6h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  e83dfcd864aa
node-exporter.gedaopl01       gedaopl01  error         7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1   e5a616e4b9cf  0fefcfcc9639
node-exporter.gedaopl02       gedaopl02  running (5h)  7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1   e5a616e4b9cf  f459045b7e41
node-exporter.gedaopl03       gedaopl03  running (2d)  7m ago     2d   0.18.1     docker.io/prom/node-exporter:v0.18.1   e5a616e4b9cf  3bd9f8dd6d5b
node-exporter.gedasvl02       gedasvl02  running (6h)  7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1   e5a616e4b9cf  72e96963261e
*osd.0                        gedaopl01  running (5h)  7m ago     5h   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  ed76fafb1988*
*osd.0                        gedaopl02  error         7m ago     4M   <unknown>  docker.io/ceph/ceph:v15                <unknown>     <unknown>*
osd.1                         gedaopl01  running (4h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  41a43733e601
*osd.2                        gedaopl01  stopped       7m ago     4M   <unknown>  docker.io/ceph/ceph:v15                <unknown>     <unknown>*
*osd.2                        gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15                dfc483079636  ac9e660db2fb*
osd.3                         gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15                dfc483079636  bde17b5bb2fb
osd.4                         gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  7cc3ef7c4469
osd.5                         gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  761b96d235e4
osd.6                         gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  d047b28fe2bd
osd.7                         gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15                dfc483079636  3b54b01841f4
osd.8                         gedaopl01  running (5h)  7m ago     5h   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  cdd308cdc82b
prometheus.gedasvl02          gedasvl02  running (5h)  7m ago     4M   2.18.1     docker.io/prom/prometheus:v2.18.1      de242295e225  591cef3bbaa4
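
To double-check which host really owns which OSD, I was going to compare the orchestrator view with the cluster map. These are just the checks I'd try, not sure they are the canonical way:

ceph osd tree                          # which host each OSD sits under in the CRUSH map
ceph osd find 0                        # location info for osd.0
ceph osd metadata 2 | grep hostname    # hostname osd.2 last registered from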

Is there a way to clean up / purge the stopped and errored entries?
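
Would running cephadm rm-daemon on the node that still carries the leftover entry be the right approach? This is what I'd try, guessing that the stale osd.0 entry lives on gedaopl02 and the stale osd.2 entry on gedaopl01:

# on gedaopl02 (the real osd.0 runs on gedaopl01):
cephadm rm-daemon --fsid d0920c36-2368-11eb-a5de-005056b703af --name osd.0 --force

# on gedaopl01 (the real osd.2 runs on gedaopl03):
cephadm rm-daemon --fsid d0920c36-2368-11eb-a5de-005056b703af --name osd.2 --force

Or is "ceph orch daemon rm osd.0 --force" safe here, even with two entries of the same name on different hosts?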

I don't know what is wrong with the node-exporter; the output of podman ps -a on gedaopl01 looks OK. Maybe it's another zombie daemon entry?

[root@gedaopl01 ~]# podman ps -a
CONTAINER ID  IMAGE                                 COMMAND               CREATED         STATUS             PORTS  NAMES
e71898f7d038  docker.io/prom/node-exporter:v0.18.1  --no-collector.ti...  54 seconds ago  Up 54 seconds ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-node-exporter.gedaopl01
41a43733e601  docker.io/ceph/ceph:v15               -n osd.1 -f --set...  5 hours ago     Up 5 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.1
810cf432b6d6  docker.io/ceph/ceph:v15               -n client.crash.g...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-crash.gedaopl01
cdd308cdc82b  docker.io/ceph/ceph:v15               -n osd.8 -f --set...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.8
ed76fafb1988  docker.io/ceph/ceph:v15               -n osd.0 -f --set...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.0
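
For the node-exporter I would simply try restarting or redeploying the daemon through the orchestrator and see whether the error state clears. Again, just my guess at the right commands:

ceph orch daemon restart node-exporter.gedaopl01
# or, if a restart is not enough, recreate the container:
ceph orch daemon redeploy node-exporter.gedaopl01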

I replaced the very old disks with brand-new Samsung PM883s and would like to upgrade to 15.2.9, but the upgrade guide recommends doing that only on a healthy cluster. :)
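
Once the cluster is healthy again, I assume the upgrade itself comes down to the orchestrator commands from the docs, roughly:

ceph orch upgrade start --ceph-version 15.2.9
ceph orch upgrade status    # to watch the progress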

Cheers,

Oliver

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



