Hi,
On my 3-node Octopus 15.2.5 test cluster, which I haven't used for quite
a while, I noticed that it shows some errors:
[root@gedasvl02 ~]# ceph health detail
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
HEALTH_WARN 2 failed cephadm daemon(s)
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
daemon osd.0 on gedaopl02 is in error state
daemon node-exporter.gedaopl01 on gedaopl01 is in error state
The error about osd.0 is strange, since osd.0 is actually up and
running, just on a different node. I guess I failed to remove it
properly from node gedaopl02 before adding a new OSD on a different
node, gedaopl01, and now there are duplicate entries for osd.0 and osd.2.
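Before cleaning anything up I would probably first double-check which of
the two entries is the stale one, roughly like this (just my guess at
the right commands, so please correct me if there is a better way):

  ceph osd tree            # where the cluster itself places osd.0
  cephadm ls | grep osd    # run on gedaopl02: what cephadm still has deployed locally

Anyway, here is the full ceph orch ps output: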
[root@gedasvl02 ~]# ceph orch ps
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
NAME                        HOST       STATUS        REFRESHED  AGE  VERSION    IMAGE NAME                            IMAGE ID      CONTAINER ID
alertmanager.gedasvl02      gedasvl02  running (6h)  7m ago     4M   0.20.0     docker.io/prom/alertmanager:v0.20.0   0881eb8f169f  5b80fb977a5f
crash.gedaopl01             gedaopl01  stopped       7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  810cf432b6d6
crash.gedaopl02             gedaopl02  running (5h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  34ab264fd5ed
crash.gedaopl03             gedaopl03  running (2d)  7m ago     2d   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  233f30086d2d
crash.gedasvl02             gedasvl02  running (6h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  ea3d3e7c4f58
grafana.gedasvl02           gedasvl02  running (6h)  7m ago     4M   6.6.2      docker.io/ceph/ceph-grafana:6.6.2     a0dce381714a  5a94f3e41c32
mds.cephfs.gedaopl01.zjuhem gedaopl01  stopped       7m ago     3M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>
mds.cephfs.gedasvl02.xsjtpi gedasvl02  running (6h)  7m ago     3M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  26e7c8759d89
mgr.gedaopl03.zilwbl        gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  e18b6f40871c
mon.gedaopl03               gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  5afdf40e41ba
mon.gedasvl02               gedasvl02  running (6h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  e83dfcd864aa
node-exporter.gedaopl01     gedaopl01  error         7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  0fefcfcc9639
node-exporter.gedaopl02     gedaopl02  running (5h)  7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  f459045b7e41
node-exporter.gedaopl03     gedaopl03  running (2d)  7m ago     2d   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  3bd9f8dd6d5b
node-exporter.gedasvl02     gedasvl02  running (6h)  7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  72e96963261e
osd.0                       gedaopl01  running (5h)  7m ago     5h   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  ed76fafb1988  <-- duplicate
osd.0                       gedaopl02  error         7m ago     4M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>     <-- duplicate
osd.1                       gedaopl01  running (4h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  41a43733e601
osd.2                       gedaopl01  stopped       7m ago     4M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>     <-- duplicate
osd.2                       gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  ac9e660db2fb  <-- duplicate
osd.3                       gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  bde17b5bb2fb
osd.4                       gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  7cc3ef7c4469
osd.5                       gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  761b96d235e4
osd.6                       gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  d047b28fe2bd
osd.7                       gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  3b54b01841f4
osd.8                       gedaopl01  running (5h)  7m ago     5h   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  cdd308cdc82b
prometheus.gedasvl02        gedasvl02  running (5h)  7m ago     4M   2.18.1     docker.io/prom/prometheus:v2.18.1     de242295e225  591cef3bbaa4
Is there a way to clean up / purge the stopped and errored ones?
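I was wondering whether something like this, run on the host that still
carries the stale entry, is the intended way to get rid of it (just a
guess on my side, so please correct me if it is wrong or unsafe):

  # on gedaopl02: remove the leftover osd.0 entry, fsid taken from the output above
  cephadm rm-daemon --name osd.0 --fsid d0920c36-2368-11eb-a5de-005056b703af --force

Or maybe "ceph orch daemon rm osd.0 --force" from the orchestrator side,
although I am not sure how that behaves while two daemons share the same
name.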
I also don't know what is wrong with the node-exporter, because podman
ps -a on gedaopl01 looks OK. Maybe that is a zombie daemon as well?
[root@gedaopl01 ~]# podman ps -a
CONTAINER ID  IMAGE                                 COMMAND               CREATED         STATUS             PORTS  NAMES
e71898f7d038  docker.io/prom/node-exporter:v0.18.1  --no-collector.ti...  54 seconds ago  Up 54 seconds ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-node-exporter.gedaopl01
41a43733e601  docker.io/ceph/ceph:v15               -n osd.1 -f --set...  5 hours ago     Up 5 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.1
810cf432b6d6  docker.io/ceph/ceph:v15               -n client.crash.g...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-crash.gedaopl01
cdd308cdc82b  docker.io/ceph/ceph:v15               -n osd.8 -f --set...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.8
ed76fafb1988  docker.io/ceph/ceph:v15               -n osd.0 -f --set...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.0
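Although, now that I compare the two outputs: podman shows the
node-exporter container as e71898f7d038 and only up for 54 seconds,
while ceph orch ps reported 0fefcfcc9639, so maybe it is actually
restarting in a loop. If so, I guess the systemd journal on gedaopl01
would tell (unit name derived from the usual cephadm naming scheme, so
treat it as a guess):

  systemctl status ceph-d0920c36-2368-11eb-a5de-005056b703af@node-exporter.gedaopl01.service
  journalctl -u ceph-d0920c36-2368-11eb-a5de-005056b703af@node-exporter.gedaopl01.service -n 50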
I replaced the very old disks with some brand new Samsung PM883s and
would like to upgrade to 15.2.9, but the upgrade guide recommends doing
this on a healthy cluster only. :)
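Once the health warnings are gone, I assume the cephadm upgrade itself
is just the following (please correct me if there is more to it):

  ceph orch upgrade start --ceph-version 15.2.9
  ceph orch upgrade status    # to watch the progress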
Cheers,
Oliver
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx