Unhealthy Cluster | Remove / Purge duplicate osds | Fix daemon

Hi,

My 3-node Octopus 15.2.5 test cluster, which I haven't used for quite a while, is showing some errors:

[root@gedasvl02 ~]# ceph health detail
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
HEALTH_WARN 2 failed cephadm daemon(s)
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon osd.0 on gedaopl02 is in error state
    daemon node-exporter.gedaopl01 on gedaopl01 is in error state

The error about osd.0 is strange, since osd.0 is actually up and running, just on a different node. I guess I failed to remove it properly from node gedaopl02 before adding a new OSD on node gedaopl01, so now there are duplicate OSD IDs for osd.0 and osd.2 (see the orch ps output and the checks I'd try below).

[root@gedasvl02 ~]# ceph orch ps
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
NAME                          HOST       STATUS        REFRESHED  AGE  VERSION    IMAGE NAME                             IMAGE ID      CONTAINER ID
alertmanager.gedasvl02        gedasvl02  running (6h)  7m ago     4M   0.20.0     docker.io/prom/alertmanager:v0.20.0    0881eb8f169f  5b80fb977a5f
crash.gedaopl01               gedaopl01  stopped       7m ago     4M   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  810cf432b6d6
crash.gedaopl02               gedaopl02  running (5h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  34ab264fd5ed
crash.gedaopl03               gedaopl03  running (2d)  7m ago     2d   15.2.9     docker.io/ceph/ceph:v15                dfc483079636  233f30086d2d
crash.gedasvl02               gedasvl02  running (6h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  ea3d3e7c4f58
grafana.gedasvl02             gedasvl02  running (6h)  7m ago     4M   6.6.2      docker.io/ceph/ceph-grafana:6.6.2      a0dce381714a  5a94f3e41c32
mds.cephfs.gedaopl01.zjuhem   gedaopl01  stopped       7m ago     3M   <unknown>  docker.io/ceph/ceph:v15                <unknown>     <unknown>
mds.cephfs.gedasvl02.xsjtpi   gedasvl02  running (6h)  7m ago     3M   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  26e7c8759d89
mgr.gedaopl03.zilwbl          gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15                dfc483079636  e18b6f40871c
mon.gedaopl03                 gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15                dfc483079636  5afdf40e41ba
mon.gedasvl02                 gedasvl02  running (6h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  e83dfcd864aa
node-exporter.gedaopl01       gedaopl01  error         7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1   e5a616e4b9cf  0fefcfcc9639
node-exporter.gedaopl02       gedaopl02  running (5h)  7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1   e5a616e4b9cf  f459045b7e41
node-exporter.gedaopl03       gedaopl03  running (2d)  7m ago     2d   0.18.1     docker.io/prom/node-exporter:v0.18.1   e5a616e4b9cf  3bd9f8dd6d5b
node-exporter.gedasvl02       gedasvl02  running (6h)  7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1   e5a616e4b9cf  72e96963261e
*osd.0                        gedaopl01  running (5h)  7m ago     5h   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  ed76fafb1988*
*osd.0                        gedaopl02  error         7m ago     4M   <unknown>  docker.io/ceph/ceph:v15                <unknown>     <unknown>*
osd.1                         gedaopl01  running (4h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  41a43733e601
*osd.2                        gedaopl01  stopped       7m ago     4M   <unknown>  docker.io/ceph/ceph:v15                <unknown>     <unknown>*
*osd.2                        gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15                dfc483079636  ac9e660db2fb*
osd.3                         gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15                dfc483079636  bde17b5bb2fb
osd.4                         gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  7cc3ef7c4469
osd.5                         gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  761b96d235e4
osd.6                         gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  d047b28fe2bd
osd.7                         gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15                dfc483079636  3b54b01841f4
osd.8                         gedaopl01  running (5h)  7m ago     5h   15.2.5     docker.io/ceph/ceph:v15                4405f6339e35  cdd308cdc82b
prometheus.gedasvl02          gedasvl02  running (5h)  7m ago     4M   2.18.1     docker.io/prom/prometheus:v2.18.1      de242295e225  591cef3bbaa4
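
To double-check which host really owns which OSD, I was going to compare the orchestrator view with the cluster map. These are just the checks I'd try, not sure they are the canonical way:

ceph osd tree                          # which host each OSD sits under in the CRUSH map
ceph osd find 0                        # location info for osd.0
ceph osd metadata 2 | grep hostname    # hostname osd.2 last registered from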

Is there a way to clean up / purge the stopped and errored entries?
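
Would running cephadm rm-daemon on the node that still carries the leftover entry be the right approach? This is what I'd try, guessing that the stale osd.0 entry lives on gedaopl02 and the stale osd.2 entry on gedaopl01:

# on gedaopl02 (the real osd.0 runs on gedaopl01):
cephadm rm-daemon --fsid d0920c36-2368-11eb-a5de-005056b703af --name osd.0 --force

# on gedaopl01 (the real osd.2 runs on gedaopl03):
cephadm rm-daemon --fsid d0920c36-2368-11eb-a5de-005056b703af --name osd.2 --force

Or is "ceph orch daemon rm osd.0 --force" safe here, even with two entries of the same name on different hosts?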

I don't know what is wrong with the node-exporter; the output of podman ps -a on gedaopl01 looks OK. Maybe it's another zombie daemon entry?

[root@gedaopl01 ~]# podman ps -a
CONTAINER ID  IMAGE                                 COMMAND               CREATED         STATUS             PORTS  NAMES
e71898f7d038  docker.io/prom/node-exporter:v0.18.1  --no-collector.ti...  54 seconds ago  Up 54 seconds ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-node-exporter.gedaopl01
41a43733e601  docker.io/ceph/ceph:v15               -n osd.1 -f --set...  5 hours ago     Up 5 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.1
810cf432b6d6  docker.io/ceph/ceph:v15               -n client.crash.g...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-crash.gedaopl01
cdd308cdc82b  docker.io/ceph/ceph:v15               -n osd.8 -f --set...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.8
ed76fafb1988  docker.io/ceph/ceph:v15               -n osd.0 -f --set...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.0
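
For the node-exporter I would simply try restarting or redeploying the daemon through the orchestrator and see whether the error state clears. Again, just my guess at the right commands:

ceph orch daemon restart node-exporter.gedaopl01
# or, if a restart is not enough, recreate the container:
ceph orch daemon redeploy node-exporter.gedaopl01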

I replaced the very old disks with brand-new Samsung PM883s and would like to upgrade to 15.2.9, but the upgrade guide recommends doing that only on a healthy cluster. :)
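
Once the cluster is healthy again, I assume the upgrade itself comes down to the orchestrator commands from the docs, roughly:

ceph orch upgrade start --ceph-version 15.2.9
ceph orch upgrade status    # to watch the progress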

Cheers,

Oliver

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



