Re: Unhealthy Cluster | Remove / Purge duplicate osds | Fix daemon

Hi Oliver,

# ssh gedaopl02
# cephadm rm-daemon --name osd.0

should do the trick.

Be careful to remove the broken osd.0 (the one in error state on gedaopl02), not the healthy one running on gedaopl01 :-)
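
Just as a sketch (the exact flags vary a bit between cephadm versions, and you may also need to pass the cluster fsid, which is in your output), I would check what cephadm thinks is deployed on that host first and verify afterwards:

# ssh gedaopl02
# cephadm ls | grep osd.0
# cephadm rm-daemon --name osd.0 --fsid d0920c36-2368-11eb-a5de-005056b703af
# ceph orch ps

cephadm ls lists the daemons cephadm has deployed on that host, so you can confirm the stale osd.0 is really there before removing it, and a final ceph orch ps from a mon/mgr host should show the duplicate entry gone.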

Best,

Sebastian

On 11.03.21 at 22:10, Oliver Weinmann wrote:
> Hi,
> 
> On my 3-node Octopus 15.2.5 test cluster, which I haven't used for quite
> a while, I noticed that it shows some errors:
> 
> [root@gedasvl02 ~]# ceph health detail
> INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
> INFO:cephadm:Inferring config
> /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
> INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
> HEALTH_WARN 2 failed cephadm daemon(s)
> [WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
>     daemon osd.0 on gedaopl02 is in error state
>     daemon node-exporter.gedaopl01 on gedaopl01 is in error state
> 
> The error about osd.0 is strange, since osd.0 is actually up and running,
> but on a different node. I guess I failed to remove it correctly from node
> gedaopl02 and then added a new OSD on a different node, gedaopl01, and now
> there are duplicate OSD IDs for osd.0 and osd.2.
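> 
> (As a sanity check on where osd.0 really lives, the standard commands
> should tell, e.g.:
> 
> [root@gedasvl02 ~]# ceph osd tree
> [root@gedasvl02 ~]# ceph osd metadata 0 | grep hostname
> 
> ceph osd tree shows the CRUSH placement, and the metadata shows which host
> the running osd.0 registered from.)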
> 
> [root@gedasvl02 ~]# ceph orch ps
> INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
> INFO:cephadm:Inferring config
> /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
> INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
> NAME                         HOST       STATUS        REFRESHED  AGE  VERSION    IMAGE NAME                            IMAGE ID      CONTAINER ID
> alertmanager.gedasvl02       gedasvl02  running (6h)  7m ago     4M   0.20.0     docker.io/prom/alertmanager:v0.20.0   0881eb8f169f  5b80fb977a5f
> crash.gedaopl01              gedaopl01  stopped       7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  810cf432b6d6
> crash.gedaopl02              gedaopl02  running (5h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  34ab264fd5ed
> crash.gedaopl03              gedaopl03  running (2d)  7m ago     2d   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  233f30086d2d
> crash.gedasvl02              gedasvl02  running (6h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  ea3d3e7c4f58
> grafana.gedasvl02            gedasvl02  running (6h)  7m ago     4M   6.6.2      docker.io/ceph/ceph-grafana:6.6.2     a0dce381714a  5a94f3e41c32
> mds.cephfs.gedaopl01.zjuhem  gedaopl01  stopped       7m ago     3M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>
> mds.cephfs.gedasvl02.xsjtpi  gedasvl02  running (6h)  7m ago     3M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  26e7c8759d89
> mgr.gedaopl03.zilwbl         gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  e18b6f40871c
> mon.gedaopl03                gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  5afdf40e41ba
> mon.gedasvl02                gedasvl02  running (6h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  e83dfcd864aa
> node-exporter.gedaopl01      gedaopl01  error         7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  0fefcfcc9639
> node-exporter.gedaopl02      gedaopl02  running (5h)  7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  f459045b7e41
> node-exporter.gedaopl03      gedaopl03  running (2d)  7m ago     2d   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  3bd9f8dd6d5b
> node-exporter.gedasvl02      gedasvl02  running (6h)  7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  72e96963261e
> osd.0                        gedaopl01  running (5h)  7m ago     5h   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  ed76fafb1988
> osd.0                        gedaopl02  error         7m ago     4M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>
> osd.1                        gedaopl01  running (4h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  41a43733e601
> osd.2                        gedaopl01  stopped       7m ago     4M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>
> osd.2                        gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  ac9e660db2fb
> osd.3                        gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  bde17b5bb2fb
> osd.4                        gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  7cc3ef7c4469
> osd.5                        gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  761b96d235e4
> osd.6                        gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  d047b28fe2bd
> osd.7                        gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  3b54b01841f4
> osd.8                        gedaopl01  running (5h)  7m ago     5h   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  cdd308cdc82b
> prometheus.gedasvl02         gedasvl02  running (5h)  7m ago     4M   2.18.1     docker.io/prom/prometheus:v2.18.1     de242295e225  591cef3bbaa4
> 
> Is there a way to clean up / purge the stopped and errored ones?
> 
> I don't know what is wrong with the node-exporter, because the output of
> podman ps -a on gedaopl01 looks OK. Maybe it is also a zombie daemon?
> 
> [root@gedaopl01 ~]# podman ps -a
> CONTAINER ID  IMAGE                                 COMMAND               CREATED         STATUS              PORTS  NAMES
> e71898f7d038  docker.io/prom/node-exporter:v0.18.1  --no-collector.ti...  54 seconds ago  Up 54 seconds ago          ceph-d0920c36-2368-11eb-a5de-005056b703af-node-exporter.gedaopl01
> 41a43733e601  docker.io/ceph/ceph:v15               -n osd.1 -f --set...  5 hours ago     Up 5 hours ago             ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.1
> 810cf432b6d6  docker.io/ceph/ceph:v15               -n client.crash.g...  6 hours ago     Up 6 hours ago             ceph-d0920c36-2368-11eb-a5de-005056b703af-crash.gedaopl01
> cdd308cdc82b  docker.io/ceph/ceph:v15               -n osd.8 -f --set...  6 hours ago     Up 6 hours ago             ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.8
> ed76fafb1988  docker.io/ceph/ceph:v15               -n osd.0 -f --set...  6 hours ago     Up 6 hours ago             ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.0
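> 
> (If it is just a stale entry on the mgr side, I guess something like
> 
> [root@gedasvl02 ~]# ceph orch daemon restart node-exporter.gedaopl01
> 
> or, failing that, ceph orch daemon redeploy node-exporter.gedaopl01 might
> clear it, but I'm not sure.)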
> 
> I replaced the very old disks with some brand new Samsung PM883s and would
> like to upgrade to 15.2.9, but the upgrade guide recommends doing this
> only on a healthy cluster. :)
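> 
> (I assume the upgrade itself would then just be
> 
> [root@gedasvl02 ~]# ceph orch upgrade start --ceph-version 15.2.9
> 
> once the health warnings are cleared.)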
> 
> Cheers,
> 
> Oliver
> 
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

-- 
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg). Geschäftsführer: Felix Imendörffer


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
