Re: Unhealthy Cluster | Remove / Purge duplicate osds | Fix daemon

Hi Sebastian,



thanks, that seems to have worked, at least on one of the two nodes. But now I have another problem: it seems that all mgr daemons are gone and the ceph command hangs.



[root@gedasvl02 ~]# cephadm ls | grep mgr


I tried to deploy a new mgr, but this doesn't seem to work either:



[root@gedasvl02 ~]# cephadm ls | grep mgr
[root@gedasvl02 ~]# cephadm deploy --fsid d0920c36-2368-11eb-a5de-005056b703af --name mgr.gedaopl03
INFO:cephadm:Deploy daemon mgr.gedaopl03 ...



At least I can't see a mgr container on node gedaopl03:



[root@gedaopl03 ~]# podman ps
CONTAINER ID  IMAGE                                 COMMAND               CREATED     STATUS         PORTS  NAMES
63518d95201b  docker.io/prom/node-exporter:v0.18.1  --no-collector.ti...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-node-exporter.gedaopl03
aa9b57fd77b8  docker.io/ceph/ceph:v15               -n client.crash.g...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-crash.gedaopl03
8b02715f9cb4  docker.io/ceph/ceph:v15               -n osd.2 -f --set...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.2
40f15a6357fe  docker.io/ceph/ceph:v15               -n osd.7 -f --set...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.7
bda260378239  docker.io/ceph/ceph:v15               -n mds.cephfs.ged...  3 days ago  Up 3 days ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-mds.cephfs.gedaopl03.kybzgy
[root@gedaopl03 ~]# systemctl --failed
  UNIT                                                                      LOAD   ACTIVE SUB    DESCRIPTION
● ceph-d0920c36-2368-11eb-a5de-005056b703af@crash.gedaopl03.service         loaded failed failed Ceph crash.gedaopl03 for d0920c36-2368-11eb-a5de-005056b703af
● ceph-d0920c36-2368-11eb-a5de-005056b703af@mon.gedaopl03.service           loaded failed failed Ceph mon.gedaopl03 for d0920c36-2368-11eb-a5de-005056b703af
● ceph-d0920c36-2368-11eb-a5de-005056b703af@node-exporter.gedaopl03.service loaded failed failed Ceph node-exporter.gedaopl03 for d0920c36-2368-11eb-a5de-005056b703af
● ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.3.service                   loaded failed failed Ceph osd.3 for d0920c36-2368-11eb-a5de-005056b703af

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

4 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
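
For what it's worth, this is roughly how I plan to dig into it (just a sketch; the keyring path and the auth step are assumptions on my side, not something I have verified on this cluster):

# Why did the units fail? (unit names taken from the systemctl --failed output above)
[root@gedaopl03 ~]# journalctl -u ceph-d0920c36-2368-11eb-a5de-005056b703af@mon.gedaopl03.service -n 50
[root@gedaopl03 ~]# cephadm logs --fsid d0920c36-2368-11eb-a5de-005056b703af --name mon.gedaopl03

# A manual mgr deploy usually also gets a config and a keyring passed in, e.g.
# (the keyring would first have to be created with 'ceph auth get-or-create mgr.gedaopl03 ...'):
[root@gedaopl03 ~]# cephadm deploy --fsid d0920c36-2368-11eb-a5de-005056b703af --name mgr.gedaopl03 \
    --config /etc/ceph/ceph.conf --keyring /etc/ceph/ceph.mgr.gedaopl03.keyring

# Clear the failed state once the underlying problem is fixed
[root@gedaopl03 ~]# systemctl reset-failed 'ceph-d0920c36-2368-11eb-a5de-005056b703af@*'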



Maybe it's best to just scrap the whole cluster. It is only for testing, but I guess it is also good practice for recovery. :)
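
If I do scrap it, my understanding is that something along these lines would wipe the cephadm cluster from each host (a sketch, obviously destructive; /dev/sdX is just a placeholder for the OSD disks):

# Remove all daemons and data for this fsid from a host
[root@gedaopl03 ~]# cephadm rm-cluster --fsid d0920c36-2368-11eb-a5de-005056b703af --force

# The OSD disks would still need to be zapped before reuse
[root@gedaopl03 ~]# cephadm ceph-volume -- lvm zap --destroy /dev/sdX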


On 12 March 2021 at 12:35, Sebastian Wagner <swagner@xxxxxxxx> wrote:


Hi Oliver,

# ssh gedaopl02
# cephadm rm-daemon osd.0

should do the trick.

Be careful to remove the broken OSD :-)
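
Spelled out with the flags cephadm expects, that would be roughly (a sketch; fsid taken from this cluster, and there is also a --force flag that I would treat carefully):

# ssh gedaopl02
# cephadm rm-daemon --name osd.0 --fsid d0920c36-2368-11eb-a5de-005056b703af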

Best,

Sebastian

On 11 March 2021 at 22:10, Oliver Weinmann wrote:

Hi,


On my 3-node Octopus 15.2.5 test cluster, which I haven't used for quite
a while, I noticed that it shows some errors:


[root@gedasvl02 ~]# ceph health detail
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config
/var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
HEALTH_WARN 2 failed cephadm daemon(s)
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon osd.0 on gedaopl02 is in error state
    daemon node-exporter.gedaopl01 on gedaopl01 is in error state


The error about osd.0 is strange, since osd.0 is actually up and
running, just on a different node. I guess I failed to remove it correctly
from node gedaopl02 before adding a new OSD on a different node,
gedaopl01, and now there are duplicate OSD IDs for osd.0 and osd.2.
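
I guess I could confirm which host still carries the stale entry by listing the local daemons with cephadm on each node, something like (not tried yet):

[root@gedaopl01 ~]# cephadm ls | grep '"name": "osd.0"'
[root@gedaopl02 ~]# cephadm ls | grep '"name": "osd.0"'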


[root@gedasvl02 ~]# ceph orch ps
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config
/var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
NAME                         HOST       STATUS        REFRESHED  AGE  VERSION    IMAGE NAME                            IMAGE ID      CONTAINER ID
alertmanager.gedasvl02       gedasvl02  running (6h)  7m ago     4M   0.20.0     docker.io/prom/alertmanager:v0.20.0   0881eb8f169f  5b80fb977a5f
crash.gedaopl01              gedaopl01  stopped       7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  810cf432b6d6
crash.gedaopl02              gedaopl02  running (5h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  34ab264fd5ed
crash.gedaopl03              gedaopl03  running (2d)  7m ago     2d   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  233f30086d2d
crash.gedasvl02              gedasvl02  running (6h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  ea3d3e7c4f58
grafana.gedasvl02            gedasvl02  running (6h)  7m ago     4M   6.6.2      docker.io/ceph/ceph-grafana:6.6.2     a0dce381714a  5a94f3e41c32
mds.cephfs.gedaopl01.zjuhem  gedaopl01  stopped       7m ago     3M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>
mds.cephfs.gedasvl02.xsjtpi  gedasvl02  running (6h)  7m ago     3M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  26e7c8759d89
mgr.gedaopl03.zilwbl         gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  e18b6f40871c
mon.gedaopl03                gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  5afdf40e41ba
mon.gedasvl02                gedasvl02  running (6h)  7m ago     4M   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  e83dfcd864aa
node-exporter.gedaopl01      gedaopl01  error         7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  0fefcfcc9639
node-exporter.gedaopl02      gedaopl02  running (5h)  7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  f459045b7e41
node-exporter.gedaopl03      gedaopl03  running (2d)  7m ago     2d   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  3bd9f8dd6d5b
node-exporter.gedasvl02      gedasvl02  running (6h)  7m ago     4M   0.18.1     docker.io/prom/node-exporter:v0.18.1  e5a616e4b9cf  72e96963261e
*osd.0                       gedaopl01  running (5h)  7m ago     5h   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  ed76fafb1988*
*osd.0                       gedaopl02  error         7m ago     4M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>*
osd.1                        gedaopl01  running (4h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  41a43733e601
*osd.2                       gedaopl01  stopped       7m ago     4M   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>*
*osd.2                       gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  ac9e660db2fb*
osd.3                        gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  bde17b5bb2fb
osd.4                        gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  7cc3ef7c4469
osd.5                        gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  761b96d235e4
osd.6                        gedaopl02  running (5h)  7m ago     3d   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  d047b28fe2bd
osd.7                        gedaopl03  running (7h)  7m ago     7h   15.2.9     docker.io/ceph/ceph:v15               dfc483079636  3b54b01841f4
osd.8                        gedaopl01  running (5h)  7m ago     5h   15.2.5     docker.io/ceph/ceph:v15               4405f6339e35  cdd308cdc82b
prometheus.gedasvl02         gedasvl02  running (5h)  7m ago     4M   2.18.1     docker.io/prom/prometheus:v2.18.1     de242295e225  591cef3bbaa4


Is there a way to clean / purge the stopped and error ones?
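
I would guess something along these lines (a sketch, not tried yet):

# Uniquely named leftovers could go through the orchestrator, e.g.:
[root@gedasvl02 ~]# ceph orch daemon rm mds.cephfs.gedaopl01.zjuhem --force
# osd.0 and osd.2 exist twice, so removing them by name alone looks ambiguous;
# the stale copy probably has to be removed with cephadm rm-daemon on the host that holds it.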


I don't know what is wrong with the node-exporter, because the output of
podman ps -a on gedaopl01 looks OK. Maybe it is also a zombie daemon?


[root@gedaopl01 ~]# podman ps -a
CONTAINER ID  IMAGE                                 COMMAND               CREATED         STATUS             PORTS  NAMES
e71898f7d038  docker.io/prom/node-exporter:v0.18.1  --no-collector.ti...  54 seconds ago  Up 54 seconds ago         ceph-d0920c36-2368-11eb-a5de-005056b703af-node-exporter.gedaopl01
41a43733e601  docker.io/ceph/ceph:v15               -n osd.1 -f --set...  5 hours ago     Up 5 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.1
810cf432b6d6  docker.io/ceph/ceph:v15               -n client.crash.g...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-crash.gedaopl01
cdd308cdc82b  docker.io/ceph/ceph:v15               -n osd.8 -f --set...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.8
ed76fafb1988  docker.io/ceph/ceph:v15               -n osd.0 -f --set...  6 hours ago     Up 6 hours ago            ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.0
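
If it is just stale state, I would probably check the systemd unit on the host and then let the orchestrator restart or redeploy the daemon, roughly (not tried yet):

[root@gedaopl01 ~]# systemctl status ceph-d0920c36-2368-11eb-a5de-005056b703af@node-exporter.gedaopl01.service
[root@gedasvl02 ~]# ceph orch daemon restart node-exporter.gedaopl01
[root@gedasvl02 ~]# ceph orch daemon redeploy node-exporter.gedaopl01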


I replaced the very old disks with some brand-new Samsung PM883 drives and
would like to upgrade to 15.2.9, but the upgrade guide recommends doing
this only on a healthy cluster. :)
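
For reference, my understanding is that the upgrade itself would then be kicked off roughly like this once the cluster is healthy (a sketch):

[root@gedasvl02 ~]# ceph orch upgrade start --ceph-version 15.2.9
# and followed with:
[root@gedasvl02 ~]# ceph orch upgrade status
[root@gedasvl02 ~]# ceph -W cephadm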


Cheers,


Oliver




--
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg). Geschäftsführer: Felix Imendörffer

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



