Re: one of 3 monitors keeps going down

Eugen Block <eblock@xxxxxx> · Thu, 29 Apr 2021 11:13:57 +0000

Hi,

instead of copying the MON data to this node, have you also tried
redeploying the MON container entirely so it gets a fresh start?
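
For reference, a redeploy under cephadm could look roughly like the sketch below. It assumes the daemon name mon.cube from the health output and that the mon service is managed by the orchestrator. The run wrapper and the DRY_RUN variable are hypothetical helpers, there only so that the commands are printed rather than executed until they have been checked against the cluster.

```shell
#!/bin/sh
# Sketch only: DRY_RUN defaults to 1, so commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Redeploy a single monitor daemon through the orchestrator.
redeploy_mon() {
    run ceph orch daemon redeploy "mon.$1"
}

# Harder reset: remove the daemon, then list daemons to see what comes back.
# Assumption: the mon service spec still places a mon on that host, so the
# orchestrator is expected to deploy a new one on its own.
recreate_mon() {
    run ceph orch daemon rm "mon.$1" --force
    run ceph orch ps
}

redeploy_mon cube
```

With DRY_RUN=0 the same functions run the commands for real. Only do that while the other two MONs are in quorum.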

Quoting "Robert W. Eckert" <rob@xxxxxxxxxxxxxxx>:

Hi,
On a daily basis, one of my monitors goes down:

[root@cube ~]# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down, quorum rhel1.robeckert.us,story
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon mon.cube on cube.robeckert.us is in error state
[WRN] MON_DOWN: 1/3 mons down, quorum rhel1.robeckert.us,story
    mon.cube (rank 2) addr [v2:192.168.2.142:3300/0,v1:192.168.2.142:6789/0] is down (out of quorum)
[root@cube ~]# ceph --version
ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)

I have a script that copies the mon data from another server; after
that, the monitor restarts and runs well for a while.

It is always the same monitor, and when I look at the logs, the only
thing I really see is the cephadm log showing it down:

2021-04-28 10:07:26,173 DEBUG Running command: /usr/bin/podman --version
2021-04-28 10:07:26,217 DEBUG /usr/bin/podman: stdout podman version 2.2.1
2021-04-28 10:07:26,222 DEBUG Running command: /usr/bin/podman inspect --format {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index .Config.Labels "io.ceph.version"}} ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867-osd.2
2021-04-28 10:07:26,326 DEBUG /usr/bin/podman: stdout fab17e5242eb4875e266df19ca89b596a2f2b1d470273a99ff71da2ae81eeb3c,docker.io/ceph/ceph:v15,5b724076c58f97872fc2f7701e8405ec809047d71528f79da452188daf2af72e,2021-04-26 17:13:15.54183375 -0400 EDT,
2021-04-28 10:07:26,328 DEBUG Running command: systemctl is-enabled ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx
2021-04-28 10:07:26,334 DEBUG systemctl: stdout enabled
2021-04-28 10:07:26,335 DEBUG Running command: systemctl is-active ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx
2021-04-28 10:07:26,340 DEBUG systemctl: stdout failed
2021-04-28 10:07:26,340 DEBUG Running command: /usr/bin/podman --version
2021-04-28 10:07:26,395 DEBUG /usr/bin/podman: stdout podman version 2.2.1
2021-04-28 10:07:26,402 DEBUG Running command: /usr/bin/podman inspect --format {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index .Config.Labels "io.ceph.version"}} ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867-mon.cube
2021-04-28 10:07:26,526 DEBUG /usr/bin/podman: stdout 04e7c673cbacf5160427b0c3eb2f0948b2f15d02c58bd1d9dd14f975a84cfc6f,docker.io/ceph/ceph:v15,5b724076c58f97872fc2f7701e8405ec809047d71528f79da452188daf2af72e,2021-04-28 08:54:57.614847512 -0400 EDT,
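
The cephadm log above only records that the systemd unit is reported as failed; the reason for the failure would normally be in the monitor's own journal. A minimal sketch for pulling it follows. It assumes the cluster fsid seen in the container names and the daemon name mon.cube. The mon_log_cmd helper and the DRY_RUN variable are hypothetical, and the --since value is only an example.

```shell
#!/bin/sh
# Sketch only: DRY_RUN defaults to 1, so the command is printed, not executed.
# Cluster fsid as it appears in the container names in the log above.
FSID="fe3a7cb0-69ca-11eb-8d45-c86000d08867"

# Build the cephadm command that shows the journal of one monitor daemon.
# $1 = monitor id, $2 = value passed to journalctl --since
mon_log_cmd() {
    echo "cephadm logs --fsid $FSID --name mon.$1 -- --no-pager --since '$2'"
}

cmd=$(mon_log_cmd cube "2021-04-28 08:54")
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $cmd"
else
    eval "$cmd"
fi
```

Everything after the bare -- is handed to journalctl, so the usual journalctl filters can be used to narrow the output to the time just before the monitor stopped.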

I don't know if it matters, but this server is an AMD 3600XT, while
my other two servers, which have had no issues, are Intel-based.

The root file system was originally on an SSD, and I switched to
NVMe, so I have ruled out controller or drive issues. (I didn't see
anything in dmesg anyway.)

If someone could point me in the right direction on where to  
troubleshoot next, I would appreciate it.

Thanks,
Rob Eckert
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
