Right, here are the docs for that workflow:
https://docs.ceph.com/en/latest/cephadm/mon/#mon-service
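In short, the documented approach is to let cephadm remove the broken daemon and deploy a fresh one that resyncs from the surviving mons. A rough sketch, untested and with the hostname/IP taken from your health output below, so double-check it against the page above:

   ceph orch apply mon --unmanaged
   ceph orch daemon rm mon.cube --force
   ceph orch daemon add mon cube.robeckert.us:192.168.2.142

Once the new mon has joined quorum you can hand placement back to cephadm, e.g. with ceph orch apply mon --placement="rhel1.robeckert.us,story,cube.robeckert.us" (adjust to the hostnames as cephadm knows them).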
On 29.04.21 at 13:13, Eugen Block wrote:
Hi,
Instead of copying the MON data over, did you also try redeploying the
MON container entirely so it gets a fresh start?
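(With the Octopus orchestrator CLI that would be something along the lines of

   ceph orch daemon redeploy mon.cube

or the full remove/re-add cycle from the docs linked above.)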
Quoting "Robert W. Eckert" <rob@xxxxxxxxxxxxxxx>:
Hi,
On a daily basis, one of my monitors goes down:
[root@cube ~]# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down, quorum
rhel1.robeckert.us,story
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
daemon mon.cube on cube.robeckert.us is in error state
[WRN] MON_DOWN: 1/3 mons down, quorum rhel1.robeckert.us,story
mon.cube (rank 2) addr
[v2:192.168.2.142:3300/0,v1:192.168.2.142:6789/0] is down (out of quorum)
[root@cube ~]# ceph --version
ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb)
octopus (stable)
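(For reference, the orchestrator's view of the daemon and its container logs can be pulled with, e.g.,

   ceph orch ps cube.robeckert.us     # from any host with an admin keyring
   cephadm logs --name mon.cube       # on the cube host itself

where mon.cube is inferred from the health output above.)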
I have a script that copies the mon data from another server; after that
the mon restarts and runs fine for a while.
It is always the same monitor, and when I look at the logs, the only
thing I really see is the cephadm log showing it go down:
2021-04-28 10:07:26,173 DEBUG Running command: /usr/bin/podman --version
2021-04-28 10:07:26,217 DEBUG /usr/bin/podman: stdout podman version 2.2.1
2021-04-28 10:07:26,222 DEBUG Running command: /usr/bin/podman inspect --format {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index .Config.Labels "io.ceph.version"}} ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867-osd.2
2021-04-28 10:07:26,326 DEBUG /usr/bin/podman: stdout fab17e5242eb4875e266df19ca89b596a2f2b1d470273a99ff71da2ae81eeb3c,docker.io/ceph/ceph:v15,5b724076c58f97872fc2f7701e8405ec809047d71528f79da452188daf2af72e,2021-04-26 17:13:15.54183375 -0400 EDT,
2021-04-28 10:07:26,328 DEBUG Running command: systemctl is-enabled ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx
2021-04-28 10:07:26,334 DEBUG systemctl: stdout enabled
2021-04-28 10:07:26,335 DEBUG Running command: systemctl is-active ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx
2021-04-28 10:07:26,340 DEBUG systemctl: stdout failed
2021-04-28 10:07:26,340 DEBUG Running command: /usr/bin/podman --version
2021-04-28 10:07:26,395 DEBUG /usr/bin/podman: stdout podman version 2.2.1
2021-04-28 10:07:26,402 DEBUG Running command: /usr/bin/podman inspect --format {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index .Config.Labels "io.ceph.version"}} ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867-mon.cube
2021-04-28 10:07:26,526 DEBUG /usr/bin/podman: stdout 04e7c673cbacf5160427b0c3eb2f0948b2f15d02c58bd1d9dd14f975a84cfc6f,docker.io/ceph/ceph:v15,5b724076c58f97872fc2f7701e8405ec809047d71528f79da452188daf2af72e,2021-04-28 08:54:57.614847512 -0400 EDT,
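Since systemctl is-active reports the unit as failed, the unit journal is probably the next place to look. Assuming the standard cephadm unit naming (the part after the @ is redacted above, so mon.cube is an assumption):

   systemctl status ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@mon.cube.service
   journalctl -u ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@mon.cube.service --since yesterday

That should show whether the mon process itself crashes or the container never starts.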
I don't know if it matters, but this server is an AMD 3600XT, while my
other two servers, which have had no issues, are Intel-based.
The root file system was originally on an SSD, and I switched to NVMe,
so I think I can rule out controller or drive issues. (I didn't see
anything in dmesg anyway.)
If someone could point me in the right direction for troubleshooting
this further, I would appreciate it.
Thanks,
Rob Eckert
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx