Hi,
instead of copying the MON data onto this node, have you also tried
redeploying the MON container entirely so it gets a fresh start?
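
In case it helps, a redeploy could look roughly like the sketch below. It
assumes the cluster is managed by the cephadm orchestrator (which the
CEPHADM_FAILED_DAEMON warning suggests) and that the daemon is named
mon.cube as in the health output. The run helper and the DRY_RUN and
MON_DAEMON variables are my own illustration, not part of Ceph; by
default the script only prints the commands so they can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch: redeploy a failed MON through the cephadm orchestrator.
# Assumption: cephadm-managed cluster, daemon name mon.cube.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 runs them.
DRY_RUN="${DRY_RUN:-1}"
MON_DAEMON="${MON_DAEMON:-mon.cube}"

# Hypothetical helper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

redeploy_mon() {
    # Recreate the daemon's container and unit files.
    run ceph orch daemon redeploy "$1"
    # Check what the orchestrator reports for its daemons afterwards.
    run ceph orch ps
    # Confirm whether the MON has rejoined the quorum.
    run ceph health detail
}

redeploy_mon "$MON_DAEMON"
```

Run it once as-is to see the commands, then with DRY_RUN=0 from a node
that has an admin keyring.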
Quoting "Robert W. Eckert" <rob@xxxxxxxxxxxxxxx>:
Hi,
On a daily basis, one of my monitors goes down:
[root@cube ~]# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down, quorum
rhel1.robeckert.us,story
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
daemon mon.cube on cube.robeckert.us is in error state
[WRN] MON_DOWN: 1/3 mons down, quorum rhel1.robeckert.us,story
mon.cube (rank 2) addr
[v2:192.168.2.142:3300/0,v1:192.168.2.142:6789/0] is down (out of
quorum)
[root@cube ~]# ceph --version
ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb)
octopus (stable)
I have a script that copies the mon data from another server; after
that, the monitor restarts and runs well for a while.
It is always the same monitor, and when I look at the logs, the only
thing I really see is the cephadm log showing it down:
2021-04-28 10:07:26,173 DEBUG Running command: /usr/bin/podman --version
2021-04-28 10:07:26,217 DEBUG /usr/bin/podman: stdout podman version 2.2.1
2021-04-28 10:07:26,222 DEBUG Running command: /usr/bin/podman
inspect --format
{{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index
.Config.Labels "io.ceph.version"}}
ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867-osd.2
2021-04-28 10:07:26,326 DEBUG /usr/bin/podman: stdout
fab17e5242eb4875e266df19ca89b596a2f2b1d470273a99ff71da2ae81eeb3c,docker.io/ceph/ceph:v15,5b724076c58f97872fc2f7701e8405ec809047d71528f79da452188daf2af72e,2021-04-26 17:13:15.54183375 -0400
EDT,
2021-04-28 10:07:26,328 DEBUG Running command: systemctl is-enabled
ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx
2021-04-28 10:07:26,334 DEBUG systemctl: stdout enabled
2021-04-28 10:07:26,335 DEBUG Running command: systemctl is-active
ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx
2021-04-28 10:07:26,340 DEBUG systemctl: stdout failed
2021-04-28 10:07:26,340 DEBUG Running command: /usr/bin/podman --version
2021-04-28 10:07:26,395 DEBUG /usr/bin/podman: stdout podman version 2.2.1
2021-04-28 10:07:26,402 DEBUG Running command: /usr/bin/podman
inspect --format
{{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index
.Config.Labels "io.ceph.version"}}
ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867-mon.cube
2021-04-28 10:07:26,526 DEBUG /usr/bin/podman: stdout
04e7c673cbacf5160427b0c3eb2f0948b2f15d02c58bd1d9dd14f975a84cfc6f,docker.io/ceph/ceph:v15,5b724076c58f97872fc2f7701e8405ec809047d71528f79da452188daf2af72e,2021-04-28 08:54:57.614847512 -0400
EDT,
I don't know if it matters, but this server has an AMD 3600XT, while
my other two servers, which have had no issues, are Intel-based.
The root file system was originally on an SSD, and I switched to
NVMe, so I have ruled out controller or drive issues. (I didn't see
anything in dmesg anyway.)
If someone could point me in the right direction on where to
troubleshoot next, I would appreciate it.
Thanks,
Rob Eckert
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx