Hi Adam!
Thank you for your response, but I am still trying to figure out the
issue. I am pretty sure the problem occurs "inside" the container, and I
don't know how to get logs from there.
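From what I could find in the docs, something like this might show the
daemon's own output; I am guessing at the exact commands, so please
correct me if they are wrong:

    # journald entries for the daemon, via cephadm
    cephadm logs --name mon.darkside1
    # or directly from podman, using the container name from the log below
    podman logs ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1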
Just in case, this is what systemd sees:
Jul 25 12:36:32 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:32 darkside1 systemd[1]: Starting Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda...
Jul 25 12:36:33 darkside1 bash[52271]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.32233695 -0300 -03 m=+0.131005321 container create 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.526241218 -0300 -03 m=+0.334909578 container init 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.556646854 -0300 -03 m=+0.365315225 container start 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 bash[52271]: 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54
Jul 25 12:36:33 darkside1 systemd[1]: Started Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service: main process exited, code=exited, status=1/FAILURE
Jul 25 12:36:43 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service holdoff time over, scheduling restart.
Jul 25 12:36:53 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:53 darkside1 systemd[1]: start request repeated too quickly for ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service
Jul 25 12:36:53 darkside1 systemd[1]: Failed to start Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:53 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
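For completeness, this is what I run to clear the "start request
repeated too quickly" state before each retry:

    systemctl reset-failed ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service
    systemctl start ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service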
Also, I get the following error every 10 minutes or so from "ceph -W cephadm --watch-debug":
2023-07-25T12:35:38.115146-0300 mgr.darkside3.ujjyun [INF] Deploying daemon node-exporter.darkside1 on darkside1
2023-07-25T12:35:38.612569-0300 mgr.darkside3.ujjyun [ERR] cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
Verifying port 9100 ...
Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
ERROR: TCP Port(s) '9100' required for node-exporter already in use
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1029, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1185, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
Verifying port 9100 ...
Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
ERROR: TCP Port(s) '9100' required for node-exporter already in use
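I have not yet checked what is already listening on port 9100; my plan,
assuming ss is available on the host, is something like:

    # show which process is holding node-exporter's port
    ss -tlnp | grep 9100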
And finally, I get this error on the first line of output from "ceph mon dump":
2023-07-25T12:46:17.008-0300 7f145f59e700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
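I have not knowingly changed any auth settings. In case it is relevant,
I was going to look at these options (names taken from the docs, so I
may be off):

    ceph config get mon auth_service_required
    ceph config get mon auth_cluster_required
    ceph config get mon auth_client_required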
Cordially,
Renata.
On 7/24/23 10:57, Adam King wrote:
The logs you probably really want to look at here are the journal logs
from the mgr and mon. If you have a copy of the cephadm tool on the
host, you can do a "cephadm ls --no-detail | grep systemd" to list out
the systemd unit names for the ceph daemons on the host, or just find
the systemd unit names in the standard way you would for any
other systemd unit (e.g. "systemctl -l | grep mgr" will probably
include the mgr one) and then take a look at "journalctl -eu
<systemd-unit-name>" for the systemd unit for both the mgr and the
mon. I'd expect near the end of the log it would include a reason for
going down.
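For example, with placeholders for your cluster fsid and daemon names,
something along these lines:

    cephadm ls --no-detail | grep systemd
    # then for each daemon, e.g. the mon:
    journalctl -eu ceph-<cluster-fsid>@mon.<hostname>.service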
As for the debug_ms (I think that's what you want over "debug mon")
stuff, I think that would need to be a command line option for the
mgr/mon process. For cephadm deployments, the systemd unit is run
through a "unit.run" file in
/var/lib/ceph/<cluster-fsid>/<daemon-name>/unit.run. If you go to the
very end of that file (a very long podman or docker run command), add
"--debug_ms 20", and then restart the systemd unit for that daemon, it
should emit the extra debug logging from that daemon. I would say
first check if there are useful errors
in the journal logs mentioned above before trying that though.
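Roughly, with placeholders again, the change would be:

    # append to the end of the long run command on the last line of
    # /var/lib/ceph/<cluster-fsid>/<daemon-name>/unit.run:
    #     --debug_ms 20
    # then restart the unit:
    systemctl restart ceph-<cluster-fsid>@<daemon-name>.service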
On Mon, Jul 24, 2023 at 9:47 AM Renata Callado Borges
<renato.callado@xxxxxxxxxxxx> wrote:
Dear all,
How are you?
I have a cluster on Pacific with 3 hosts, each one with 1 mon, 1 mgr
and 12 OSDs.
One of the hosts, darkside1, has been out of quorum according to ceph
status.
Systemd showed 4 services dead, two mons and two mgrs.
I managed to systemctl restart one mon and one mgr, but even after
several attempts, the remaining mon and mgr services, when asked to
restart, keep returning to a failed state after a few seconds. They
try to auto-restart and then go into a failed state where systemd
requires me to manually set them to "reset-failed" before trying to
start again. But they never stay up. There are no clear messages about
the issue in /var/log/ceph/cephadm.log.
The host is still out of quorum.
I have failed to "turn on debug" as per
https://docs.ceph.com/en/pacific/rados/troubleshooting/log-and-debug/.
It seems I do not know the proper incantation for "ceph daemon X
config show"; no string for X seems to satisfy this command. I have
tried adding this:
[mon]
debug mon = 20
to my ceph.conf, but no additional log lines are sent to
/var/log/ceph/cephadm.log, so I'm sorry I can't provide more details.
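If it matters, from the docs I understand "ceph daemon" needs the
daemon's admin socket, so perhaps it has to be run inside the
container, maybe something like this (I am guessing at the syntax):

    cephadm enter --name mon.darkside1
    ceph daemon mon.darkside1 config show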
Could someone help me debug this situation? I am sure that if I just
reboot the machine, it will start up the services properly, as it
always has done, but I would prefer to fix this without that action.
Cordially,
Renata.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx