Hi Adam!
Thank you for your response, but I am still trying to figure out the
issue. I am pretty sure the problem occurs "inside" the container, and I
don't know how to get logs from there.
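From what I could find in the docs, something like this might show the
daemon's own output; I am guessing at the exact commands, so please
correct me if they are wrong:

    # journald entries for the daemon, via cephadm
    cephadm logs --name mon.darkside1
    # or directly from podman, using the container name from the log below
    podman logs ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1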
Just in case, this is what systemd sees:
Jul 25 12:36:32 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:32 darkside1 systemd[1]: Starting Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda...
Jul 25 12:36:33 darkside1 bash[52271]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.32233695 -0300 -03 m=+0.131005321 container create 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.526241218 -0300 -03 m=+0.334909578 container init 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.556646854 -0300 -03 m=+0.365315225 container start 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 bash[52271]: 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54
Jul 25 12:36:33 darkside1 systemd[1]: Started Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service: main process exited, code=exited, status=1/FAILURE
Jul 25 12:36:43 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service holdoff time over, scheduling restart.
Jul 25 12:36:53 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:53 darkside1 systemd[1]: start request repeated too quickly for ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service
Jul 25 12:36:53 darkside1 systemd[1]: Failed to start Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:53 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
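For completeness, this is what I run to clear the "start request
repeated too quickly" state before each retry:

    systemctl reset-failed ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service
    systemctl start ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service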
Also, I get the following error every 10 minutes or so from "ceph -W cephadm --watch-debug":
2023-07-25T12:35:38.115146-0300 mgr.darkside3.ujjyun [INF] Deploying daemon node-exporter.darkside1 on darkside1
2023-07-25T12:35:38.612569-0300 mgr.darkside3.ujjyun [ERR] cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
Verifying port 9100 ...
Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
ERROR: TCP Port(s) '9100' required for node-exporter already in use
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1029, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1185, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
Verifying port 9100 ...
Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
ERROR: TCP Port(s) '9100' required for node-exporter already in use
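I have not yet checked what is already listening on port 9100; my plan,
assuming ss is available on the host, is something like:

    # show which process is holding node-exporter's port
    ss -tlnp | grep 9100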
And finally, I get this error on the first line of output from "ceph mon dump":
2023-07-25T12:46:17.008-0300 7f145f59e700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
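I have not knowingly changed any auth settings. In case it is relevant,
I was going to look at these options (names taken from the docs, so I
may be off):

    ceph config get mon auth_service_required
    ceph config get mon auth_cluster_required
    ceph config get mon auth_client_required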
Cordially,
Renata.
On 7/24/23 10:57, Adam King wrote:
The logs you probably really want to look at here are the journal logs
from the mgr and mon. If you have a copy of the cephadm tool on the
host, you can do a "cephadm ls --no-detail | grep systemd" to list out
the systemd unit names for the ceph daemons on the host, or just find
the systemd unit names in the standard way you would for any
other systemd unit (e.g. "systemctl -l | grep mgr" will probably
include the mgr one) and then take a look at "journalctl -eu
<systemd-unit-name>" for the systemd unit for both the mgr and the
mon. I'd expect near the end of the log it would include a reason for
going down.
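For example, with placeholders for your cluster fsid and daemon names,
something along these lines:

    cephadm ls --no-detail | grep systemd
    # then for each daemon, e.g. the mon:
    journalctl -eu ceph-<cluster-fsid>@mon.<hostname>.service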
As for the debug_ms (I think that's what you want over "debug mon")
stuff, I think that would need to be a command line option for the
mgr/mon process. For cephadm deployments, the systemd unit is run
through a "unit.run" file in
/var/lib/ceph/<cluster-fsid>/<daemon-name>/unit.run. If you go to the
very end of that file (a very long podman or docker run command), add
"--debug_ms 20", and then restart the systemd unit for that daemon, it
should emit the extra debug logging from that daemon. I would say
first check if there are useful errors
in the journal logs mentioned above before trying that though.
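Roughly, with placeholders again, the change would be:

    # append to the end of the long run command on the last line of
    # /var/lib/ceph/<cluster-fsid>/<daemon-name>/unit.run:
    #     --debug_ms 20
    # then restart the unit:
    systemctl restart ceph-<cluster-fsid>@<daemon-name>.service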
On Mon, Jul 24, 2023 at 9:47 AM Renata Callado Borges
<renato.callado@xxxxxxxxxxxx> wrote:
Dear all,
How are you?
I have a cluster on Pacific with 3 hosts, each one with 1 mon, 1 mgr
and 12 OSDs.
One of the hosts, darkside1, has been out of quorum according to ceph
status.
Systemd showed 4 services dead, two mons and two mgrs.
I managed to systemctl restart one mon and one mgr, but even after
several attempts, the remaining mon and mgr services, when asked to
restart, keep returning to a failed state after a few seconds. They
try to auto-restart and then go into a failed state where systemd
requires me to manually set them to "reset-failed" before trying to
start again. But they never stay up. There are no clear messages about
the issue in /var/log/ceph/cephadm.log.
The host is still out of quorum.
I have failed to "turn on debug" as per
https://docs.ceph.com/en/pacific/rados/troubleshooting/log-and-debug/.
It seems I do not know the proper incantation for "ceph daemon X
config show"; no string for X seems to satisfy this command. I have
tried adding this:
[mon]
debug mon = 20
to my ceph.conf, but no additional log lines are sent to
/var/log/ceph/cephadm.log, so I'm sorry I can't provide more details.
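If it matters, from the docs I understand "ceph daemon" needs the
daemon's admin socket, so perhaps it has to be run inside the
container, maybe something like this (I am guessing at the syntax):

    cephadm enter --name mon.darkside1
    ceph daemon mon.darkside1 config show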
Could someone help me debug this situation? I am sure that if I just
reboot the machine, it will start up the services properly, as it
always has done, but I would prefer to fix this without that action.
Cordially,
Renata.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx