Re: Failing to restart mon and mgr daemons on Pacific

Hi Adam!


Thank you for your response, but I am still trying to figure out the issue. I am pretty sure the problem occurs "inside" the container, and I don't know how to get logs from there.

Just in case, this is what systemd sees:


Jul 25 12:36:32 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:32 darkside1 systemd[1]: Starting Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda...
Jul 25 12:36:33 darkside1 bash[52271]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.32233695 -0300 -03 m=+0.131005321 container create 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.526241218 -0300 -03 m=+0.334909578 container init 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.556646854 -0300 -03 m=+0.365315225 container start 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 bash[52271]: 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54
Jul 25 12:36:33 darkside1 systemd[1]: Started Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service: main process exited, code=exited, status=1/FAILURE
Jul 25 12:36:43 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service holdoff time over, scheduling restart.
Jul 25 12:36:53 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:53 darkside1 systemd[1]: start request repeated too quickly for ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service
Jul 25 12:36:53 darkside1 systemd[1]: Failed to start Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:53 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
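Is something like the following the right way to pull logs from inside the container? This is only a sketch of what I have gathered from the docs (daemon name and fsid taken from the journal output above; it assumes cephadm and podman are on the PATH):

```shell
# Sketch only - daemon name and fsid are the ones from the journal above.
FSID=920740ee-cf2d-11ed-9097-08c0eb320eda

# cephadm wraps journalctl for a named daemon:
cephadm logs --fsid "$FSID" --name mon.darkside1

# or ask podman directly about the (exited) mon container:
podman ps -a --filter name="ceph-$FSID-mon.darkside1"
podman logs --tail 50 "ceph-$FSID-mon.darkside1"
```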


Also, I get the following error every 10 minutes or so on "ceph -W cephadm --watch-debug":


2023-07-25T12:35:38.115146-0300 mgr.darkside3.ujjyun [INF] Deploying daemon node-exporter.darkside1 on darkside1
2023-07-25T12:35:38.612569-0300 mgr.darkside3.ujjyun [ERR] cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
Verifying port 9100 ...
Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
ERROR: TCP Port(s) '9100' required for node-exporter already in use
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1029, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1185, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
Verifying port 9100 ...
Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
ERROR: TCP Port(s) '9100' required for node-exporter already in use
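The node-exporter part at least looks straightforward: something on darkside1 already holds TCP port 9100. I assume something like this would show what it is (hypothetical commands; ss is from iproute2, and a stale node-exporter container seems a likely culprit):

```shell
# who is listening on TCP 9100? (root needed to see the process name)
ss -tlnp 'sport = :9100'

# check for a leftover node-exporter container still holding the port:
podman ps -a | grep node-exporter
```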

And finally I get this error on the first line of output for my "ceph mon dump":

2023-07-25T12:46:17.008-0300 7f145f59e700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]


Cordially,

Renata.

On 7/24/23 10:57, Adam King wrote:
The logs you probably really want to look at here are the journal logs from the mgr and mon. If you have a copy of the cephadm tool on the host, you can do a "cephadm ls --no-detail | grep systemd" to list the systemd unit names for the ceph daemons on the host, or just find the systemd unit names the standard way you would for any other systemd unit (e.g. "systemctl -l | grep mgr" will probably include the mgr one), and then take a look at "journalctl -eu <systemd-unit-name>" for both the mgr and the mon. I'd expect the end of the log to include a reason for going down.

As for the debug_ms (I think that's what you want over "debug mon") stuff, that would need to be a command line option for the mgr/mon process. For cephadm deployments, the systemd unit runs through a "unit.run" file in /var/lib/ceph/<cluster-fsid>/<daemon-name>/unit.run. If you go to the very end of that file, which will be a very long podman or docker run command, add in "--debug_ms 20", and then restart the systemd unit for that daemon, it should cause that daemon to emit the extra debug logging. I would say first check whether there are useful errors in the journal logs mentioned above before trying that, though.
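Roughly, as a sketch (back the file up first; the fsid and daemon name here are the ones from your logs, so double-check them on your host):

```shell
# unit.run lives under /var/lib/ceph/<cluster-fsid>/<daemon-name>/
F=/var/lib/ceph/920740ee-cf2d-11ed-9097-08c0eb320eda/mon.darkside1/unit.run
cp "$F" "$F.bak"

# append the flag to the long podman/docker run command on the last line
sed -i '$ s/$/ --debug_ms 20/' "$F"

systemctl restart ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service
```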

On Mon, Jul 24, 2023 at 9:47 AM Renata Callado Borges <renato.callado@xxxxxxxxxxxx> wrote:

    Dear all,


    How are you?

    I have a cluster on Pacific with 3 hosts, each one with 1 mon, 1 mgr
    and 12 OSDs.

    One of the hosts, darkside1, has been out of quorum according to ceph
    status.

    Systemd showed 4 services dead, two mons and two mgrs.

    I managed to systemctl restart one mon and one mgr, but even after
    several attempts, the remaining mon and mgr services, when asked to
    restart, keep returning to a failed state after a few seconds. They
    try to auto-restart and then go into a failed state where systemd
    requires me to manually set them to "reset-failed" before trying to
    start again. But they never stay up. There are no clear messages
    about the issue in /var/log/ceph/cephadm.log.

    The host is still out of quorum.


    I have failed to "turn on debug" as per
    https://docs.ceph.com/en/pacific/rados/troubleshooting/log-and-debug/.

    It seems I do not know the proper incantation for "ceph daemon X
    config show"; no string for X seems to satisfy this command. I have
    tried adding this:

    [mon]

          debug mon = 20


    To my ceph.conf, but no additional lines of log are sent to
    /var/log/cephadm.log, so I'm sorry I can't provide more details.
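    (If I understand the docs right, on a cephadm cluster the admin
    socket command has to run inside the container, and debug settings
    can go through the config database instead of ceph.conf. A sketch of
    what I believe the commands would be:)

```shell
# runtime debug level via the centralized config (no daemon restart needed)
ceph config set mon debug_mon 20/20

# the "ceph daemon X config show" form, run from inside the mon container:
cephadm enter --name mon.darkside1 -- ceph daemon mon.darkside1 config show
```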


    Could someone help me debug this situation? I am sure that if I just
    reboot the machine, it will start up the services properly, as it
    always has done, but I would prefer to fix this without rebooting.


    Cordially,

    Renata.
    _______________________________________________
    ceph-users mailing list -- ceph-users@xxxxxxx
    To unsubscribe send an email to ceph-users-leave@xxxxxxx




