Hi Adam!
I guess you only want the output for port 9100?

[root@darkside1]# ss -tulpn | grep 9100
tcp    LISTEN   0   128   [::]:9100   [::]:*   users:(("node_exporter",pid=9103,fd=3))

Also, this:

[root@darkside1 ~]# ps aux | grep 9103
nfsnobo+   9103 38.4  0.0 152332 105760 ?   Ssl  10:12  82:35 /bin/node_exporter --no-collector.timex
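
If it helps, I believe something like this would show which systemd unit owns that process (untested on my side):

systemctl status 9103       # systemctl should resolve a PID to the unit that contains it
cat /proc/9103/cgroup       # or read the owning cgroup directly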
Cordially,
Renata.
On 7/25/23 13:22, Adam King wrote:
Okay, not much info on the mon failure. The other one at least seems
to be a simple port conflict. What does `sudo netstat -tulpn` give you
on that host?
On Tue, Jul 25, 2023 at 12:00 PM Renata Callado Borges
<renato.callado@xxxxxxxxxxxx> wrote:
Hi Adam!
Thank you for your response, but I am still trying to figure out the
issue. I am pretty sure the problem occurs "inside" the container, and
I don't know how to get logs from there.
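
I am guessing that something like the following would pull the daemon's logs, but I have not confirmed it:

cephadm logs --fsid 920740ee-cf2d-11ed-9097-08c0eb320eda --name mon.darkside1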
Just in case, this is what systemd sees:
Jul 25 12:36:32 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:32 darkside1 systemd[1]: Starting Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda...
Jul 25 12:36:33 darkside1 bash[52271]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.32233695 -0300 -03 m=+0.131005321 container create 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.526241218 -0300 -03 m=+0.334909578 container init 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.556646854 -0300 -03 m=+0.365315225 container start 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 bash[52271]: 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54
Jul 25 12:36:33 darkside1 systemd[1]: Started Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service: main process exited, code=exited, status=1/FAILURE
Jul 25 12:36:43 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service holdoff time over, scheduling restart.
Jul 25 12:36:53 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:53 darkside1 systemd[1]: start request repeated too quickly for ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service
Jul 25 12:36:53 darkside1 systemd[1]: Failed to start Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:53 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
Also, I get the following error every 10 minutes or so on "ceph -W cephadm --watch-debug":

2023-07-25T12:35:38.115146-0300 mgr.darkside3.ujjyun [INF] Deploying daemon node-exporter.darkside1 on darkside1
2023-07-25T12:35:38.612569-0300 mgr.darkside3.ujjyun [ERR] cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
Verifying port 9100 ...
Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
ERROR: TCP Port(s) '9100' required for node-exporter already in use
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1029, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1185, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
Verifying port 9100 ...
Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
ERROR: TCP Port(s) '9100' required for node-exporter already in use
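
I am guessing the fix is to stop whatever is holding port 9100 and let cephadm redeploy the daemon, roughly like the sketch below, but I will wait for your advice (the unit name is a placeholder):

ss -tulpn | grep 9100                      # find the PID holding the port
systemctl stop <unit-owning-that-pid>      # placeholder: whatever unit owns it
ceph orch daemon redeploy node-exporter.darkside1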
And finally I get this error on the first line of output for my "ceph mon dump":

2023-07-25T12:46:17.008-0300 7f145f59e700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
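
If it is relevant, I assume the auth methods the cluster expects can be checked with something like:

ceph config get mon auth_service_required
ceph config get mon auth_client_required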
Cordially,
Renata.
On 7/24/23 10:57, Adam King wrote:
> The logs you probably really want to look at here are the journal
> logs from the mgr and mon. If you have a copy of the cephadm tool on
> the host, you can do a "cephadm ls --no-detail | grep systemd" to
> list out the systemd unit names for the ceph daemons on the host, or
> just find the systemd unit names in the standard way you would for
> any other systemd unit (e.g. "systemctl -l | grep mgr" will probably
> include the mgr one), and then take a look at "journalctl -eu
> <systemd-unit-name>" for both the mgr and the mon. I'd expect that
> near the end of the log it would include a reason for going down.
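>
> For example, something like this (the fsid and daemon names are
> placeholders; substitute your own):
>
> cephadm ls --no-detail | grep systemd
> journalctl -eu ceph-<fsid>@mon.<hostname>.service
> journalctl -eu ceph-<fsid>@mgr.<hostname>.<id>.service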
>
> As for the debug_ms stuff (I think that's what you want over "debug
> mon"), that would need to be a command line option for the mgr/mon
> process. For cephadm deployments, the systemd unit is run through a
> "unit.run" file in /var/lib/ceph/<cluster-fsid>/<daemon-name>/unit.run.
> If you go to the very end of that file, which will be a very long
> podman or docker run command, add in "--debug_ms 20", and then
> restart the systemd unit for that daemon, it should cause the extra
> debug logging to happen from that daemon. I would say first check
> whether there are useful errors in the journal logs mentioned above
> before trying that, though.
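>
> Just as an illustration, heavily abbreviated (the real line is much
> longer, and the paths/names here are placeholders):
>
> # tail of /var/lib/ceph/<cluster-fsid>/mon.<hostname>/unit.run
> /usr/bin/podman run ... quay.io/ceph/ceph:v15 -n mon.<hostname> -f ... --debug_ms 20
>
> # then restart the unit so it picks up the flag:
> systemctl restart ceph-<cluster-fsid>@mon.<hostname>.service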
>
> On Mon, Jul 24, 2023 at 9:47 AM Renata Callado Borges
> <renato.callado@xxxxxxxxxxxx> wrote:
>
> Dear all,
>
>
> How are you?
>
> I have a cluster on Pacific with 3 hosts, each one with 1 mon, 1 mgr,
> and 12 OSDs.
>
> One of the hosts, darkside1, has been out of quorum according to
> "ceph status".
>
> Systemd showed 4 services dead: two mons and two mgrs.
>
> I managed to systemctl restart one mon and one mgr, but even after
> several attempts, the remaining mon and mgr services, when asked to
> restart, keep returning to a failed state after a few seconds. They
> try to auto-restart and then go into a failed state where systemd
> requires me to manually set them to "reset-failed" before trying to
> start again. But they never stay up. There are no clear messages
> about the issue in /var/log/ceph/cephadm.log.
>
> The host is still out of quorum.
>
>
> I have failed to "turn on debug" as per
> https://docs.ceph.com/en/pacific/rados/troubleshooting/log-and-debug/.
>
> It seems I do not know the proper incantation for "ceph daemon X
> config show"; no string for X seems to satisfy this command. I have
> tried adding this:
>
> [mon]
>
> debug mon = 20
>
>
> to my ceph.conf, but no additional lines of log are sent to
> /var/log/cephadm.log.
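>
> Would the proper incantation be something like this, given that the
> daemon runs inside a container? (Just guessing:)
>
> cephadm enter --name mon.darkside1
> ceph daemon mon.darkside1 config show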
>
>
> So I'm sorry I can't provide more details.
>
>
> Could someone help me debug this situation? I am sure that if I just
> reboot the machine, it will start up the services properly, as it
> always has done, but I would prefer to fix this without that action.
>
>
> Cordially,
>
> Renata.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx