Okay, not much info on the mon failure. The other one at least seems to be a
simple port conflict. What does `sudo netstat -tulpn` give you on that host?
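
If netstat isn't available there, ss from iproute2 shows the same thing;
either way, filtering on the node-exporter port should tell you which
process is already holding 9100:

    sudo ss -tulpn | grep 9100

As for getting logs from "inside" the container: cephadm daemons log to the
host journal, so the journalctl route from my earlier mail (quoted below)
covers the mon too, but if you have the cephadm binary on the host you can
also ask it, or podman, directly. These use the fsid and container name
from the systemd log you pasted, so adjust if yours differ:

    cephadm logs --fsid 920740ee-cf2d-11ed-9097-08c0eb320eda --name mon.darkside1
    podman logs ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1

The podman variant only works while the container still exists, which may be
a short window given the crash loop.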

On Tue, Jul 25, 2023 at 12:00 PM Renata Callado Borges
<renato.callado@xxxxxxxxxxxx> wrote:

> Hi Adam!
>
> Thank you for your response, but I am still trying to figure out the
> issue. I am pretty sure the problem occurs "inside" the container, and
> I don't know how to get logs from there.
>
> Just in case, this is what systemd sees:
>
> Jul 25 12:36:32 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
> Jul 25 12:36:32 darkside1 systemd[1]: Starting Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda...
> Jul 25 12:36:33 darkside1 bash[52271]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1
> Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.32233695 -0300 -03 m=+0.131005321 container create 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
> Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.526241218 -0300 -03 m=+0.334909578 container init 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
> Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.556646854 -0300 -03 m=+0.365315225 container start 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
> Jul 25 12:36:33 darkside1 bash[52271]: 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54
> Jul 25 12:36:33 darkside1 systemd[1]: Started Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
> Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service: main process exited, code=exited, status=1/FAILURE
> Jul 25 12:36:43 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
> Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
> Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service holdoff time over, scheduling restart.
> Jul 25 12:36:53 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
> Jul 25 12:36:53 darkside1 systemd[1]: start request repeated too quickly for ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service
> Jul 25 12:36:53 darkside1 systemd[1]: Failed to start Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
> Jul 25 12:36:53 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
> Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
>
> Also, I get the following error every 10 minutes or so on "ceph -W
> cephadm --watch-debug":
>
> 2023-07-25T12:35:38.115146-0300 mgr.darkside3.ujjyun [INF] Deploying daemon node-exporter.darkside1 on darkside1
> 2023-07-25T12:35:38.612569-0300 mgr.darkside3.ujjyun [ERR] cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
> Verifying port 9100 ...
> Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
> ERROR: TCP Port(s) '9100' required for node-exporter already in use
> Traceback (most recent call last):
>   File "/usr/share/ceph/mgr/cephadm/module.py", line 1029, in _remote_connection
>     yield (conn, connr)
>   File "/usr/share/ceph/mgr/cephadm/module.py", line 1185, in _run_cephadm
>     code, '\n'.join(err)))
> orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
> Verifying port 9100 ...
> Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
> ERROR: TCP Port(s) '9100' required for node-exporter already in use
>
> And finally I get this error on the first line of output for my "ceph
> mon dump":
>
> 2023-07-25T12:46:17.008-0300 7f145f59e700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
>
> Cordially,
>
> Renata.
>
> On 7/24/23 10:57, Adam King wrote:
> > The logs you probably really want to look at here are the journal
> > logs from the mgr and mon. If you have a copy of the cephadm tool on
> > the host, you can do a "cephadm ls --no-detail | grep systemd" to
> > list out the systemd unit names for the ceph daemons on the host, or
> > just find the systemd unit names in the standard way you would for
> > any other systemd unit (e.g. "systemctl -l | grep mgr" will probably
> > include the mgr one) and then take a look at "journalctl -eu
> > <systemd-unit-name>" for both the mgr and the mon. I'd expect near
> > the end of the log it would include a reason for going down.
> >
> > As for the debug_ms (I think that's what you want over "debug mon")
> > stuff, I think that would need to be a command line option for the
> > mgr/mon process. For cephadm deployments, the systemd unit is run
> > through a "unit.run" file in
> > /var/lib/ceph/<cluster-fsid>/<daemon-name>/unit.run. If you go to
> > the very end of that file, which will be a very long podman or
> > docker run command, add "--debug_ms 20" to it and then restart the
> > systemd unit for that daemon; that should cause the extra debug
> > logging to happen from that daemon. I would say first check if there
> > are useful errors in the journal logs mentioned above before trying
> > that though.
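> >
> > To make that concrete (guessing the daemon name as mon.darkside1
> > from the host name; check what "cephadm ls" reports), the sequence
> > would look something like this:
> >
> >     # append --debug_ms 20 to the long podman/docker run command on
> >     # the last line of:
> >     #     /var/lib/ceph/<cluster-fsid>/mon.darkside1/unit.run
> >     # then clear the failed state and restart the unit:
> >     systemctl reset-failed ceph-<cluster-fsid>@mon.darkside1.service
> >     systemctl restart ceph-<cluster-fsid>@mon.darkside1.service
> >
> > The reset-failed step is the same one you described having to do by
> > hand; without it systemd may refuse the start request.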
> >
> > On Mon, Jul 24, 2023 at 9:47 AM Renata Callado Borges
> > <renato.callado@xxxxxxxxxxxx> wrote:
> >
> >     Dear all,
> >
> >     How are you?
> >
> >     I have a cluster on Pacific with 3 hosts, each one with 1 mon,
> >     1 mgr and 12 OSDs.
> >
> >     One of the hosts, darkside1, has been out of quorum according
> >     to ceph status.
> >
> >     Systemd showed 4 services dead, two mons and two mgrs.
> >
> >     I managed to systemctl restart one mon and one mgr, but even
> >     after several attempts, the remaining mon and mgr services,
> >     when asked to restart, keep returning to a failed state after
> >     a few seconds. They try to auto-restart and then go into a
> >     failed state where systemd requires me to manually set them to
> >     "reset-failed" before trying to start again. But they never
> >     stay up. There are no clear messages about the issue in
> >     /var/log/ceph/cephadm.log.
> >
> >     The host is still out of quorum.
> >
> >     I have failed to "turn on debug" as per
> >     https://docs.ceph.com/en/pacific/rados/troubleshooting/log-and-debug/.
> >     It seems I do not know the proper incantation for "ceph daemon
> >     X config show"; no string for X seems to satisfy this command.
> >     I have tried adding this:
> >
> >         [mon]
> >         debug mon = 20
> >
> >     to my ceph.conf, but no additional lines of log are sent to
> >     /var/log/cephadm.log, so I'm sorry I can't provide more
> >     details.
> >
> >     Could someone help me debug this situation? I am sure that if
> >     I just reboot the machine, it will start up the services
> >     properly, as it always has done, but I would prefer to fix
> >     this without this action.
> >
> >     Cordially,
> >
> >     Renata.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx