Re: Failing to restart mon and mgr daemons on Pacific

Adam King <adking@xxxxxxxxxx> · Mon, 24 Jul 2023 09:57:53 -0400

The logs you probably want to look at here are the journal logs from
the mgr and mon. If you have a copy of the cephadm tool on the host, you
can run "cephadm ls --no-detail | grep systemd" to list the systemd
unit names for the ceph daemons on the host, or just find the systemd
unit names the way you would for any other systemd unit (e.g.
"systemctl -l | grep mgr" will probably include the mgr one). Then take
a look at "journalctl -eu <systemd-unit-name>" for both the mgr and the
mon units. I'd expect the end of the log to include a reason for the
daemon going down.
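
As a rough sketch of the steps above, something like the following could
collect the recent journal output for every mon and mgr unit on the host.
The function names are made up for illustration, and the unit name pattern
(ceph-<fsid>@mon.<id>.service, ceph-<fsid>@mgr.<id>.service) is an
assumption about how cephadm names its units, so check it against what
"systemctl -l" actually shows on your host.

```shell
# Hypothetical helper: reads "systemctl list-units" style output on stdin
# and prints only the unit names that look like cephadm mon/mgr units.
# Assumption: units are named ceph-<fsid>@<type>.<id>.service.
ceph_mon_mgr_units() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^ceph-.*@(mon|mgr)\..*\.service$/) print $i
    }'
}

# Hypothetical helper: print the last 50 journal lines for each mon/mgr unit.
show_daemon_journals() {
    systemctl list-units --all --no-legend --plain 'ceph-*' |
        ceph_mon_mgr_units |
        while read -r unit; do
            echo "==== $unit ===="
            journalctl -u "$unit" -n 50 --no-pager
        done
}
```

Running "show_daemon_journals" as root should then print the tail of each
journal, which is where I'd expect the reason for the failure to show up.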

As for the debug_ms setting (I think that's what you want over "debug mon"),
I think it would need to be passed as a command line option to the mgr/mon
process. For cephadm deployments, the systemd unit runs the daemon through a
"unit.run" file at /var/lib/ceph/<cluster-fsid>/<daemon-name>/unit.run. The
very end of that file is a very long podman or docker run command. If you
append "--debug_ms 20" to that command and then restart the systemd unit
for that daemon, it should cause the extra debug logging to happen from
that daemon. I would first check whether there are useful errors in the
journal logs mentioned above before trying that, though.
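
If you do end up needing it, the unit.run edit could be scripted roughly
like this. This is only a sketch: "add_debug_ms" is a made-up helper name,
it assumes GNU sed, and it assumes the last line of unit.run really is the
podman/docker run command, so look at the file before and after. It keeps a
backup copy next to the original and does nothing if the option is already
there.

```shell
# Hypothetical helper: append "--debug_ms <level>" to the last line of a
# cephadm unit.run file. Assumes the last line is the container run command.
add_debug_ms() {
    unit_run=$1
    level=${2:-20}

    # Do nothing if the option has already been added.
    if grep -q -- '--debug_ms' "$unit_run"; then
        return 0
    fi

    # Keep a backup so the change is easy to revert.
    cp -p "$unit_run" "$unit_run.bak"

    # "$" addresses the last line; the substitution appends at end of line.
    sed -i "\$ s/\$/ --debug_ms $level/" "$unit_run"
}
```

You would call it with the full path to the daemon's unit.run file and then
restart that daemon's systemd unit. Restoring the .bak copy and restarting
again should put things back the way they were.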

On Mon, Jul 24, 2023 at 9:47 AM Renata Callado Borges <
renato.callado@xxxxxxxxxxxx> wrote:

> Dear all,
>
>
> How are you?
>
> I have a cluster on Pacific with 3 hosts, each one with 1 mon,  1 mgr
> and 12 OSDs.
>
> One of the hosts, darkside1, has been out of quorum according to ceph
> status.
>
> Systemd showed 4 services dead, two mons and two mgrs.
>
> I managed to systemctl restart one mon and one mgr, but even after
> several attempts, the remaining mon and mgr services, when asked to
> restart, keep returning to a failed state after a few seconds. They try
> to auto-restart and then go into a failed state where systemd requires
> me to manually set them to "reset-failed" before trying to start again.
> But they never stay up. There are no clear messages about the issue in
> /var/log/ceph/cephadm.log.
>
> The host is still out of quorum.
>
>
> I have failed to "turn on debug" as per
> https://docs.ceph.com/en/pacific/rados/troubleshooting/log-and-debug/.
> It seems I do not know the proper incantation for "ceph daemon X config
> show", no string for X seems to satisfy this command. I have tried
> adding this:
>
> [mon]
>
>       debug mon = 20
>
>
> To my ceph.conf, but no additional lines of log are sent to
> /var/log/cephadm.log, so I'm sorry I can't provide more details.
>
>
> Could someone help me debug this situation? I am sure that if I just
> reboot the machine, it will start up the services properly, as it always
> has done, but I would prefer to fix this without this action.
>
>
> Cordially,
>
> Renata.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>