Hi Adam!
I guess you only want the output for port 9100?

[root@darkside1]# ss -tulpn | grep 9100
tcp    LISTEN   0   128   [::]:9100   [::]:*   users:(("node_exporter",pid=9103,fd=3))

Also, this:

[root@darkside1 ~]# ps aux | grep 9103
nfsnobo+   9103 38.4  0.0 152332 105760 ?   Ssl  10:12  82:35 /bin/node_exporter --no-collector.timex
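
If it helps, I believe something like this would show which systemd unit owns that process (untested on my side):

systemctl status 9103       # systemctl should resolve a PID to the unit that contains it
cat /proc/9103/cgroup       # or read the owning cgroup directly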
Cordially,
Renata.
On 7/25/23 13:22, Adam King wrote:
Okay, not much info on the mon failure. The other one at least seems
to be a simple port conflict. What does `sudo netstat -tulpn` give you
on that host?
On Tue, Jul 25, 2023 at 12:00 PM Renata Callado Borges
<renato.callado@xxxxxxxxxxxx> wrote:
Hi Adam!
Thank you for your response, but I am still trying to figure out the
issue. I am pretty sure the problem occurs "inside" the container, and
I don't know how to get logs from there.
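
I am guessing that something like the following would pull the daemon's logs, but I have not confirmed it:

cephadm logs --fsid 920740ee-cf2d-11ed-9097-08c0eb320eda --name mon.darkside1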
Just in case, this is what systemd sees:
Jul 25 12:36:32 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:32 darkside1 systemd[1]: Starting Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda...
Jul 25 12:36:33 darkside1 bash[52271]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.32233695 -0300 -03 m=+0.131005321 container create 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.526241218 -0300 -03 m=+0.334909578 container init 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.556646854 -0300 -03 m=+0.365315225 container start 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
Jul 25 12:36:33 darkside1 bash[52271]: 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54
Jul 25 12:36:33 darkside1 systemd[1]: Started Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service: main process exited, code=exited, status=1/FAILURE
Jul 25 12:36:43 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service holdoff time over, scheduling restart.
Jul 25 12:36:53 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:53 darkside1 systemd[1]: start request repeated too quickly for ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service
Jul 25 12:36:53 darkside1 systemd[1]: Failed to start Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
Jul 25 12:36:53 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
Also, I get the following error every 10 minutes or so on "ceph -W cephadm --watch-debug":

2023-07-25T12:35:38.115146-0300 mgr.darkside3.ujjyun [INF] Deploying daemon node-exporter.darkside1 on darkside1
2023-07-25T12:35:38.612569-0300 mgr.darkside3.ujjyun [ERR] cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
Verifying port 9100 ...
Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
ERROR: TCP Port(s) '9100' required for node-exporter already in use
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1029, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1185, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
Verifying port 9100 ...
Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
ERROR: TCP Port(s) '9100' required for node-exporter already in use
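
I am guessing the fix is to stop whatever is holding port 9100 and let cephadm redeploy the daemon, roughly like the sketch below, but I will wait for your advice (the unit name is a placeholder):

ss -tulpn | grep 9100                      # find the PID holding the port
systemctl stop <unit-owning-that-pid>      # placeholder: whatever unit owns it
ceph orch daemon redeploy node-exporter.darkside1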
And finally I get this error on the first line of output for my "ceph mon dump":

2023-07-25T12:46:17.008-0300 7f145f59e700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
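
If it is relevant, I assume the auth methods the cluster expects can be checked with something like:

ceph config get mon auth_service_required
ceph config get mon auth_client_required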
Cordially,
Renata.
On 7/24/23 10:57, Adam King wrote:
> The logs you probably really want to look at here are the journal
> logs from the mgr and mon. If you have a copy of the cephadm tool on
> the host, you can do a "cephadm ls --no-detail | grep systemd" to
> list out the systemd unit names for the ceph daemons on the host, or
> just find the systemd unit names in the standard way you would for
> any other systemd unit (e.g. "systemctl -l | grep mgr" will probably
> include the mgr one), and then take a look at "journalctl -eu
> <systemd-unit-name>" for both the mgr and the mon. I'd expect that
> near the end of the log it would include a reason for going down.
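>
> For example, something like this (the fsid and daemon names are
> placeholders; substitute your own):
>
> cephadm ls --no-detail | grep systemd
> journalctl -eu ceph-<fsid>@mon.<hostname>.service
> journalctl -eu ceph-<fsid>@mgr.<hostname>.<id>.service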
>
> As for the debug_ms stuff (I think that's what you want over "debug
> mon"), that would need to be a command line option for the mgr/mon
> process. For cephadm deployments, the systemd unit is run through a
> "unit.run" file in /var/lib/ceph/<cluster-fsid>/<daemon-name>/unit.run.
> If you go to the very end of that file, which will be a very long
> podman or docker run command, add in "--debug_ms 20", and then
> restart the systemd unit for that daemon, it should cause the extra
> debug logging to happen from that daemon. I would say first check
> whether there are useful errors in the journal logs mentioned above
> before trying that, though.
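>
> Just as an illustration, heavily abbreviated (the real line is much
> longer, and the paths/names here are placeholders):
>
> # tail of /var/lib/ceph/<cluster-fsid>/mon.<hostname>/unit.run
> /usr/bin/podman run ... quay.io/ceph/ceph:v15 -n mon.<hostname> -f ... --debug_ms 20
>
> # then restart the unit so it picks up the flag:
> systemctl restart ceph-<cluster-fsid>@mon.<hostname>.service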
>
> On Mon, Jul 24, 2023 at 9:47 AM Renata Callado Borges
> <renato.callado@xxxxxxxxxxxx> wrote:
>
> Dear all,
>
>
> How are you?
>
> I have a cluster on Pacific with 3 hosts, each one with 1 mon, 1 mgr,
> and 12 OSDs.
>
> One of the hosts, darkside1, has been out of quorum according to
> "ceph status".
>
> Systemd showed 4 services dead: two mons and two mgrs.
>
> I managed to systemctl restart one mon and one mgr, but even after
> several attempts, the remaining mon and mgr services, when asked to
> restart, keep returning to a failed state after a few seconds. They
> try to auto-restart and then go into a failed state where systemd
> requires me to manually set them to "reset-failed" before trying to
> start again. But they never stay up. There are no clear messages
> about the issue in /var/log/ceph/cephadm.log.
>
> The host is still out of quorum.
>
>
> I have failed to "turn on debug" as per
> https://docs.ceph.com/en/pacific/rados/troubleshooting/log-and-debug/.
>
> It seems I do not know the proper incantation for "ceph daemon X
> config show"; no string for X seems to satisfy this command. I have
> tried adding this:
>
> [mon]
>
> debug mon = 20
>
>
> to my ceph.conf, but no additional lines of log are sent to
> /var/log/cephadm.log.
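>
> Would the proper incantation be something like this, given that the
> daemon runs inside a container? (Just guessing:)
>
> cephadm enter --name mon.darkside1
> ceph daemon mon.darkside1 config show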
>
>
> So I'm sorry I can't provide more details.
>
>
> Could someone help me debug this situation? I am sure that if I just
> reboot the machine, it will start up the services properly, as it
> always has done, but I would prefer to fix this without that action.
>
>
> Cordially,
>
> Renata.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx