Okay, not much info on the mon failure. The other one at least seems to be a
simple port conflict. What does `sudo netstat -tulpn` give you on that host?
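
If netstat isn't available there, ss from iproute2 shows the same thing;
either way, filtering on the node-exporter port should tell you which
process is already holding 9100:

    sudo ss -tulpn | grep 9100

As for getting logs from "inside" the container: cephadm daemons log to the
host journal, so the journalctl route from my earlier mail (quoted below)
covers the mon too, but if you have the cephadm binary on the host you can
also ask it, or podman, directly. These use the fsid and container name
from the systemd log you pasted, so adjust if yours differ:

    cephadm logs --fsid 920740ee-cf2d-11ed-9097-08c0eb320eda --name mon.darkside1
    podman logs ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1

The podman variant only works while the container still exists, which may be
a short window given the crash loop.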

On Tue, Jul 25, 2023 at 12:00 PM Renata Callado Borges
<renato.callado@xxxxxxxxxxxx> wrote:

> Hi Adam!
>
> Thank you for your response, but I am still trying to figure out the
> issue. I am pretty sure the problem occurs "inside" the container, and
> I don't know how to get logs from there.
>
> Just in case, this is what systemd sees:
>
> Jul 25 12:36:32 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
> Jul 25 12:36:32 darkside1 systemd[1]: Starting Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda...
> Jul 25 12:36:33 darkside1 bash[52271]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1
> Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.32233695 -0300 -03 m=+0.131005321 container create 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
> Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.526241218 -0300 -03 m=+0.334909578 container init 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
> Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.556646854 -0300 -03 m=+0.365315225 container start 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54 (image=quay.io/ceph/ceph:v15, name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
> Jul 25 12:36:33 darkside1 bash[52271]: 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54
> Jul 25 12:36:33 darkside1 systemd[1]: Started Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
> Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service: main process exited, code=exited, status=1/FAILURE
> Jul 25 12:36:43 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
> Jul 25 12:36:43 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
> Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service holdoff time over, scheduling restart.
> Jul 25 12:36:53 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
> Jul 25 12:36:53 darkside1 systemd[1]: start request repeated too quickly for ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service
> Jul 25 12:36:53 darkside1 systemd[1]: Failed to start Ceph mon.darkside1 for 920740ee-cf2d-11ed-9097-08c0eb320eda.
> Jul 25 12:36:53 darkside1 systemd[1]: Unit ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered failed state.
> Jul 25 12:36:53 darkside1 systemd[1]: ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
>
> Also, I get the following error every 10 minutes or so on "ceph -W
> cephadm --watch-debug":
>
> 2023-07-25T12:35:38.115146-0300 mgr.darkside3.ujjyun [INF] Deploying daemon node-exporter.darkside1 on darkside1
> 2023-07-25T12:35:38.612569-0300 mgr.darkside3.ujjyun [ERR] cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
> Verifying port 9100 ...
> Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
> ERROR: TCP Port(s) '9100' required for node-exporter already in use
> Traceback (most recent call last):
>   File "/usr/share/ceph/mgr/cephadm/module.py", line 1029, in _remote_connection
>     yield (conn, connr)
>   File "/usr/share/ceph/mgr/cephadm/module.py", line 1185, in _run_cephadm
>     code, '\n'.join(err)))
> orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
> Verifying port 9100 ...
> Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
> ERROR: TCP Port(s) '9100' required for node-exporter already in use
>
> And finally I get this error on the first line of output for my "ceph
> mon dump":
>
> 2023-07-25T12:46:17.008-0300 7f145f59e700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
>
> Cordially,
>
> Renata.
>
> On 7/24/23 10:57, Adam King wrote:
> > The logs you probably really want to look at here are the journal
> > logs from the mgr and mon. If you have a copy of the cephadm tool on
> > the host, you can do a "cephadm ls --no-detail | grep systemd" to
> > list out the systemd unit names for the ceph daemons on the host, or
> > just find the systemd unit names in the standard way you would for
> > any other systemd unit (e.g. "systemctl -l | grep mgr" will probably
> > include the mgr one) and then take a look at "journalctl -eu
> > <systemd-unit-name>" for both the mgr and the mon. I'd expect near
> > the end of the log it would include a reason for going down.
> >
> > As for the debug_ms (I think that's what you want over "debug mon")
> > stuff, I think that would need to be a command line option for the
> > mgr/mon process. For cephadm deployments, the systemd unit is run
> > through a "unit.run" file in
> > /var/lib/ceph/<cluster-fsid>/<daemon-name>/unit.run. If you go to
> > the very end of that file, which will be a very long podman or
> > docker run command, add "--debug_ms 20" to it and then restart the
> > systemd unit for that daemon; that should cause the extra debug
> > logging to happen from that daemon. I would say first check if there
> > are useful errors in the journal logs mentioned above before trying
> > that though.
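> >
> > To make that concrete (guessing the daemon name as mon.darkside1
> > from the host name; check what "cephadm ls" reports), the sequence
> > would look something like this:
> >
> >     # append --debug_ms 20 to the long podman/docker run command on
> >     # the last line of:
> >     #     /var/lib/ceph/<cluster-fsid>/mon.darkside1/unit.run
> >     # then clear the failed state and restart the unit:
> >     systemctl reset-failed ceph-<cluster-fsid>@mon.darkside1.service
> >     systemctl restart ceph-<cluster-fsid>@mon.darkside1.service
> >
> > The reset-failed step is the same one you described having to do by
> > hand; without it systemd may refuse the start request.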
> >
> > On Mon, Jul 24, 2023 at 9:47 AM Renata Callado Borges
> > <renato.callado@xxxxxxxxxxxx> wrote:
> >
> >     Dear all,
> >
> >     How are you?
> >
> >     I have a cluster on Pacific with 3 hosts, each one with 1 mon,
> >     1 mgr and 12 OSDs.
> >
> >     One of the hosts, darkside1, has been out of quorum according
> >     to ceph status.
> >
> >     Systemd showed 4 services dead, two mons and two mgrs.
> >
> >     I managed to systemctl restart one mon and one mgr, but even
> >     after several attempts, the remaining mon and mgr services,
> >     when asked to restart, keep returning to a failed state after
> >     a few seconds. They try to auto-restart and then go into a
> >     failed state where systemd requires me to manually set them to
> >     "reset-failed" before trying to start again. But they never
> >     stay up. There are no clear messages about the issue in
> >     /var/log/ceph/cephadm.log.
> >
> >     The host is still out of quorum.
> >
> >     I have failed to "turn on debug" as per
> >     https://docs.ceph.com/en/pacific/rados/troubleshooting/log-and-debug/.
> >     It seems I do not know the proper incantation for "ceph daemon
> >     X config show"; no string for X seems to satisfy this command.
> >     I have tried adding this:
> >
> >         [mon]
> >         debug mon = 20
> >
> >     to my ceph.conf, but no additional lines of log are sent to
> >     /var/log/cephadm.log, so I'm sorry I can't provide more
> >     details.
> >
> >     Could someone help me debug this situation? I am sure that if
> >     I just reboot the machine, it will start up the services
> >     properly, as it always has done, but I would prefer to fix
> >     this without this action.
> >
> >     Cordially,
> >
> >     Renata.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx