Maybe run systemctl reset-failed <...> or even restart the node?
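A minimal sketch of that suggestion, assuming the unit name that appears in the log quoted below. The helper names leftover_pids and reset_and_start are made up for illustration and are not part of Ceph or cephadm. The first helper only parses journal text, so it is safe to run anywhere; the second one calls systemctl and should be run as root on the OSD node.

```shell
#!/bin/sh
# Unit name copied from the log quoted below; adjust it for another OSD.
UNIT="ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service"

# leftover_pids (hypothetical helper): read journal text on stdin and print
# the PID of every "left-over process" systemd reported, one per line,
# sorted and without duplicates. Line breaks are turned into spaces first,
# because the journal lines are wrapped in the mail.
leftover_pids() {
    tr '\n' ' ' |
        grep -oE 'left-over process [0-9]+' |
        awk '{ print $3 }' |
        sort -nu
}

# reset_and_start (hypothetical helper): clear the failed state of the unit,
# which also resets its restart counter, then try to start it again.
reset_and_start() {
    systemctl reset-failed "$UNIT" &&
        systemctl start "$UNIT"
}
```

To see which processes are still sitting in the control group, something like `journalctl -u "$UNIT" | leftover_pids` should work. Whether those left-over podman and bash processes need to be stopped before the unit can start is an assumption on my side; a node restart would clear them in any case.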
On 21/09/2023 14:26, Patrick Begou wrote:
Hi Igor,
the ceph-osd.2.log remains empty on the node where this osd is
located. This is what I get when manually restarting the osd.
[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# systemctl restart
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
Job for ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service failed
because a timeout was exceeded.
See "systemctl status
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service" and
"journalctl -xe" for details.
[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# journalctl -xe
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 5728 (podman) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 5882 (bash) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 5884 (podman) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 6031 (bash) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 6033 (podman) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 6185 (bash) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 6187 (podman) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 14627 (bash) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 14629 (podman) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 14776 (bash) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 14778 (podman) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 15169 (bash) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 15171 (podman) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 15646 (bash) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 15648 (podman) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 15792 (bash) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 15794 (podman) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 25561 (bash) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 25563 (podman) in control group while starting unit.
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
Patrick
On 21/09/2023 at 12:44, Igor Fedotov wrote:
Hi Patrick,
please share the osd restart log so we can investigate.
Thanks,
Igor
On 21/09/2023 13:41, Patrick Begou wrote:
Hi,
After a power outage on my test ceph cluster, 2 osds fail to
restart. The log file shows:
8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'.
Sep 21 11:55:02 mostha1 systemd[1]: Failed to start Ceph osd.2 for
250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Service
RestartSec=10s expired, scheduling restart.
Sep 21 11:55:12 mostha1 systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Scheduled
restart job, restart counter is at 2.
Sep 21 11:55:12 mostha1 systemd[1]: Stopped Ceph osd.2 for
250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 1858 (bash) in control group while starting unit.
Ignoring.
Sep 21 11:55:12 mostha1 systemd[1]: This usually indicates unclean
termination of a previous run, or service implementation deficiencies.
Sep 21 11:55:12 mostha1 systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 2815 (podman) in control group while starting
unit. Ignoring.
This is not critical, as it is a test cluster and it is currently
rebalancing onto the other osds, but I would like to know how to return
to HEALTH_OK status.
Smartctl shows the HDDs are OK.
So is there a way to recover the osds from this state? The version is
15.2.17 (I just moved from 15.2.13 to 15.2.17 yesterday and will try to
move to the latest version as soon as this problem is solved).
Thanks
Patrick
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx