Hi Patrick,
It seems your disk or controller is damaged. Are other disks connected
to the same controller working OK? If so, I'd say the disk is dead.
Cheers
On 21/9/23 at 16:17, Patrick Begou wrote:
Hi Igor,
A "systemctl reset-failed" doesn't restart the OSD.
I rebooted the node and now it shows some errors on the HDD:
[ 107.716769] ata3.00: exception Emask 0x0 SAct 0x800000 SErr 0x0 action 0x0
[ 107.716782] ata3.00: irq_stat 0x40000008
[ 107.716787] ata3.00: failed command: READ FPDMA QUEUED
[ 107.716791] ata3.00: cmd 60/00:b8:00:a8:08/08:00:0e:00:00/40 tag 23 ncq dma 1048576 in
                       res 41/40:00:c2:ad:08/00:00:0e:00:00/40 Emask 0x409 (media error) <F>
[ 107.716802] ata3.00: status: { DRDY ERR }
[ 107.716806] ata3.00: error: { UNC }
[ 107.728547] ata3.00: configured for UDMA/133
[ 107.728575] sd 2:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s
[ 107.728581] sd 2:0:0:0: [sda] tag#23 Sense Key : Medium Error [current]
[ 107.728585] sd 2:0:0:0: [sda] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed
[ 107.728590] sd 2:0:0:0: [sda] tag#23 CDB: Read(10) 28 00 0e 08 a8 00 00 08 00 00
[ 107.728592] I/O error, dev sda, sector 235449794 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 2
[ 107.728623] ata3: EH complete
[ 109.203256] ata3.00: exception Emask 0x0 SAct 0x20000000 SErr 0x0 action 0x0
[ 109.203268] ata3.00: irq_stat 0x40000008
[ 109.203274] ata3.00: failed command: READ FPDMA QUEUED
[ 109.203277] ata3.00: cmd 60/08:e8:48:ad:08/00:00:0e:00:00/40 tag 29 ncq dma 4096 in
                       res 41/40:00:48:ad:08/00:00:0e:00:00/40 Emask 0x409 (media error) <F>
[ 109.203289] ata3.00: status: { DRDY ERR }
[ 109.203292] ata3.00: error: { UNC }
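The failing sector reported in the kernel log above can be cross-checked against the Read(10) CDB in the same log. The sketch below decodes the CDB bytes and checks that sector 235449794 lies inside the requested range. The function name decode_read10 is made up for illustration, and the 512-byte logical sector size is an assumption (it is consistent with the "ncq dma 1048576 in" figure, but the log does not state it).

```python
# Values copied from the kernel log above.
CDB_HEX = "28 00 0e 08 a8 00 00 08 00 00"
FAILED_SECTOR = 235449794
SECTOR_SIZE = 512  # assumption: the log does not state the logical sector size


def decode_read10(cdb_hex):
    """Return (first_lba, sector_count) for a SCSI Read(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    first_lba = int.from_bytes(cdb[2:6], "big")  # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")      # bytes 7-8: transfer length in blocks
    return first_lba, count


first_lba, count = decode_read10(CDB_HEX)
offset = FAILED_SECTOR - first_lba
print(f"request: LBA {first_lba}, {count} sectors ({count * SECTOR_SIZE} bytes)")
print(f"failed sector {FAILED_SECTOR} is {offset} sectors into the request")
```

The decoded request starts at LBA 235448320 and covers 2048 sectors, so the reported sector falls 1474 sectors into it. This fits a media error at one place on the platter rather than a bad address in the request.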
I think the storage is corrupted and I have to reset it all.
Patrick
On 21/09/2023 at 13:32, Igor Fedotov wrote:
Maybe execute systemctl reset-failed <...> or even restart the node?
On 21/09/2023 14:26, Patrick Begou wrote:
Hi Igor,
The ceph-osd.2.log remains empty on the node where this OSD is
located. This is what I get when manually restarting the OSD.
[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# systemctl restart ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
Job for ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service failed because a timeout was exceeded.
See "systemctl status ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service" and "journalctl -xe" for details.
[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# journalctl -xe
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5728 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5882 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5884 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6031 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6033 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6185 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6187 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14627 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14629 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14776 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14778 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15169 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15171 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15646 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15648 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15792 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15794 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 25561 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 25563 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
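The journal output above repeats the same message for many PIDs, which makes it hard to see how many stale processes remain in the unit's control group. A minimal sketch for summarising such output follows. The helper name leftover_processes is made up for illustration, and the regular expression only assumes the "Found left-over process <pid> (<name>)" wording seen in this journal.

```python
import re
from collections import Counter

# Matches the "Found left-over process <pid> (<name>)" wording from the journal.
# \s+ between words tolerates line breaks inside a wrapped entry.
LEFTOVER_RE = re.compile(r"Found\s+left-over\s+process\s+(\d+)\s+\((\w+)\)")


def leftover_processes(journal_text):
    """Return a list of (pid, name) tuples in order of appearance."""
    return [(int(pid), name) for pid, name in LEFTOVER_RE.findall(journal_text)]


# Two entries copied from the journal output above.
sample = """\
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 5728 (podman) in control group while starting
unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 5882 (bash) in control group while starting unit.
Ignoring.
"""

procs = leftover_processes(sample)
by_name = Counter(name for _, name in procs)
print(procs)          # (pid, name) pairs found in the text
print(dict(by_name))  # number of left-over processes per command name
```

In practice the text would come from the journal of the failing unit rather than from a pasted sample.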
Patrick
On 21/09/2023 at 12:44, Igor Fedotov wrote:
Hi Patrick,
please share the OSD restart log so we can investigate.
Thanks,
Igor
On 21/09/2023 13:41, Patrick Begou wrote:
Hi,
After a power outage on my test Ceph cluster, 2 OSDs fail to
restart. The log file shows:
8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'.
Sep 21 11:55:02 mostha1 systemd[1]: Failed to start Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Service RestartSec=10s expired, scheduling restart.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Scheduled restart job, restart counter is at 2.
Sep 21 11:55:12 mostha1 systemd[1]: Stopped Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 1858 (bash) in control group while starting unit. Ignoring.
Sep 21 11:55:12 mostha1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 2815 (podman) in control group while starting unit. Ignoring.
This is not critical, as it is a test cluster and it is currently
rebalancing onto the other OSDs, but I would like to know how to
return to HEALTH_OK status.
Smartctl shows the HDDs are OK.
So is there a way to recover the OSDs from this state? The version is
15.2.17 (just moved from 15.2.13 to 15.2.17 yesterday; I will try
to move to the latest version as soon as this problem is solved).
Thanks
Patrick
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/