Re: After power outage, osd do not restart

Eneko Lacunza <elacunza@xxxxxxxxx> · Thu, 21 Sep 2023 16:31:21 +0200

Hi Patrick,

It seems your disk or controller is damaged. Are the other disks connected 
to the same controller working OK? If so, I'd say the disk is dead.

Cheers

On 21/9/23 at 16:17, Patrick Begou wrote:
Hi Igor,

a "systemctl reset-failed" doesn't restart the OSD.

I rebooted the node and now it shows errors on the HDD:

[  107.716769] ata3.00: exception Emask 0x0 SAct 0x800000 SErr 0x0 action 0x0
[  107.716782] ata3.00: irq_stat 0x40000008
[  107.716787] ata3.00: failed command: READ FPDMA QUEUED
[  107.716791] ata3.00: cmd 60/00:b8:00:a8:08/08:00:0e:00:00/40 tag 23 ncq dma 1048576 in
                        res 41/40:00:c2:ad:08/00:00:0e:00:00/40 Emask 0x409 (media error) <F>
[  107.716802] ata3.00: status: { DRDY ERR }
[  107.716806] ata3.00: error: { UNC }
[  107.728547] ata3.00: configured for UDMA/133
[  107.728575] sd 2:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s
[  107.728581] sd 2:0:0:0: [sda] tag#23 Sense Key : Medium Error [current]
[  107.728585] sd 2:0:0:0: [sda] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed
[  107.728590] sd 2:0:0:0: [sda] tag#23 CDB: Read(10) 28 00 0e 08 a8 00 00 08 00 00
[  107.728592] I/O error, dev sda, sector 235449794 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 2
[  107.728623] ata3: EH complete
[  109.203256] ata3.00: exception Emask 0x0 SAct 0x20000000 SErr 0x0 action 0x0
[  109.203268] ata3.00: irq_stat 0x40000008
[  109.203274] ata3.00: failed command: READ FPDMA QUEUED
[  109.203277] ata3.00: cmd 60/08:e8:48:ad:08/00:00:0e:00:00/40 tag 29 ncq dma 4096 in
                        res 41/40:00:48:ad:08/00:00:0e:00:00/40 Emask 0x409 (media error) <F>
[  109.203289] ata3.00: status: { DRDY ERR }
[  109.203292] ata3.00: error: { UNC }
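
For anyone who wants to pull the failing sectors out of kernel output like the above, here is a minimal Python sketch. It only assumes the two message shapes visible in this log ("I/O error, dev ..., sector ..." and "(media error)"); the function name `summarize` is made up for illustration, and whitespace is collapsed so that hard-wrapped lines still match.

```python
import re

# Block-layer line, e.g. "I/O error, dev sda, sector 235449794 op 0x0:(READ)"
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")
# libata result line, e.g. "Emask 0x409 (media error)"
MEDIA_ERROR = re.compile(r"\(media error\)")

def summarize(dmesg_text):
    """Count media errors and collect failing sectors per device."""
    # Collapse all whitespace so lines wrapped by a mail client still match.
    flat = " ".join(dmesg_text.split())
    sectors = {}
    for dev, sector in IO_ERROR.findall(flat):
        sectors.setdefault(dev, []).append(int(sector))
    return {"media_errors": len(MEDIA_ERROR.findall(flat)),
            "bad_sectors": sectors}

# Excerpt of the log above, hard-wrapped the way a mail client might wrap it.
sample = """\
[  107.716791] ata3.00: cmd 60/00:b8:00:a8:08/08:00:0e:00:00/40 tag 23
ncq dma 1048576 in
                        res 41/40:00:c2:ad:08/00:00:0e:00:00/40 Emask
0x409 (media error) <F>
[  107.728592] I/O error, dev sda, sector 235449794 op 0x0:(READ)
flags 0x80700 phys_seg 6 prio class 2
"""

print(summarize(sample))
```

Feeding it the full output of `dmesg` instead of the excerpt would give one list of sectors per affected device.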

I think the storage is corrupted and I have to reset it all.

Patrick

On 21/09/2023 at 13:32, Igor Fedotov wrote:

Maybe run systemctl reset-failed <...> or even restart the node?

On 21/09/2023 14:26, Patrick Begou wrote:
Hi Igor,

the ceph-osd.2.log remains empty on the node where this OSD is 
located. This is what I get when manually restarting the OSD.

[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# systemctl restart ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
Job for ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service failed because a timeout was exceeded.
See "systemctl status ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service" and "journalctl -xe" for details.
[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# journalctl -xe
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5728 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5882 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5884 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6031 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6033 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6185 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6187 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14627 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14629 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14776 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14778 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15169 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15171 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15646 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15648 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15792 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15794 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 25561 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 25563 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
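
The journal output above repeats the same warning for many PIDs. As a rough aid, this Python sketch lists the left-over processes systemd reports, assuming only the "Found left-over process <pid> (<name>)" wording seen here; the helper name `leftover_processes` is hypothetical.

```python
import re
from collections import Counter

# systemd wording as seen in the journal: "Found left-over process 5728 (podman)"
LEFTOVER = re.compile(r"Found left-over process (\d+) \((\w+)\)")

def leftover_processes(journal_text):
    """Return (pid, command) pairs for every left-over process reported."""
    # Collapse whitespace first so hard-wrapped journal lines still match.
    flat = " ".join(journal_text.split())
    return [(int(pid), name) for pid, name in LEFTOVER.findall(flat)]

# Two entries from the journal above, hard-wrapped as in a mail.
sample = """\
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 5728 (podman) in control group while starting
unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]:
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found
left-over process 5882 (bash) in control group while starting unit.
Ignoring.
"""

procs = leftover_processes(sample)
print(procs)
print(Counter(name for _, name in procs))
```

Run against the whole journal excerpt, the per-command count shows how many bash and podman processes earlier start attempts left behind in the unit's control group.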

Patrick

On 21/09/2023 at 12:44, Igor Fedotov wrote:
Hi Patrick,

please share the OSD restart log so we can investigate.

Thanks,

Igor

On 21/09/2023 13:41, Patrick Begou wrote:
Hi,

After a power outage on my test Ceph cluster, 2 OSDs fail to 
restart. The log file shows:

8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'.
Sep 21 11:55:02 mostha1 systemd[1]: Failed to start Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Service RestartSec=10s expired, scheduling restart.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Scheduled restart job, restart counter is at 2.
Sep 21 11:55:12 mostha1 systemd[1]: Stopped Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 1858 (bash) in control group while starting unit. Ignoring.
Sep 21 11:55:12 mostha1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 2815 (podman) in control group while starting unit. Ignoring.

This is not critical as it is a test cluster and it is currently 
rebalancing onto the other OSDs, but I would like to know how to return 
to HEALTH_OK status.

smartctl shows the HDDs are OK.
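
A side note on the SMART check: the overall health verdict can read PASSED while individual attributes already count bad sectors, so the attribute table printed by `smartctl -A` is worth a look too. Below is a minimal Python sketch, assuming the usual ATA attribute table layout (ID in the first column, attribute name in the second, raw value in the tenth). The sample rows are made up for illustration and are not taken from this cluster, the helper name is hypothetical, and attribute names vary between vendors.

```python
# Attributes that commonly point at failing sectors on ATA disks.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

def suspicious_attributes(smartctl_a_output):
    """Return watched attributes whose raw value is greater than zero."""
    found = {}
    for line in smartctl_a_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                found[fields[1]] = int(raw)
    return found

# Hypothetical sample in the layout printed by `smartctl -A /dev/sdX`.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       12345
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
"""

print(suspicious_attributes(sample))
```

Any non-empty result is a reason to look at the disk more closely even if the overall verdict is still PASSED.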

So is there a way to recover the OSDs from this state? The version is 
15.2.17 (just moved from 15.2.13 to 15.2.17 yesterday; I will try 
to move to the latest versions as soon as this problem is solved).

Thanks

Patrick

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/