Re: After power outage, osd do not restart

Hi Eneko,

I have not worked on the Ceph cluster since my last email (I was doing some user support), and now osd.2 is back in the cluster:

 -7         0.68217      host mostha1
  2    hdd  0.22739          osd.2           up   1.00000  1.00000
  5    hdd  0.45479          osd.5           up   1.00000  1.00000

Maybe it was the reboot suggested by Igor?

I will try to solve my last problem now. While upgrading from 15.2.13 to 15.2.17 I hit a memory problem on one node (these are old computers used to learn Ceph). Upgrading one of the OSDs failed, and that locked the upgrade, as Ceph refused to stop and upgrade the next OSD in the cluster. Ceph then started rebalancing the data and magically finished the upgrade. But one last OSD is still down and out, and it is a daemon problem, as smartctl reports good health for the HDD. I've replaced the faulty memory DIMMs and the node is back in the cluster. So this is my new challenge 😁
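
For the record, a rough sketch of how I plan to investigate that remaining down+out OSD (the unit name below is only the osd.2 one from this thread, used as an example; the fsid and OSD id have to be adapted):

# locate the down OSD and the host it lives on
ceph osd tree down

# check the systemd unit and the container logs for that daemon on the host
systemctl status ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
cephadm logs --name osd.2

# if the daemon comes back healthy, mark it in again and let it recover
ceph osd in 2
ceph orch daemon restart osd.2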

Using old hardware (from 2011) for learning seems a fine way to investigate Ceph reliability, as many problems can show up, but with no risk!

Patrick



On 21/09/2023 at 16:31, Eneko Lacunza wrote:
Hi Patrick,

It seems your disk or controller is damaged. Are other disks connected to the same controller working OK? If so, I'd say the disk is dead.
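
For instance, something like this could help tell a dying disk from a controller problem (the device names are only examples, adapt them to your node):

# overall SMART health and the error counters of the suspect disk
smartctl -H /dev/sda
smartctl -a /dev/sda | grep -iE 'reallocated|pending|uncorrect'

# compare with another disk behind the same controller
smartctl -H /dev/sdb

# optionally run a short self-test and read its result a few minutes later
smartctl -t short /dev/sda
smartctl -l selftest /dev/sda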

Cheers

On 21/9/23 at 16:17, Patrick Begou wrote:
Hi Igor,

a "systemctl reset-failed" doesn't restart the osd.

I rebooted the node and now it shows some errors on the HDD:

[  107.716769] ata3.00: exception Emask 0x0 SAct 0x800000 SErr 0x0 action 0x0
[  107.716782] ata3.00: irq_stat 0x40000008
[  107.716787] ata3.00: failed command: READ FPDMA QUEUED
[  107.716791] ata3.00: cmd 60/00:b8:00:a8:08/08:00:0e:00:00/40 tag 23 ncq dma 1048576 in                         res 41/40:00:c2:ad:08/00:00:0e:00:00/40 Emask 0x409 (media error) <F>
[  107.716802] ata3.00: status: { DRDY ERR }
[  107.716806] ata3.00: error: { UNC }
[  107.728547] ata3.00: configured for UDMA/133
[  107.728575] sd 2:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s
[  107.728581] sd 2:0:0:0: [sda] tag#23 Sense Key : Medium Error [current]
[  107.728585] sd 2:0:0:0: [sda] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed
[  107.728590] sd 2:0:0:0: [sda] tag#23 CDB: Read(10) 28 00 0e 08 a8 00 00 08 00 00
[  107.728592] I/O error, dev sda, sector 235449794 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 2
[  107.728623] ata3: EH complete
[  109.203256] ata3.00: exception Emask 0x0 SAct 0x20000000 SErr 0x0 action 0x0
[  109.203268] ata3.00: irq_stat 0x40000008
[  109.203274] ata3.00: failed command: READ FPDMA QUEUED
[  109.203277] ata3.00: cmd 60/08:e8:48:ad:08/00:00:0e:00:00/40 tag 29 ncq dma 4096 in                         res 41/40:00:48:ad:08/00:00:0e:00:00/40 Emask 0x409 (media error) <F>
[  109.203289] ata3.00: status: { DRDY ERR }
[  109.203292] ata3.00: error: { UNC }



I think the storage is corrupted and I will have to reset it all.
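
If it comes to that, I suppose a cephadm workflow along these lines would let me retire and redeploy the OSD (assuming osd.2 is the one sitting on the failing /dev/sda of mostha1, to be double-checked before running anything):

# take the OSD out and let the orchestrator drain and remove it
ceph osd out 2
ceph orch osd rm 2 --replace
ceph orch osd rm status

# wipe the old (or replacement) device so it can be reused
ceph orch device zap mostha1 /dev/sda --force

# redeploy an OSD on that device
ceph orch daemon add osd mostha1:/dev/sda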

Patrick


On 21/09/2023 at 13:32, Igor Fedotov wrote:

Maybe execute systemctl reset-failed <...>, or even restart the node?
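
For example, with the full osd.2 unit name that would be something like:

systemctl reset-failed ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
systemctl restart ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
systemctl status ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service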


On 21/09/2023 14:26, Patrick Begou wrote:
Hi Igor,

The ceph-osd.2.log remains empty on the node where this OSD is located. This is what I get when manually restarting the OSD:

[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# systemctl restart ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
Job for ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service failed because a timeout was exceeded.
See "systemctl status ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service" and "journalctl -xe" for details.
[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# journalctl -xe
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5728 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5882 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5884 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6031 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6033 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6185 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6187 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14627 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14629 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14776 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14778 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15169 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15171 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15646 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15648 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15792 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15794 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 25561 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 25563 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

Patrick

On 21/09/2023 at 12:44, Igor Fedotov wrote:
Hi Patrick,

please share the OSD restart log so we can investigate that.


Thanks,

Igor

On 21/09/2023 13:41, Patrick Begou wrote:
Hi,

After a power outage on my test Ceph cluster, two OSDs fail to restart. The log file shows:

8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'.
Sep 21 11:55:02 mostha1 systemd[1]: Failed to start Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Service RestartSec=10s expired, scheduling restart.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Scheduled restart job, restart counter is at 2.
Sep 21 11:55:12 mostha1 systemd[1]: Stopped Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 1858 (bash) in control group while starting unit. Ignoring.
Sep 21 11:55:12 mostha1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 2815 (podman) in control group while starting unit. Ignoring.

This is not critical, as it is a test cluster and the data is currently rebalancing onto the other OSDs, but I would like to know how to return to HEALTH_OK status.

Smartctl shows the HDDs are OK.

So is there a way to recover the OSDs from this state? The version is 15.2.17 (I just moved from 15.2.13 to 15.2.17 yesterday, and will try to move to the latest version as soon as this problem is solved).
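
For completeness, the status checks involved are roughly the following (fsid and OSD id as reported on this node):

# overall cluster health and the list of down OSDs
ceph -s
ceph health detail
ceph osd tree down

# systemd state and journal of the failing OSD unit
systemctl status ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
journalctl -u ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service -b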

Thanks

Patrick




Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



