Re: After power outage, osd do not restart

Hi Eneko,

I have not worked on the Ceph cluster since my last email (I was doing some user support), and now osd.2 is back in the cluster:

 -7         0.68217      host mostha1
  2    hdd  0.22739          osd.2           up   1.00000  1.00000
  5    hdd  0.45479          osd.5           up   1.00000  1.00000

Maybe it was the reboot suggested by Igor?

I will try to solve my last problem now. While upgrading from 15.2.13 to 15.2.17 I hit a memory problem on one node (these are old computers used to learn Ceph). Upgrading one of the OSDs failed, and that locked the upgrade, as Ceph refused to stop and upgrade the next OSD in the cluster. Ceph then started rebalancing the data and magically finished the upgrade. But one last OSD is still down and out, and it is a daemon problem, as smartctl reports good health for the HDD. I've replaced the faulty memory DIMMs and the node is back in the cluster. So this is my new challenge 😁
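
For the record, a rough sketch of how I plan to investigate that remaining down+out OSD (the unit name below is only the osd.2 one from this thread, used as an example; the fsid and OSD id have to be adapted):

# locate the down OSD and the host it lives on
ceph osd tree down

# check the systemd unit and the container logs for that daemon on the host
systemctl status ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
cephadm logs --name osd.2

# if the daemon comes back healthy, mark it in again and let it recover
ceph osd in 2
ceph orch daemon restart osd.2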

Using old hardware (from 2011) for learning seems a fine way to investigate Ceph reliability, as many problems can show up, but with no risk!

Patrick



On 21/09/2023 at 16:31, Eneko Lacunza wrote:
Hi Patrick,

It seems your disk or controller is damaged. Are other disks connected to the same controller working OK? If so, I'd say the disk is dead.
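
For instance, something like this could help tell a dying disk from a controller problem (the device names are only examples, adapt them to your node):

# overall SMART health and the error counters of the suspect disk
smartctl -H /dev/sda
smartctl -a /dev/sda | grep -iE 'reallocated|pending|uncorrect'

# compare with another disk behind the same controller
smartctl -H /dev/sdb

# optionally run a short self-test and read its result a few minutes later
smartctl -t short /dev/sda
smartctl -l selftest /dev/sda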

Cheers

On 21/9/23 at 16:17, Patrick Begou wrote:
Hi Igor,

a "systemctl reset-failed" doesn't restart the osd.

I rebooted the node and now it shows some errors on the HDD:

[  107.716769] ata3.00: exception Emask 0x0 SAct 0x800000 SErr 0x0 action 0x0
[  107.716782] ata3.00: irq_stat 0x40000008
[  107.716787] ata3.00: failed command: READ FPDMA QUEUED
[  107.716791] ata3.00: cmd 60/00:b8:00:a8:08/08:00:0e:00:00/40 tag 23 ncq dma 1048576 in                         res 41/40:00:c2:ad:08/00:00:0e:00:00/40 Emask 0x409 (media error) <F>
[  107.716802] ata3.00: status: { DRDY ERR }
[  107.716806] ata3.00: error: { UNC }
[  107.728547] ata3.00: configured for UDMA/133
[  107.728575] sd 2:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s
[  107.728581] sd 2:0:0:0: [sda] tag#23 Sense Key : Medium Error [current]
[  107.728585] sd 2:0:0:0: [sda] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed
[  107.728590] sd 2:0:0:0: [sda] tag#23 CDB: Read(10) 28 00 0e 08 a8 00 00 08 00 00
[  107.728592] I/O error, dev sda, sector 235449794 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 2
[  107.728623] ata3: EH complete
[  109.203256] ata3.00: exception Emask 0x0 SAct 0x20000000 SErr 0x0 action 0x0
[  109.203268] ata3.00: irq_stat 0x40000008
[  109.203274] ata3.00: failed command: READ FPDMA QUEUED
[  109.203277] ata3.00: cmd 60/08:e8:48:ad:08/00:00:0e:00:00/40 tag 29 ncq dma 4096 in                         res 41/40:00:48:ad:08/00:00:0e:00:00/40 Emask 0x409 (media error) <F>
[  109.203289] ata3.00: status: { DRDY ERR }
[  109.203292] ata3.00: error: { UNC }



I think the storage is corrupted and I will have to reset it all.
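
If it comes to that, I suppose a cephadm workflow along these lines would let me retire and redeploy the OSD (assuming osd.2 is the one sitting on the failing /dev/sda of mostha1, to be double-checked before running anything):

# take the OSD out and let the orchestrator drain and remove it
ceph osd out 2
ceph orch osd rm 2 --replace
ceph orch osd rm status

# wipe the old (or replacement) device so it can be reused
ceph orch device zap mostha1 /dev/sda --force

# redeploy an OSD on that device
ceph orch daemon add osd mostha1:/dev/sda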

Patrick


On 21/09/2023 at 13:32, Igor Fedotov wrote:

Maybe execute systemctl reset-failed <...>, or even restart the node?
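
For example, with the full osd.2 unit name that would be something like:

systemctl reset-failed ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
systemctl restart ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
systemctl status ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service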


On 21/09/2023 14:26, Patrick Begou wrote:
Hi Igor,

The ceph-osd.2.log remains empty on the node where this OSD is located. This is what I get when manually restarting the OSD:

[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# systemctl restart ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
Job for ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service failed because a timeout was exceeded.
See "systemctl status ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service" and "journalctl -xe" for details.
[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# journalctl -xe
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5728 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5882 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5884 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6031 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6033 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6185 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6187 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14627 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14629 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14776 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14778 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15169 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15171 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15646 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15648 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15792 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15794 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 25561 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 25563 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

Patrick

On 21/09/2023 at 12:44, Igor Fedotov wrote:
Hi Patrick,

please share the OSD restart log so we can investigate that.


Thanks,

Igor

On 21/09/2023 13:41, Patrick Begou wrote:
Hi,

After a power outage on my test Ceph cluster, two OSDs fail to restart. The log file shows:

8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'.
Sep 21 11:55:02 mostha1 systemd[1]: Failed to start Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Service RestartSec=10s expired, scheduling restart.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Scheduled restart job, restart counter is at 2.
Sep 21 11:55:12 mostha1 systemd[1]: Stopped Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 1858 (bash) in control group while starting unit. Ignoring.
Sep 21 11:55:12 mostha1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 2815 (podman) in control group while starting unit. Ignoring.

This is not critical, as it is a test cluster and the data is currently rebalancing onto the other OSDs, but I would like to know how to return to HEALTH_OK status.

Smartctl shows the HDDs are OK.

So is there a way to recover the OSDs from this state? The version is 15.2.17 (I just moved from 15.2.13 to 15.2.17 yesterday, and will try to move to the latest version as soon as this problem is solved).
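
For completeness, the status checks involved are roughly the following (fsid and OSD id as reported on this node):

# overall cluster health and the list of down OSDs
ceph -s
ceph health detail
ceph osd tree down

# systemd state and journal of the failing OSD unit
systemctl status ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
journalctl -u ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service -b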

Thanks

Patrick




Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



