Re: After power outage, osd do not restart


 



Hi Igor,

A "systemctl reset-failed" doesn't restart the OSD.

I rebooted the node and now it shows some errors on the HDD:

[  107.716769] ata3.00: exception Emask 0x0 SAct 0x800000 SErr 0x0 action 0x0
[  107.716782] ata3.00: irq_stat 0x40000008
[  107.716787] ata3.00: failed command: READ FPDMA QUEUED
[  107.716791] ata3.00: cmd 60/00:b8:00:a8:08/08:00:0e:00:00/40 tag 23 ncq dma 1048576 in
         res 41/40:00:c2:ad:08/00:00:0e:00:00/40 Emask 0x409 (media error) <F>
[  107.716802] ata3.00: status: { DRDY ERR }
[  107.716806] ata3.00: error: { UNC }
[  107.728547] ata3.00: configured for UDMA/133
[  107.728575] sd 2:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s
[  107.728581] sd 2:0:0:0: [sda] tag#23 Sense Key : Medium Error [current]
[  107.728585] sd 2:0:0:0: [sda] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed
[  107.728590] sd 2:0:0:0: [sda] tag#23 CDB: Read(10) 28 00 0e 08 a8 00 00 08 00 00
[  107.728592] I/O error, dev sda, sector 235449794 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 2
[  107.728623] ata3: EH complete
[  109.203256] ata3.00: exception Emask 0x0 SAct 0x20000000 SErr 0x0 action 0x0
[  109.203268] ata3.00: irq_stat 0x40000008
[  109.203274] ata3.00: failed command: READ FPDMA QUEUED
[  109.203277] ata3.00: cmd 60/08:e8:48:ad:08/00:00:0e:00:00/40 tag 29 ncq dma 4096 in
         res 41/40:00:48:ad:08/00:00:0e:00:00/40 Emask 0x409 (media error) <F>
[  109.203289] ata3.00: status: { DRDY ERR }
[  109.203292] ata3.00: error: { UNC }



I think the storage is corrupted and I have to reset it all.
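For anyone reading along, here is a minimal sketch (not part of the original exchange) of how the failing device and sector could be pulled out of kernel messages like the ones above. The function name `failed_sectors` and the regular expression are illustrative assumptions; the pattern only targets the "I/O error, dev ..., sector ..." line format shown in this log and may need adjusting for other kernel versions.

```python
import re

# Matches the block-layer error line as it appears in the log above, e.g.
# "I/O error, dev sda, sector 235449794 op 0x0:(READ) flags ..."
# (assumed format; other kernels may word this line differently)
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+) op \S+:\((\w+)\)")


def failed_sectors(dmesg_text):
    """Return (device, sector, operation) tuples for each I/O error found."""
    return [
        (dev, int(sector), op)
        for dev, sector, op in IO_ERROR.findall(dmesg_text)
    ]


# Sample line copied from the dmesg output in this message.
sample = (
    "[  107.728592] I/O error, dev sda, sector 235449794 "
    "op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 2"
)

print(failed_sectors(sample))
```

Feeding it the full saved output of `dmesg` would list every sector the kernel reported as unreadable, which gives a rough idea of how widespread the media errors are.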

Patrick


On 21/09/2023 at 13:32, Igor Fedotov wrote:

Maybe execute systemctl reset-failed <...> or even restart the node?


On 21/09/2023 14:26, Patrick Begou wrote:
Hi Igor,

The ceph-osd.2.log remains empty on the node where this OSD is located. This is what I get when manually restarting the OSD.

[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# systemctl restart ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
Job for ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service failed because a timeout was exceeded.
See "systemctl status ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service" and "journalctl -xe" for details.
[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# journalctl -xe
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5728 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5882 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5884 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6031 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6033 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6185 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6187 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14627 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14629 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14776 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 14778 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15169 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15171 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15646 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15648 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15792 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15794 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 25561 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 25563 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
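As a side note, a journal dump like the one above is easier to read once the left-over processes are grouped by command. The following is only a sketch under stated assumptions: `leftover_processes` is a hypothetical helper, and the pattern relies on the exact "Found left-over process N (name)" wording systemd printed here.

```python
import re

# Pattern for the systemd message seen in the journal output above
# (assumed wording; taken from this log, not from systemd documentation).
LEFTOVER = re.compile(r"Found left-over process (\d+) \((\w+)\)")


def leftover_processes(journal_text):
    """Group left-over PIDs by command name, in order of appearance."""
    by_command = {}
    for pid, command in LEFTOVER.findall(journal_text):
        by_command.setdefault(command, []).append(int(pid))
    return by_command


# Two entries copied from the journal output in this message.
sample = (
    "ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: "
    "Found left-over process 5728 (podman) in control group while starting unit. Ignoring.\n"
    "ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: "
    "Found left-over process 5882 (bash) in control group while starting unit. Ignoring.\n"
)

print(leftover_processes(sample))
```

The resulting PID lists show at a glance how many stale `podman` and `bash` processes each failed start attempt left behind in the unit's control group.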

Patrick

On 21/09/2023 at 12:44, Igor Fedotov wrote:
Hi Patrick,

Please share the OSD restart log so we can investigate.


Thanks,

Igor

On 21/09/2023 13:41, Patrick Begou wrote:
Hi,

After a power outage on my test Ceph cluster, 2 OSDs fail to restart. The log file shows:

8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'.
Sep 21 11:55:02 mostha1 systemd[1]: Failed to start Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Service RestartSec=10s expired, scheduling restart.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Scheduled restart job, restart counter is at 2.
Sep 21 11:55:12 mostha1 systemd[1]: Stopped Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 1858 (bash) in control group while starting unit. Ignoring.
Sep 21 11:55:12 mostha1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 2815 (podman) in control group while starting unit. Ignoring.

This is not critical, as it is a test cluster and it is currently rebalancing onto the other OSDs, but I would like to know how to return to HEALTH_OK status.

smartctl shows the HDDs are OK.

So is there a way to recover the OSDs from this state? The version is 15.2.17 (just moved from 15.2.13 to 15.2.17 yesterday; I will try to move to the latest versions as soon as this problem is solved).

Thanks

Patrick

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx





