Hi Eneko,
I haven't worked on the Ceph cluster since my last email (I was doing some user support), and now osd.2 is back in the cluster:

-7         0.68217      host mostha1
 2    hdd  0.22739          osd.2      up   1.00000  1.00000
 5    hdd  0.45479          osd.5      up   1.00000  1.00000

Maybe thanks to the reboot suggested by Igor?
I will now try to solve my last problem. While upgrading from 15.2.13 to
15.2.17 I hit a memory problem on one node (these are old computers used
to learn Ceph).
Upgrading one of the OSDs failed, and that blocked the upgrade because
Ceph refused to stop and upgrade the next OSD in the cluster. But Ceph
started rebalancing the data and magically finished the upgrade.
One last OSD is still down and out, and it looks like a daemon problem,
as smartctl reports a good health status for the HDD.
I've replaced the faulty memory DIMMs and the node is back in the cluster.
So this is my new challenge 😁
Using old hardware (2011) for learning seems a fine way to investigate Ceph
reliability, as many problems can show up, but at no risk!
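For anyone following along, this is roughly the sequence I use to inspect and kick a down OSD on a cephadm-managed cluster. It is a sketch, not an official procedure; the unit name uses this cluster's FSID and osd.2 as they appear later in this thread, so substitute your own:

```shell
# List only the OSDs that are currently down
ceph osd tree down

# On the host that carries the OSD: check the cephadm-managed systemd unit
systemctl status ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service

# Clear any 'failed' state left over from previous start timeouts, then retry
systemctl reset-failed ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
systemctl restart ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service

# Watch the unit's journal while it starts
journalctl -u ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service -f
```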
Patrick
On 21/09/2023 at 16:31, Eneko Lacunza wrote:
Hi Patrick,
It seems your disk or controller is damaged. Are the other disks
connected to the same controller working OK? If so, I'd say the disk is dead.
Cheers
On 21/9/23 at 16:17, Patrick Begou wrote:
Hi Igor,
a "systemctl reset-failed" doesn't restart the OSD.
I rebooted the node and now it shows some errors on the HDD:
[  107.716769] ata3.00: exception Emask 0x0 SAct 0x800000 SErr 0x0 action 0x0
[  107.716782] ata3.00: irq_stat 0x40000008
[  107.716787] ata3.00: failed command: READ FPDMA QUEUED
[  107.716791] ata3.00: cmd 60/00:b8:00:a8:08/08:00:0e:00:00/40 tag 23 ncq dma 1048576 in
                        res 41/40:00:c2:ad:08/00:00:0e:00:00/40 Emask 0x409 (media error) <F>
[  107.716802] ata3.00: status: { DRDY ERR }
[  107.716806] ata3.00: error: { UNC }
[  107.728547] ata3.00: configured for UDMA/133
[  107.728575] sd 2:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s
[  107.728581] sd 2:0:0:0: [sda] tag#23 Sense Key : Medium Error [current]
[  107.728585] sd 2:0:0:0: [sda] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed
[  107.728590] sd 2:0:0:0: [sda] tag#23 CDB: Read(10) 28 00 0e 08 a8 00 00 08 00 00
[  107.728592] I/O error, dev sda, sector 235449794 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 2
[  107.728623] ata3: EH complete
[  109.203256] ata3.00: exception Emask 0x0 SAct 0x20000000 SErr 0x0 action 0x0
[  109.203268] ata3.00: irq_stat 0x40000008
[  109.203274] ata3.00: failed command: READ FPDMA QUEUED
[  109.203277] ata3.00: cmd 60/08:e8:48:ad:08/00:00:0e:00:00/40 tag 29 ncq dma 4096 in
                        res 41/40:00:48:ad:08/00:00:0e:00:00/40 Emask 0x409 (media error) <F>
[  109.203289] ata3.00: status: { DRDY ERR }
[  109.203292] ata3.00: error: { UNC }
I think the storage is corrupted and I have to reset it all.
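In case it helps someone else, here is a hedged sketch of how I'd retire an OSD whose disk shows unrecoverable media errors (cephadm-era commands; the host, device path and OSD id are the ones from this thread, so adjust them to your cluster, and with a failing disk, physically replace the drive rather than reuse it):

```shell
# Mark the OSD out so its data is rebalanced onto the other OSDs,
# then watch recovery progress
ceph osd out 2
ceph -s

# Once recovery is complete, remove the OSD from the CRUSH map,
# the OSD map and the auth database in one step
ceph osd purge 2 --yes-i-really-mean-it

# Wipe the old device only if you intend to redeploy on it
# (not advisable here, given the media errors)
ceph orch device zap mostha1 /dev/sda --force
```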
Patrick
On 21/09/2023 at 13:32, Igor Fedotov wrote:
Maybe execute systemctl reset-failed <...> or even restart the node?
On 21/09/2023 14:26, Patrick Begou wrote:
Hi Igor,
the ceph-osd.2.log remains empty on the node where this OSD is
located. This is what I get when manually restarting the OSD.
[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# systemctl restart ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
Job for ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service failed because a timeout was exceeded.
See "systemctl status ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service" and "journalctl -xe" for details.
[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# journalctl -xe
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5728 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5882 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[... the same pair of messages repeats for 17 more left-over bash/podman processes: 5884, 6031, 6033, 6185, 6187, 14627, 14629, 14776, 14778, 15169, 15171, 15646, 15648, 15792, 15794, 25561, 25563 ...]
Patrick
On 21/09/2023 at 12:44, Igor Fedotov wrote:
Hi Patrick,
please share the OSD restart log so we can investigate.
Thanks,
Igor
On 21/09/2023 13:41, Patrick Begou wrote:
Hi,
After a power outage on my test Ceph cluster, 2 OSDs fail to
restart. The log file shows:
8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'.
Sep 21 11:55:02 mostha1 systemd[1]: Failed to start Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Service RestartSec=10s expired, scheduling restart.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Scheduled restart job, restart counter is at 2.
Sep 21 11:55:12 mostha1 systemd[1]: Stopped Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 1858 (bash) in control group while starting unit. Ignoring.
Sep 21 11:55:12 mostha1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 2815 (podman) in control group while starting unit. Ignoring.
This is not critical, as it is a test cluster and the data is currently
rebalancing onto the other OSDs, but I would like to know how to return
to HEALTH_OK status.
Smartctl shows the HDDs are OK.
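For reference, the smartctl checks look like this (a sketch; /dev/sda is just an example device path here, and note the overall health check can still read PASSED while pending or reallocated sectors grow, so the attribute table is worth reading too):

```shell
# Overall SMART health self-assessment (quick pass/fail)
smartctl -H /dev/sda

# Vendor attribute table: watch Reallocated_Sector_Ct,
# Current_Pending_Sector and Offline_Uncorrectable
smartctl -A /dev/sda

# Schedule an extended (surface) self-test;
# read the result later with: smartctl -l selftest /dev/sda
smartctl -t long /dev/sda
```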
So, is there a way to recover the OSDs from this state? The version is
15.2.17 (I just moved from 15.2.13 to 15.2.17 yesterday, and will try
to move to the latest releases as soon as this problem is solved).
Thanks
Patrick
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/