Re: OSD stuck down

Hi Curt,

I increased the debug level, but the OSD daemon still doesn't log anything more than what I already posted. dmesg does not report anything suspect (the OSD's disk shows the very same messages as the disks of working OSDs), and SMART is not very helpful:


# smartctl -a /dev/sdf
smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-477.13.1.el8_8.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WDC
Product:              WD2002FYPS-02W3B
Revision:             R001
Compliance:           SPC-3
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        10000 rpm
Logical Unit id:      0x0004d927fffff850
Serial number:        WD-WCAVY7349539
Device type:          disk
Transport protocol:   Fibre channel (FCP-2)
Local Time is:        Thu Jun 15 11:52:57 2023 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Elements in grown defect list: 0

Error Counter logging not supported

Device does not support Self Test logging
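
For reference, the debug knobs I raised are of this kind (a ceph.conf-style sketch, not my exact values; with a cephadm cluster the same settings can also be applied at runtime with "ceph config set osd.34 debug_osd 20/20"):

```ini
[osd]
# Verbose OSD and BlueStore/block-device logging.
# Values are the usual "log level / in-memory level" pair.
debug osd = 20/20
debug bluestore = 20/20
debug bdev = 20/20
```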


The only suspect thing I found is that the systemd journal for the problematic OSD is missing a part compared with the other OSDs. Here is the problematic one:

Jun 15 11:59:43 balin systemd[1]: Started Ceph osd.34 for b1029256-7bb3-11ec-a8ce-ac1f6b627b45.
Jun 15 12:00:06 balin bash[2776]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-34
Jun 15 12:00:06 balin bash[2776]: Running command: /usr/bin/ceph-bluestore-tool prime-osd-dir --path /var/lib/ceph/osd/ceph-34 --no-mon-config --dev /dev/mapper/ceph--9a4c3927--d3da--4b49--80fe--6cdc00c7897c-osd--block--36d2f793--e5c7--4247--a314--bcc40389d50d
Jun 15 12:00:06 balin bash[2776]: Running command: /usr/bin/chown -h ceph:ceph /dev/mapper/ceph--9a4c3927--d3da--4b49--80fe--6cdc00c7897c-osd--block--36d2f793--e5c7--4247--a314--bcc40389d50d
Jun 15 12:00:06 balin bash[2776]: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-6
Jun 15 12:00:06 balin bash[2776]: Running command: /usr/bin/ln -s /dev/mapper/ceph--9a4c3927--d3da--4b49--80fe--6cdc00c7897c-osd--block--36d2f793--e5c7--4247--a314--bcc40389d50d /var/lib/ceph/osd/ceph-34/block
Jun 15 12:00:06 balin bash[2776]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-34
Jun 15 12:00:06 balin bash[2776]: --> ceph-volume raw activate successful for osd ID: 34
Jun 15 12:00:12 balin bash[5536]: debug 2023-06-15T10:00:12.977+0000 7f8e1c57b540 -1 Falling back to public interface

while for all the other OSDs it looks like:

Jun 15 11:59:43 balin systemd[1]: Started Ceph osd.29 for b1029256-7bb3-11ec-a8ce-ac1f6b627b45.
Jun 15 12:00:06 balin bash[2793]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-29
Jun 15 12:00:06 balin bash[2793]: Running command: /usr/bin/ceph-bluestore-tool prime-osd-dir --path /var/lib/ceph/osd/ceph-29 --no-mon-config --dev /dev/mapper/ceph--06d03e18--2c8b--48a1--9bf6--7de5ff16af83-osd--block--5be3d54e--1fc8--400f--8664--b7d0d509f9b5
Jun 15 12:00:06 balin bash[2793]: Running command: /usr/bin/chown -h ceph:ceph /dev/mapper/ceph--06d03e18--2c8b--48a1--9bf6--7de5ff16af83-osd--block--5be3d54e--1fc8--400f--8664--b7d0d509f9b5
Jun 15 12:00:06 balin bash[2793]: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-5
Jun 15 12:00:06 balin bash[2793]: Running command: /usr/bin/ln -s /dev/mapper/ceph--06d03e18--2c8b--48a1--9bf6--7de5ff16af83-osd--block--5be3d54e--1fc8--400f--8664--b7d0d509f9b5 /var/lib/ceph/osd/ceph-29/block
Jun 15 12:00:06 balin bash[2793]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-29
Jun 15 12:00:06 balin bash[2793]: --> ceph-volume raw activate successful for osd ID: 29
Jun 15 12:00:15 balin bash[5687]: debug 2023-06-15T10:00:15.093+0000 7fdbc133a540 -1 Falling back to public interface
Jun 15 12:00:36 balin bash[5687]: debug 2023-06-15T10:00:36.528+0000 7fdbc133a540 -1 bdev(0x55bc3ba7dc00 /var/lib/ceph/osd/ceph-29/block) read_random stalled read 0x101c28398f~1149 (buffered) since 357.9605221s, timeout is 5.0000000s
Jun 15 12:00:42 balin bash[5687]: debug 2023-06-15T10:00:42.915+0000 7fdbc133a540 -1 bdev(0x55bc3ba7dc00 /var/lib/ceph/osd/ceph-29/block) read_random stalled read 0x1236eeff30~f4f (buffered) since 364.5580473s, timeout is 5.0000000s
Jun 15 12:01:10 balin bash[5687]: debug 2023-06-15T10:01:10.767+0000 7fdbc133a540 -1 bdev(0x55bc3ba7dc00 /var/lib/ceph/osd/ceph-29/block) read stalled read 0x4544a98000~8000 (buffered) since 392.2135163s, timeout is 5.0000000s
Jun 15 12:01:12 balin bash[5687]: debug 2023-06-15T10:01:12.414+0000 7fdbc133a540 -1 osd.29 158243 log_to_monitors true
Jun 15 12:01:17 balin bash[5687]: debug 2023-06-15T10:01:17.454+0000 7fdbb211c700 -1 osd.29 158243 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory

So it looks like the OSD stops right after "Falling back to public interface". Indeed, I see one OSD process that uses no CPU and much less memory than the others, so the OSD is probably partially stuck: perhaps waiting on the disk while still communicating with the cluster. I'll replace the disk and see if that helps.
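
As an aside, those "stalled read" lines for the healthy OSDs come from BlueStore timing its block-device reads against a 5 s threshold. A minimal sketch of that kind of watchdog in Python (names and structure are my own illustration, not Ceph's code):

```python
import time

# Threshold matching the "timeout is 5.0000000s" shown in the log above.
STALL_TIMEOUT_S = 5.0

def timed_read(read_fn, timeout_s=STALL_TIMEOUT_S):
    """Run read_fn, returning (result, elapsed_seconds, stalled_flag).

    A read is flagged as stalled when it takes longer than timeout_s,
    analogous to BlueStore's bdev "stalled read" warning.
    """
    start = time.monotonic()
    result = read_fn()
    elapsed = time.monotonic() - start
    stalled = elapsed > timeout_s
    if stalled:
        print(f"stalled read: took {elapsed:.1f}s, timeout is {timeout_s:.1f}s")
    return result, elapsed, stalled

# A read that sleeps simulates a hung disk.
def slow_read():
    time.sleep(0.2)
    return b"data"

# With a 0.1s threshold the simulated slow read gets flagged.
result, elapsed, stalled = timed_read(slow_read, timeout_s=0.1)
```

The point is that these warnings fire per read while the OSD keeps running, which fits the picture of a process that is alive but starved by its disk.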

Nicola


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
