Hi,

6-host, 16-OSD cluster here, all SATA SSDs. All Ceph daemons are version 18.2.2. Host OS is Ubuntu 24.04. Intel X540 10Gb/s interfaces are used for the cluster network. Everything is fine while using a 1Gb/s switch. After moving to a 10Gb/s switch (Netgear XS712T), OSDs start failing heartbeat checks one by one and are marked 'down', until only 3 or 4 OSDs remain up. By then the cluster is unusable (slow ops, inactive PGs).
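For context, the public/cluster split is done with the usual ceph.conf options, roughly like this (the subnets below are placeholders, not our actual ranges):

  [global]
  # client and monitor traffic stays on the existing (public) network
  public_network  = 192.168.1.0/24
  # OSD replication and heartbeat traffic goes over the X540 10Gb/s interfaces
  cluster_network = 192.168.2.0/24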
Here is a sample sequence from the log of one of the OSDs:

ceph-osd[23402]: osd.3 77434 heartbeat_check: no reply from 129.170.x.x:6802 osd.13 ever on either front or back
ceph-osd[23402]: log_channel(cluster) log [WRN] : 101 slow requests (by type [ 'delayed' : 101 ] most affected pool [ 'default.rgw.log' : 96 ])
ceph-osd[23402]: log_channel(cluster) log [WRN] : Monitor daemon marked osd.3 down, but it is still running
ceph-osd[23402]: log_channel(cluster) log [DBG] : map e77442 wrongly marked me down at e77441
ceph-osd[23402]: osd.3 77442 start_waiting_for_healthy
ceph-osd[23402]: osd.3 77434 is_healthy false -- only 0/10 up peers (less than 33%)
ceph-osd[23402]: osd.3 77434 not healthy; waiting to boot

The OSD service container keeps running, but the OSD never boots.

Has anyone experienced this? Any ideas on what should be fixed? Please let me know what other info would be useful.
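If specific output would help, I can post results from commands along these lines (a sketch of what I have to hand; the peer address is a placeholder):

  ceph -s
  ceph health detail
  ceph osd tree
  # check that heartbeat peers are reachable over the 10Gb/s ports, e.g.
  # full-size pings across the cluster network (-s 8972 if MTU 9000 is set):
  ping -M do -s 1472 <peer cluster-network IP>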
Best regards,
-- 
Sarunas Burdulis
Dartmouth Mathematics
math.dartmouth.edu/~sarunas
· https://useplaintext.email ·