Hi,

6-host, 16-OSD cluster here, all SATA SSDs. All Ceph daemons are version 18.2.2. Host OS is Ubuntu 24.04. Intel X540 10Gb/s interfaces are used for the cluster network. Everything is fine while using a 1Gb/s switch. After moving to a 10Gb/s switch (Netgear XS712T), OSDs start failing heartbeat checks one by one and are marked 'down', until only 3 or 4 OSDs remain up. By then the cluster is unusable (slow ops, inactive PGs).
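For context, the public/cluster split is done with the usual ceph.conf options, roughly like this (the subnets below are placeholders, not our actual ranges):

  [global]
  # client and monitor traffic stays on the existing (public) network
  public_network  = 192.168.1.0/24
  # OSD replication and heartbeat traffic goes over the X540 10Gb/s interfaces
  cluster_network = 192.168.2.0/24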
Here is a sample sequence from the log of one of the OSDs:

ceph-osd[23402]: osd.3 77434 heartbeat_check: no reply from 129.170.x.x:6802 osd.13 ever on either front or back
ceph-osd[23402]: log_channel(cluster) log [WRN] : 101 slow requests (by type [ 'delayed' : 101 ] most affected pool [ 'default.rgw.log' : 96 ])
ceph-osd[23402]: log_channel(cluster) log [WRN] : Monitor daemon marked osd.3 down, but it is still running
ceph-osd[23402]: log_channel(cluster) log [DBG] : map e77442 wrongly marked me down at e77441
ceph-osd[23402]: osd.3 77442 start_waiting_for_healthy
ceph-osd[23402]: osd.3 77434 is_healthy false -- only 0/10 up peers (less than 33%)
ceph-osd[23402]: osd.3 77434 not healthy; waiting to boot

The OSD service container keeps running, but the OSD never boots.

Has anyone experienced this? Any ideas on what should be fixed? Please let me know what other info would be useful.
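If specific output would help, I can post results from commands along these lines (a sketch of what I have to hand; the peer address is a placeholder):

  ceph -s
  ceph health detail
  ceph osd tree
  # check that heartbeat peers are reachable over the 10Gb/s ports, e.g.
  # full-size pings across the cluster network (-s 8972 if MTU 9000 is set):
  ping -M do -s 1472 <peer cluster-network IP>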
Best regards,
-- 
Sarunas Burdulis
Dartmouth Mathematics
math.dartmouth.edu/~sarunas
· https://useplaintext.email ·