Re: Random heartbeat_map timed out

Igor Fedotov <ifedotov@xxxxxxx> · Wed, 23 Dec 2020 18:34:29 +0300

Hi Seena,

One frequent cause of such timeouts is slow RocksDB operation, which in
turn might be caused by bluefs_buffered_io being set to false and/or DB
"fragmentation" after massive data removal.

Hence the potential workarounds are adjusting bluefs_buffered_io and 
manual RocksDB compaction.
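
As a rough sketch of what those two workarounds could look like in
practice, the Python helper below only builds the corresponding ceph CLI
invocations and prints them unless told otherwise. It assumes a release
with the centralized config store ("ceph config set") and with "ceph tell
osd.N compact" available; the function names and OSD ids are hypothetical,
and depending on the release the bluefs_buffered_io change may only take
effect after an OSD restart.

```python
import subprocess


def workaround_commands(osd_ids, buffered_io=True):
    """Build the ceph CLI calls for the two workarounds (nothing is executed)."""
    value = "true" if buffered_io else "false"
    # Assumption: the cluster uses the centralized config store.
    cmds = [["ceph", "config", "set", "osd", "bluefs_buffered_io", value]]
    for osd_id in osd_ids:
        # Online RocksDB compaction, one OSD at a time.
        cmds.append(["ceph", "tell", f"osd.{osd_id}", "compact"])
    return cmds


def run(cmds, dry_run=True):
    """Print the commands, or execute them sequentially when dry_run is False."""
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling run(workaround_commands([0, 1])) prints the three commands for
review. Compaction adds load to an OSD while it runs, so it is probably
safer to go through the OSDs one by one rather than in parallel.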

This topic has been discussed on this mailing list and in the relevant
tickets multiple times.

Thanks,

Igor

On 12/23/2020 3:24 PM, Seena Fallah wrote:
Hi,

All my OSD nodes in the SSD tier are randomly hitting heartbeat_map
timeouts and I can't find out why!

7ff2ed3f2700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7ff2c8943700' had timed out after 15

It occurs many times a day and brings my cluster down.
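
To get a feel for how often the timeouts occur and which thread pool is
affected, log lines of the form quoted above could be tallied with a small
script. This is only a sketch: the regular expression is inferred from the
single log line in this message, and the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the one example line quoted in this thread.
TIMEOUT_RE = re.compile(
    r"heartbeat_map is_healthy '(?P<pool>\S+) thread (?P<tid>0x[0-9a-fA-F]+)' "
    r"had timed out after (?P<secs>\d+)"
)


def count_timeouts(lines):
    """Count heartbeat_map timeouts per (thread pool, timeout in seconds)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("pool"), int(match.group("secs")))] += 1
    return counts
```

Feeding it the lines of an OSD log file (for example an open file object)
returns a Counter keyed by thread pool name and timeout value.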

Is there any way to find out why the OSDs time out? I don't think the
heartbeat itself is the problem. I think an issue inside the OSD is what
leads to the heartbeat timeout, because the OSDs don't suicide; they just
get too slow and cause downtime on RBD and the S3 gateway because the
queue is full!

Thanks.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx