Tuning OSD heartbeat interval and grace period

Barton.Wensley@xxxxxxxxxxxxx (Wensley, Barton) · Wed, 24 Sep 2014 18:34:35 +0000

I am wondering if anyone has had experience tuning the following options to get faster failure detection of a storage node:
- osd heartbeat interval (default 6s)
- osd heartbeat grace (default 20s)
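
For reference, these two options can be set in ceph.conf along the following lines. This is only a sketch: the values are illustrative placeholders rather than tested recommendations, and placing them under [global] instead of [osd] is an assumption, made so that the monitors read the same grace value as the OSDs.

```ini
[global]
    # Illustrative values only, not tested recommendations.
    # Seconds between heartbeat pings to peer OSDs (default 6).
    osd heartbeat interval = 3
    # Seconds without a heartbeat before a peer OSD is reported as failed (default 20).
    osd heartbeat grace = 10
```

The daemons typically need a restart (or a runtime injection of the new values) before changes to ceph.conf take effect.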

I am working with a very small cluster:
- 2 storage nodes
- 1 to 6 OSDs per storage node
- replication factor of 2

In this configuration, losing a storage node (e.g. to a power failure) interrupts service to users of the cluster for 30 seconds or more, due to the length of the heartbeat interval and grace period. Why are the defaults for these options so high, and does anyone have experience tuning them to reduce the service interruption when a storage node fails? I know there is always a trade-off between detecting failures faster and falsely detecting failures, so I am mainly wondering how much room there is to reduce these settings.

Bart Wensley, Wind River