I am wondering if anyone has had experience tuning the following options to get faster failure detection of a storage node:

- osd heartbeat interval (default 6s)
- osd heartbeat grace (default 20s)

I am working with a very small cluster:

- 2 storage nodes
- 1 to 6 OSDs per storage node
- replication of 2

In this configuration, losing a storage node (e.g. power failure) results in an interruption to users of the cluster for 30 or more seconds, due to the length of the heartbeat interval and grace period.

I am just wondering why the defaults for these are so high and whether anyone has experience with tuning these to reduce the service interruption on storage node failure. I know there is always a trade-off between faster failure detection times and incorrectly detecting a failure; I am just wondering how much room there is to reduce these settings.

Bart Wensley, Wind River
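
For concreteness, the kind of change being asked about might look like the ceph.conf fragment below. This is only a sketch: the reduced values (3s and 10s) are illustrative assumptions, not tested or recommended settings, and placing them under [global] assumes that the grace value needs to be visible to both the monitors and the OSDs.

```ini
# Illustrative values only (assumptions, not recommendations).
[global]
# Seconds between heartbeat pings an OSD sends to its peers (default 6).
osd heartbeat interval = 3
# Seconds without a heartbeat reply before a peer OSD is reported down (default 20).
osd heartbeat grace = 10
```

Lower values should shorten the time before a failed node's OSDs are marked down, at the cost of a higher chance that a slow or briefly overloaded OSD is wrongly reported as failed.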