osd_heartbeat_grace set to 30 but OSDs still fail for grace > 20

greg@xxxxxxxxxxx (Gregory Farnum) · Mon, 25 Aug 2014 11:01:02 -0700

On Mon, Aug 25, 2014 at 10:56 AM, Bruce McFarland
<Bruce.McFarland at taec.toshiba.com> wrote:
> Thank you very much for the help.
>
> I'm moving osd_heartbeat_grace to the global section and trying to figure out what's going on between the OSDs. Since increasing osd_heartbeat_grace in the [mon] section of ceph.conf on the monitor, I still see failures, but now they come in at 2 seconds more than osd_heartbeat_grace. It seems that no matter how much I increase this value, the OSDs are reporting just outside of it.
>
> I've looked at netstat -s for all of the nodes and will go back and look at the network stats much more closely.
>
> Would it help to put the monitor on a 10G link to the storage nodes? Everything is set up, but we chose to leave the monitor on a 1G link to the storage nodes.

No. The OSDs are being marked down because they aren't heartbeating
their peer OSDs, and those peers are reporting the failures to the
monitor (whose connection is apparently working fine). The most likely
guess without more data is that you've got firewall rules set up
blocking the ports the OSDs are using to send their heartbeats...but it
could be many things in your network stack or your CPU scheduler or
whatever.
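
One way to rule the firewall in or out is to run a rough sketch like
the one below from one OSD host against another. It only tries a plain
TCP connect on each port and lists the ones that answer. The
6800-7300 range is an assumption (I believe it matches the default
ms_bind_port_min/ms_bind_port_max, but check your own ceph.conf), and
the function names are purely illustrative, not part of any Ceph tool.

```python
import socket
import sys


def check_port(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name lookup failed.
        return False


def reachable_ports(host, ports, timeout=2.0):
    """Return the subset of ports on host that accept a TCP connection."""
    return [p for p in ports if check_port(host, p, timeout)]


if __name__ == "__main__":
    # Assumed port range; adjust to whatever your OSDs actually bind to.
    port_range = range(6800, 7301)
    for osd_host in sys.argv[1:]:
        open_ports = reachable_ports(osd_host, port_range, timeout=0.5)
        print(osd_host, "reachable ports:", open_ports)
```

Compare the result with the ports the OSD daemons are listening on
(for example from netstat on the target host). A port that is
listening locally but not reachable from a peer points at filtering in
between. A successful connect only shows the port is reachable; it
does not prove that heartbeats are being delivered on time.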