On Mon, Aug 25, 2014 at 10:56 AM, Bruce McFarland
<Bruce.McFarland at taec.toshiba.com> wrote:
> Thank you very much for the help.
>
> I'm moving osd_heartbeat_grace to the global section and trying to figure
> out what's going on between the OSDs. Since increasing
> osd_heartbeat_grace in the [mon] section of ceph.conf on the monitor I
> still see failures, but now they are 2 seconds greater than
> osd_heartbeat_grace. It seems that no matter how much I increase this
> value, OSDs are reporting just outside of it.
>
> I've looked at netstat -s for all of the nodes and will go back and look
> at the network stats much more closely.
>
> Would it help to put the monitor on a 10G link to the storage nodes?
> Everything is set up, but we chose to leave the monitor on a 1G link to
> the storage nodes.

No. The OSDs are being marked down because they aren't heartbeating their
peer OSDs, and those peers are reporting the failures to the monitor (whose
connection is apparently working fine). The most likely guess without more
data is that you've got firewall rules set up blocking the ports the OSDs
are using to send their heartbeats... but it could be many things in your
network stack or your CPU scheduler or whatever.
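
One quick way to test the firewall guess is to try opening TCP connections from each OSD host to the addresses its peers are listening on. The script below is only a rough sketch, not an official Ceph tool. It assumes the heartbeat endpoints are plain TCP listeners and that you collect the host:port pairs yourself (for example from the OSD lines in the output of `ceph osd dump`). The function names `check_tcp` and `unreachable` are made up for this example.

```python
import socket
import sys


def check_tcp(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def unreachable(endpoints, timeout=2.0):
    """Return the (host, port) pairs that could not be connected to."""
    return [(h, p) for h, p in endpoints if not check_tcp(h, p, timeout)]


if __name__ == "__main__":
    # Endpoints are given on the command line as host:port.
    # They are supplied by you; nothing here is discovered from Ceph.
    endpoints = []
    for arg in sys.argv[1:]:
        host, port = arg.rsplit(":", 1)
        endpoints.append((host, int(port)))

    for host, port in unreachable(endpoints):
        print(f"cannot connect to {host}:{port}")
```

Run it from every storage node against the other nodes' OSD addresses, since the OSD-to-OSD path is the one that matters here, not the path to the monitor. A successful connect only shows the port is not filtered. It says nothing about latency or scheduling delays, so a clean result does not rule out the other causes mentioned above.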