After looking a little closer, now that I have a better understanding of osd_heartbeat_grace for the monitor, all of the OSD failures are coming from one node in the cluster. Yes, your hunch was correct - that node had stale rules in iptables. After disabling iptables the OSD "flapping" has stopped. Now I'm going to bring the osd_heartbeat_grace value back down incrementally and see if the cluster runs without reporting issues at the default. Thank you very much for your help.

I have some default pool questions concerning cluster bring-up:

I have 90 OSDs (one 4TB HDD per OSD, each with a 96GB journal on a partition of an SSD RAID 0), 30 OSDs per storage node. I have the default placement group settings in the [global] section of ceph.conf:

    osd_pool_default_pg_num = 4096
    osd_pool_default_pgp_num = 4096

When I bring up a cluster it starts with only the default pools 0 (data), 1 (metadata), and 2 (rbd), and I get warnings about too few PGs per OSD. Since each OSD wants between 20 and 32 PGs, as soon as I've brought up the first storage node I need a minimum of roughly 600 PGs, but the pools come up with the default of 64 PGs per pool. After creating each node's OSDs I increased the PG counts with ceph osd pool set <pool> pg_num and pgp_num for each of the default pools.

Do I need to increase all 3 pools? Is there a ceph.conf setting that handles this startup issue? What's the "best practices" way to handle bringing up more OSDs than the default pool PG settings can accommodate?

-----Original Message-----
From: Gregory Farnum [mailto:greg@xxxxxxxxxxx]
Sent: Monday, August 25, 2014 11:01 AM
To: Bruce McFarland
Cc: ceph-users at ceph.com
Subject: Re: osd_heartbeat_grace set to 30 but osd's still fail for grace > 20

On Mon, Aug 25, 2014 at 10:56 AM, Bruce McFarland <Bruce.McFarland at taec.toshiba.com> wrote:
> Thank you very much for the help.
>
> I'm moving osd_heartbeat_grace to the global section and trying to figure out what's going on between the OSDs. Since increasing osd_heartbeat_grace in the [mon] section of ceph.conf on the monitor I still see failures, but now they are 2 seconds > osd_heartbeat_grace. It seems that no matter how much I increase this value, OSDs are reported as failed just outside of it.
>
> I've looked at netstat -s for all of the nodes and will go back and look at the network stats much more closely.
>
> Would it help to put the monitor on a 10G link to the storage nodes? Everything is set up, but we chose to leave the monitor on a 1G link to the storage nodes.

No. They're being marked down because they aren't heartbeating the OSDs, and those OSDs are reporting the failures to the monitor (whose connection is apparently working fine). The most likely guess without more data is that you've got firewall rules set up blocking the ports the OSDs are using to send their heartbeats... but it could be many things in your network stack or your CPU scheduler or whatever.
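PS - for anyone else who hits the OSD heartbeat/iptables issue above: rather than leaving iptables disabled on the problem node, something like the following should let the Ceph traffic through. This is only a sketch - it assumes the default port ranges (6789/tcp for the monitor, 6800-7300/tcp for the OSD daemons and their heartbeats) and a RHEL/CentOS-style "service iptables save"; adjust for your release and distro.

    # Show the current rules with packet counters; a stale REJECT/DROP
    # covering the OSD port range will show up here.
    iptables -L -n -v

    # Explicitly allow the Ceph ports instead of turning the firewall off
    # (assumed defaults: 6789/tcp for the mon, 6800-7300/tcp for the OSDs).
    iptables -I INPUT -p tcp --dport 6789 -j ACCEPT
    iptables -I INPUT -p tcp --dport 6800:7300 -j ACCEPT

    # Persist the rules across reboots (RHEL/CentOS 6 style).
    service iptables save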
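And on the PG question: the per-pool bumps mentioned above would look roughly like this. A sketch only - the pool names are the old default pools (data, metadata, rbd), 2048 is just an illustrative target rather than a recommendation for this cluster, and pgp_num has to be raised to match pg_num for the rebalance to actually happen.

    # Raise placement group counts on each of the default pools.
    ceph osd pool set data pg_num 2048
    ceph osd pool set data pgp_num 2048
    ceph osd pool set metadata pg_num 2048
    ceph osd pool set metadata pgp_num 2048
    ceph osd pool set rbd pg_num 2048
    ceph osd pool set rbd pgp_num 2048

    # Confirm the new values.
    ceph osd dump | grep pg_num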