osd_heartbeat_grace set to 30 but osd's still fail for grace > 20

Bruce.McFarland@xxxxxxxxxxxxxxxx (Bruce McFarland) · Mon, 25 Aug 2014 17:56:13 +0000

Thank you very much for the help. 

I'm moving osd_heartbeat_grace to the global section and trying to figure out what's going on between the OSDs. Since increasing osd_heartbeat_grace in the [mon] section of ceph.conf on the monitor I still see failures, but now they come 2 seconds past osd_heartbeat_grace. It seems that no matter how much I increase this value, the OSDs report just outside of it.

I've looked at netstat -s for all of the nodes and will go back and look at the network stats much more closely.

Would it help to put the monitor on a 10G link to the storage nodes? Everything is set up, but we chose to leave the monitor on a 1G link to the storage nodes.

-----Original Message-----
From: Gregory Farnum [mailto:greg@xxxxxxxxxxx] 
Sent: Monday, August 25, 2014 10:50 AM
To: Bruce McFarland
Cc: ceph-users at ceph.com
Subject: Re: osd_heartbeat_grace set to 30 but osd's still fail for grace > 20

Each daemon only reads conf values from its section (or its daemon-type section, or the global section). You'll need to either duplicate the "osd heartbeat grace" value in the [mon] section or put it in the [global] section instead. This is one of the misleading values; sorry about that...
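
As a minimal sketch of the second option, assuming the rest of the ceph.conf quoted further down stays unchanged, the setting would move out of [osd] and into [global], where both the monitors and the OSDs read it. The value 35 is simply the one from that quoted file, not a recommendation.

```ini
[global]
# Read by every daemon type, so monitors and OSDs use the same grace period.
osd_heartbeat_grace = 35

[mon]
mon_osd_min_down_reporters = 2

[osd]
# osd_heartbeat_grace no longer needs to be set here.
```

After the monitor is restarted, the admin-socket "config show" query from the quoted message should report the new value instead of the default 20.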

Anyway, as Christian said in your other thread, this isn't your issue; the OSD heartbeat failures are your issue. You'll need to sort out whatever's going on there.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Aug 25, 2014 at 10:45 AM, Bruce McFarland <Bruce.McFarland at taec.toshiba.com> wrote:
> That's something that has been puzzling me. The monitor's ceph.conf is set to 35, but its runtime config reports 20. I've restarted it after initial creation to try to get it to reload the ceph.conf settings, but it stays at 20.
>
> [root@ceph-mon01 ceph]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-mon01.asok config show | grep osd_heartbeat_grace
>   "osd_heartbeat_grace": "20",
> [root@ceph-mon01 ceph]#
>
> [root@ceph-mon01 ceph]# cat ceph.conf
> [global]
> auth_service_required = cephx
> filestore_xattr_use_omap = true
> auth_client_required = cephx
> auth_cluster_required = cephx
> mon_host = 209.243.160.84
> mon_initial_members = ceph-mon01
> fsid = 94bbb882-42e4-4a6c-bfda-125790616fcc
>
> osd_pool_default_pg_num = 4096
> osd_pool_default_pgp_num = 4096
>
> osd_pool_default_size = 3  # Write an object 3 times - number of replicas.
> osd_pool_default_min_size = 1 # Allow writing one copy in a degraded state.
>
> [mon]
> mon_osd_min_down_reporters = 2
>
> [osd]
> debug_ms = 1
> debug_osd = 20
> public_network = 209.243.160.0/24
> cluster_network = 10.10.50.0/24
> osd_journal_size = 96000
> osd_heartbeat_grace = 35
>
> [osd.0]
> .
> .
> .
> -----Original Message-----
> From: Gregory Farnum [mailto:greg at inktank.com]
> Sent: Monday, August 25, 2014 10:39 AM
> To: Bruce McFarland
> Cc: ceph-users at ceph.com
> Subject: Re: [ceph-users] osd_heartbeat_grace set to 30 but osd's 
> still fail for grace > 20
>
> On Sat, Aug 23, 2014 at 11:06 PM, Bruce McFarland <Bruce.McFarland at taec.toshiba.com> wrote:
>> I see osd's being failed for heartbeat reporting > default 
>> osd_heartbeat_grace of 20 but the run time config shows that the 
>> grace is set to 30. Is there another variable for the osd or the mon 
>> I need to set for the non default osd_heartbeat_grace of 30 to take effect?
>
> You need to also set the osd heartbeat grace on the monitors. If I
> were to guess, the OSDs are actually seeing each other as slow (after
> 30 seconds) and reporting it in, but the monitors have a grace of 20
> seconds set so that's what they're using to generate output.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com