Re: Round-trip time for monitors

Greg, Wido.
Thank you to both of you.

Every DC has a local NTP server, and the NTP config on all servers designates the local NTP server as preferred, with the remote one as a fallback.

Thank you for the pointer to the 'mon clock drift allowed' setting. The default is 50ms.
Funnily enough, the one-second leap which occurred last night caused a "health HEALTH_WARN clock skew detected on mon.<remote>" warning to pop up, with messages every 5 minutes in the log:
------------
2015-07-01 08:52:05.109124 7f6ed8a37700 0 log [WRN] : mon.2 <remote>:6789/0 clock skew 0.984032s > max 0.05s
------------
Something weird in our ntp service... I restarted ntpd and ceph on the remote node, and the warning cleared.


Then I played with the mon_clock_drift_allowed setting, and set it to a very low value (cluster not yet in production!) on every monitor:
[root@<host> ~]# ceph tell mon.$(hostname -s) injectargs '--mon_clock_drift_allowed 0.002'
injectargs:mon_clock_drift_allowed = '0.002'
[root@<host> ~]#
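Note that injectargs only changes the value at runtime; it is lost when the monitor restarts. To make such a setting persistent it would also go into ceph.conf (value in seconds, so the 50ms default is 0.05):

```
[mon]
    mon clock drift allowed = 0.05
```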

I increased verbosity too:
[root@<local0> ~]# ceph daemon mon.$(hostname -s) config show | egrep 'drift|"debug_[mp][^d]'
"debug_ms": "10\/5",
"debug_mon": "10\/5",
"debug_monc": "0\/10",
"debug_paxos": "10\/5",
"debug_perfcounter": "1\/5",
"mon_clock_drift_allowed": "0.002",
"mon_clock_drift_warn_backoff": "5",
"paxos_max_join_drift": "10",
[root@<local0> ~]#

ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)


I checked the log on the leader monitor, and bingo:
----------------------------------------------------
2015-07-01 11:00:55.086047 7f6ed9238700 10 mon.<local0>@0(leader) e4 timecheck start timecheck epoch 516 round 41
2015-07-01 11:00:55.086055 7f6ed9238700 10 mon.<local0>@0(leader) e4 timecheck send time_check( ping e 516 r 41 ) v1 to mon.1 <local1>:6789/0
2015-07-01 11:00:55.086067 7f6ed9238700 1 -- <local0>:6789/0 --> mon.1 <local1>:6789/0 -- time_check( ping e 516 r 41 ) v1 -- ?+0 0x4f6d340
2015-07-01 11:00:55.086083 7f6ed9238700 10 mon.<local0>@0(leader) e4 timecheck send time_check( ping e 516 r 41 ) v1 to mon.2 <remote>:6789/0
2015-07-01 11:00:55.086090 7f6ed9238700 1 -- <local0>:6789/0 --> mon.2 <remote>:6789/0 -- time_check( ping e 516 r 41 ) v1 -- ?+0 0x4f6e540

2015-07-01 11:00:55.087036 7f6ed8a37700 1 -- <local0>:6789/0 <== mon.1 <local1>:6789/0 787010 ==== time_check( pong e 516 r 41 ts 2015-07-01 11:00:55.086662 ) v1 ==== 36+0+0 (1646653918 0 0) 0x4f6d340 con 0x22d18c0

2015-07-01 11:00:55.087060 7f6ed8a37700 10 mon.<local0>@0(leader) e4 handle_timecheck_leader from mon.1 <local1>:6789/0 ts 2015-07-01 11:00:55.086662 delta -0.000393867 skew_bound 0 latency 0.001003

2015-07-01 11:00:55.096984 7f6ed8a37700 1 -- <local0>:6789/0 <== mon.2 <remote>:6789/0 7687 ==== time_check( pong e 516 r 41 ts 2015-07-01 11:00:55.091233 ) v1 ==== 36+0+0 (3185525643 0 0) 0x4f6e540 con 0x22d1600

2015-07-01 11:00:55.097028 7f6ed8a37700 10 mon.<local0>@0(leader) e4 handle_timecheck_leader from mon.2 <remote>:6789/0 ts 2015-07-01 11:00:55.091233 delta -0.00579095 skew_bound 0 latency 0.010942

2015-07-01 11:00:55.097051 7f6ed8a37700 10 mon.<local0>@0(leader) e4 handle_timecheck_leader got pongs from everybody (3 total)
-------------------------------------------------
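If I read the log right, the leader only counts the part of the observed delta that the measured latency cannot explain: for mon.2 the delta is ~5.8ms, but since the round trip took ~10.9ms the skew_bound stays 0. Here is a small Python sketch of that logic as I understand it (my reading of the monitor's timecheck handling, not the actual Ceph code; the helper name is mine), checked against the two log lines above:

```python
def skew_bound(delta, latency):
    """Estimate clock skew from a timecheck pong, discounting network latency.

    delta   -- peer's reported timestamp minus the leader's clock at receipt (s)
    latency -- measured round-trip time for the ping/pong (s)

    Any part of |delta| that latency alone could explain is not counted as
    skew. This mirrors my reading of the leader's timecheck logic; it is an
    illustration, not verbatim Ceph code.
    """
    abs_delta = abs(delta)
    if abs_delta <= latency:
        return 0.0                    # delta fully explained by the network
    bound = abs_delta - latency       # residual we cannot blame on latency
    return bound if delta > 0 else -bound

# Values taken from the leader's log above:
print(skew_bound(-0.000393867, 0.001003))  # mon.1 -> 0.0
print(skew_bound(-0.00579095, 0.010942))   # mon.2 -> 0.0

# A genuine 1s offset with the same ~11ms RTT would still be flagged,
# since the residual is far beyond mon_clock_drift_allowed (0.05s default):
print(skew_bound(-0.984032, 0.010942))
```

This is consistent with the observation below that latency well above the allowed skew does not trigger a warning by itself.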
The check runs every 10 minutes.
It has been running for 3 hours now, and the cluster has stayed in state 'HEALTH_OK'.

My conclusion is that Ceph is clever enough to account for skew and latency separately, and that it can tolerate latencies much higher than the allowed skew.



So I think that a 30ms RTT is acceptable. Unless someone disagrees...?


Wido, I take your point about the leader monitor possibly switching to the remote site.
I read these pages: https://ceph.com/community/monitors-and-paxos-a-chat-with-joao (last paragraph), and http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/#understanding-mon-status
I understand that the monitor with the lowest IP:port is likely to become the leader, as is currently the case in my cluster ('ceph mon_status --format json-pretty'), and fortunately the leader is on the local site. So this should be fairly stable.
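In other words, the monitors rank by their (IP, port) address in the monmap, and the lowest-ranked live monitor should win the election. A toy illustration of that ordering (hypothetical addresses and helper name, not Ceph code):

```python
import ipaddress

# Hypothetical monmap: mon name -> (IP, port). Addresses are made up.
mons = {
    "local0": ("10.0.1.10", 6789),
    "local1": ("10.0.1.11", 6789),
    "remote": ("10.0.2.10", 6789),
}

def expected_leader(mons):
    """Predict the leader: the monitor that sorts first by (IP, port).

    This reflects my understanding of monitor ranking; illustration only.
    """
    return min(mons, key=lambda m: (ipaddress.ip_address(mons[m][0]),
                                    mons[m][1]))

print(expected_leader(mons))  # -> local0
```

As long as the local monitors keep the numerically lowest addresses, the leader should stay on the local site.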

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
