Re: Mon time to form quorum

ceph-devel may be a better place for this... This looks related to the
recent change that allows the mon to bind to a different address from the
advertised address. Notice that the config below has different addresses
for "public addr" and "cluster addr". Could this be causing paxos to take
some time to settle?

Another datapoint is that if there is only one mon (instead of three), the
quorum only takes about 10 seconds to establish instead of 60s.


____________________________________________________
From: Travis Nielsen <travis.nielsen@xxxxxxxxxxx>
Date: Tuesday, August 8, 2017 at 10:49 AM
To: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
Subject: Mon time to form quorum



At cluster creation I'm seeing that the mons are taking a while to
form quorum. It seems like I'm hitting a timeout of 60s somewhere. Am I
missing a config setting that would help paxos establish quorum sooner?
When initializing with the monmap I would have expected the mons to
initialize very quickly.

The scenario is:

* Luminous RC 2
* The mons are initialized with a monmap
* Running in Kubernetes (Rook)

The symptoms are:

* When all three mons start in parallel, they appear to determine their
rank immediately. I assume this means they establish communication. A log
message is seen such as this in each of the mon logs:
  - 2017-08-08 17:03:16.383599 7f8da7c85f40  0
mon.rook-ceph-mon1@-1(probing) e0  my rank is now 0 (was -1)

* Now paxos enters a loop that times out every two seconds and lasts about
60s, trying to probe the other monitors. During this wait, I am able to
curl the mon endpoints successfully.
  - 2017-08-08 17:03:17.345877 7f02b779af40 10
mon.rook-ceph-mon0@1(probing) e0 probing other monitors
  - 2017-08-08 17:03:19.346032 7f02ae568700  4
mon.rook-ceph-mon0@1(probing) e0 probe_timeout 0x55c93678bb00

* After about 60 seconds the probe succeeds and the mons start responding
  - 2017-08-08 17:04:17.356928 7f02ae568700 10
mon.rook-ceph-mon0@1(probing) e0 probing other monitors
  - 2017-08-08 17:04:17.366587 7f02a855c700 10
mon.rook-ceph-mon0@1(probing) e0 ms_verify_authorizer 10.0.0.254:6790/0
mon protocol 2
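
To put a number on the probing window, a small script can pull the timestamps out of a mon log. This is only a rough sketch: it assumes each log entry sits on a single line (the excerpts above are wrapped by the mail client), and the names SAMPLE_LOG, parse_ts and probing_duration are made up for illustration.

```python
from datetime import datetime

# Log entries copied from the excerpts above, unwrapped to one line each.
SAMPLE_LOG = """\
2017-08-08 17:03:17.345877 7f02b779af40 10 mon.rook-ceph-mon0@1(probing) e0 probing other monitors
2017-08-08 17:03:19.346032 7f02ae568700  4 mon.rook-ceph-mon0@1(probing) e0 probe_timeout 0x55c93678bb00
2017-08-08 17:04:17.356928 7f02ae568700 10 mon.rook-ceph-mon0@1(probing) e0 probing other monitors
2017-08-08 17:04:17.366587 7f02a855c700 10 mon.rook-ceph-mon0@1(probing) e0 ms_verify_authorizer 10.0.0.254:6790/0 mon protocol 2
"""


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.ffffff' timestamp (26 characters)."""
    return datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f")


def probing_duration(log_text):
    """Seconds from the first 'probing other monitors' entry to the first
    'ms_verify_authorizer' entry that follows it, or None if either is missing."""
    first_probe = None
    for line in log_text.splitlines():
        if first_probe is None and "probing other monitors" in line:
            first_probe = parse_ts(line)
        elif first_probe is not None and "ms_verify_authorizer" in line:
            return (parse_ts(line) - first_probe).total_seconds()
    return None


if __name__ == "__main__":
    print(probing_duration(SAMPLE_LOG))
```

On the four entries quoted above this gives roughly 60 seconds, matching the delay described. Running it against the full log of each mon would show whether all three wait the same amount of time.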


The relevant settings in the config are:
mon initial members  = rook-ceph-mon0 rook-ceph-mon1 rook-ceph-mon2
mon host                      = 10.0.0.24:6790,10.0.0.163:6790,10.0.0.139:6790
public addr                   = 10.0.0.24
cluster addr                  = 172.17.0.5
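
For comparing these settings across the three mons, a quick check like the following may help. It is a sketch only: it handles plain "key = value" lines with IPv4 addresses, the helper names are hypothetical, and it just reports how the addresses relate to each other without saying anything about whether that is the cause of the delay.

```python
# Settings copied from the config excerpt above, with "mon host" on one line.
SAMPLE_SETTINGS = """\
mon initial members  = rook-ceph-mon0 rook-ceph-mon1 rook-ceph-mon2
mon host             = 10.0.0.24:6790,10.0.0.163:6790,10.0.0.139:6790
public addr          = 10.0.0.24
cluster addr         = 172.17.0.5
"""


def parse_settings(text):
    """Turn 'key = value' lines into a dict, normalising spaces in the key."""
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[" ".join(key.split())] = value.strip()
    return settings


def address_report(settings):
    """Report how 'public addr' and 'cluster addr' relate to the 'mon host' list."""
    # Strip the ':port' suffix from each mon host entry (IPv4 only).
    mon_hosts = [
        entry.strip().rsplit(":", 1)[0]
        for entry in settings.get("mon host", "").split(",")
        if entry.strip()
    ]
    public = settings.get("public addr")
    cluster = settings.get("cluster addr")
    return {
        "public_in_mon_host": public in mon_hosts,
        "cluster_in_mon_host": cluster in mon_hosts,
        "public_equals_cluster": public == cluster,
    }


if __name__ == "__main__":
    print(address_report(parse_settings(SAMPLE_SETTINGS)))
```

For the settings shown, the public address is one of the mon host entries while the cluster address is not, which is the difference pointed out at the top of this message.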

The full log for this mon at debug log level 20 can be found here:
https://gist.github.com/travisn/2c2641a6b80a7479b3b22accb41a5193

Any ideas?

Thanks,
Travis
