On 02/03/2017 09:16 AM, 许雪寒 wrote:
Hi, everyone.
Recently, when I was doing some stress testing, one of the monitors in my Ceph cluster was marked down, all the monitors repeatedly called new elections, and I/O could not complete. There were three monitors in my cluster: rg3-ceph36, rg3-ceph40 and rg3-ceph45. It was rg3-ceph40 that was always marked down, although it was still running.
Here is rg3-ceph40’s monitor log with debug_mon and debug_paxos set to 20/0.
The log provides no clue about the behaviour you're seeing, partly
because it only shows an election finishing, without showing what's
causing the elections in the first place. To ascertain what may be
going wrong we'd need a bit more context.
However, given your description, I could theorize that one of a few
things may be happening:
(presuming rg3-ceph40 is the expected leader, mon.0, given the others
are referred to as mon.1 and mon.2)
1) rg3-ceph40 is overloaded and may not be handling paxos lease acks in
time, thus assuming the other monitors are dead and triggering an election;
2) rg3-ceph40 is overloaded and may not be sending out paxos leases in
time, thus the others assume rg3-ceph40 is dead and trigger an election;
3) rg3-ceph40's connection to the other monitors may be faulty and
paxos leases are being lost on the wire, thus 1) or 2);
4) rg3-ceph40 may be getting stuck reading from or writing to its
store (typically a symptom of an oversized store.db; a quick check is
sketched after this list), and thus 1) or 2);
5) rg3-ceph40 or some other monitor may have a clock skew, resulting
in paxos leases being marked as in the past at the time of
sending/receiving; this could lead to a lease being presumed expired
and, given no new lease has been received, to presuming the leader is
dead, triggering an election.
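For 4) and 5) in particular, a couple of quick checks. This is only a
sketch: it assumes the default monitor data path and that the mon id
matches the hostname, so adjust it to your actual setup:

  # size of the monitor's store; a store.db of several GB usually
  # means slow reads/writes on the mon
  du -sh /var/lib/ceph/mon/ceph-rg3-ceph40/store.db

  # the cluster flags clock skew between monitors in its health output
  ceph health detail | grep -i skew

  # compare each monitor host's clock against its NTP peers
  ntpq -p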
These are the most common causes of the behaviour you described. You
may want to set 'debug ms = 1' to include message debug output in the
mix.
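Something along these lines should do it; rg3-ceph40 is used here only
as an example, so adjust the mon id (or apply it to all monitors) as
needed:

  # at runtime, on a running monitor
  ceph tell mon.rg3-ceph40 injectargs '--debug_ms 1'

  # or persistently, in ceph.conf on the monitor hosts
  [mon]
      debug ms = 1
      debug mon = 20
      debug paxos = 20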
In any case, you are running a *really* old hammer version, and many
fixes have been added since then. You should try upgrading to a more
recent version and see if your issues go away.
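If you want to confirm what each monitor is actually running before
and after upgrading, you can ask it through its admin socket (this
assumes the default admin socket location under /var/run/ceph):

  # run on each monitor host
  ceph daemon mon.rg3-ceph40 version

  # or just check the installed binaries
  ceph --version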
-Joao