Re: Monitor repeatedly calling new election

On 02/03/2017 09:16 AM, 许雪寒 wrote:
Hi, everyone.

Recently, while I was running a stress test, one of the monitors in my Ceph cluster was marked down, all the monitors repeatedly called new elections, and I/O could not be completed. There were three monitors in my cluster: rg3-ceph36, rg3-ceph40 and rg3-ceph45. It was always rg3-ceph40 that was marked down, even though it was actually running.
Here is rg3-ceph40’s monitor log with debug_mon and debug_paxos set to 20/0.

The log provides no clue to the behaviour you're seeing, partly because it only shows an election finishing, without showing what's causing the elections in the first place. To ascertain what may be going wrong we'd need a bit more context.

However, given your description, I could theorize that one of a few things may be happening:

(presuming rg3-ceph40 is the expected leader, mon.0, given the others are referred to as mon.1 and mon.2)

1) rg3-ceph40 is overloaded and may not be handling paxos lease acks in time, thus assuming the other monitors are dead and triggering an election;

2) rg3-ceph40 is overloaded and may not be sending out paxos leases in time, thus the others assume rg3-ceph40 is dead and trigger an election;

3) rg3-ceph40's connection to the other monitors may be faulty and paxos leases are being lost on the wire, leading to 1) or 2).

4) rg3-ceph40 may be getting stuck reading from or writing to its store (typically a symptom of an oversized store.db), again leading to 1) or 2).

5) rg3-ceph40 or some other monitor may have a clock skew, resulting in paxos leases being marked as already in the past at the time of sending/receiving; this could lead to a presumed expired lease and - given no new lease has been received - to presuming the leader is dead, triggering an election.
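If it helps narrow things down, a few quick checks for the above. The paths and host names below are assumptions (a default cluster name of "ceph" and the monitor id matching the hostname); adjust them to your deployment:

```shell
# 5) clock skew: the cluster flags skews above mon_clock_drift_allowed
ceph health detail | grep -i "clock skew"
# and compare each monitor host against its NTP peers
ntpq -p

# 4) oversized store.db: check the monitor's store size on each host
du -sh /var/lib/ceph/mon/ceph-rg3-ceph40/store.db

# 1)/2) overload: watch load and iowait on the monitor host while
# the elections are recurring
uptime
iostat -x 1 5
```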

These are the most common causes of the behaviour you described. You may want to set 'debug ms = 1' to include message debug output in the mix.
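For reference, that can be injected at runtime per monitor, or set persistently in ceph.conf; the commands below assume admin access to the monitors:

```shell
# inject at runtime into each monitor (no restart needed):
ceph tell mon.rg3-ceph40 injectargs '--debug-ms 1'

# or persist it in ceph.conf under [mon] and restart the monitors:
#   [mon]
#       debug ms = 1
```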

In any case, you are running a *really* old hammer version, and since then many fixes were added. You should try upgrading to a more recent version and see if your issues go away.

  -Joao
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com