Re: URGENT: add mon failed and ceph monitor refresh log crazily

The ceph version is 0.80.4.

When adding mon.f to {b,c,d,e}, mon.e drops out of quorum, and
mon.b, mon.c, mon.d are electing in a cycle (a new election restarts right after the leader wins).
So I think the current 4 monitors can exchange messages with each other successfully.
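
For what it's worth, quorum and per-monitor election state can be checked with something like the following (0.80.x commands; the default admin socket path is assumed here):

  $ ceph quorum_status --format json-pretty
  $ ceph --admin-daemon /var/run/ceph/ceph-mon.b.asok mon_status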

In addition, mon.f is stuck in the synchronizing state, getting data from mon.e after probing.
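
The stuck sync is visible on mon.f's admin socket (default path assumed):

  $ ceph --admin-daemon /var/run/ceph/ceph-mon.f.asok mon_status
  # "state" stays at "synchronizing" and never moves on to electing/peon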

When I stop mon.f, mon.e goes back into quorum after a while, and then the ceph cluster becomes HEALTH_OK.
But the mon.b, mon.c, mon.d, and mon.e logs all keep printing paxos active or updating messages many times per second,
and the paxos commit seq increases rapidly, while the same situation does not occur in a ceph-0.80.7 cluster.
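
The commit growth can be watched by pulling the committed range out of a mon log, e.g.:

  $ grep -o 'paxos active c [0-9]*\.\.[0-9]*' /var/log/ceph/ceph-mon.b.log | tail -n 5
  # the upper bound (last_committed) keeps climbing many times per second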

If you are still confused, maybe I should reproduce this in our cluster and get complete mon logs ...
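
For the reproduction I would set something like this in ceph.conf on the monitors (the levels usually requested for mon/paxos issues):

  [mon]
      debug mon = 20
      debug paxos = 20
      debug ms = 1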

------------------ Original ------------------
From:  "sweil";<sweil@xxxxxxxxxx>;
Date:  Fri, Feb 13, 2015 10:28 PM
To:  "minchen"<minchen@xxxxxxxxxxxxxxx>;
Cc:  "ceph-users"<ceph-users@xxxxxxxxxxxxxx>; "joao"<joao@xxxxxxxxxx>;
Subject: Re: [ceph-users] URGENT: add mon failed and ceph monitor refresh log crazily

What version is this?

It's hard to tell from the logs below, but it looks like there might be a
connectivity problem?  Is it able to exchange messages with the other
monitors?
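
(A quick check is whether the mon port is reachable in both directions between the new mon and each existing one, e.g. something like:

  $ nc -zv 10.122.48.11 6789

run from each existing mon host, and the reverse from the mon.f host.)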

Perhaps more importantly, though, if you simply stop the new mon.f, can
mon.e join?  What is in its log?
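
For example (assuming Upstart and the default admin socket path; adjust for your init system):

  $ stop ceph-mon id=f
  $ ceph --admin-daemon /var/run/ceph/ceph-mon.e.asok config set debug_mon 10
  $ tail -f /var/log/ceph/ceph-mon.e.log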

sage


On Fri, 13 Feb 2015, minchen wrote:

> Hi all developers and users,
> When I add a new mon to the current mon cluster, it fails with 2 mons out of quorum.
>
>
> there are 5 mons in our ceph cluster:
> epoch 7
> fsid 0dfd2bd5-1896-4712-916b-ec02dcc7b049
> last_changed 2015-02-13 09:11:45.758839
> created 0.000000
> 0: 10.117.16.17:6789/0 mon.b
> 1: 10.118.32.7:6789/0 mon.c
> 2: 10.119.16.11:6789/0 mon.d
> 3: 10.122.0.9:6789/0 mon.e
> 4: 10.122.48.11:6789/0 mon.f
>
>
> mon.f was newly added to the monitor cluster, but starting mon.f
> caused both mon.e and mon.f to fall out of quorum:
> HEALTH_WARN 2 mons down, quorum 0,1,2 b,c,d
> mon.e (rank 3) addr 10.122.0.9:6789/0 is down (out of quorum)
> mon.f (rank 4) addr 10.122.48.11:6789/0 is down (out of quorum)
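>
> For reference, the usual sequence on 0.80.x for adding a new monitor is roughly the following (paths illustrative):
>
>   $ ceph mon getmap -o /tmp/monmap
>   $ ceph auth get mon. -o /tmp/mon.keyring
>   $ ceph-mon -i f --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
>   $ ceph-mon -i f --public-addr 10.122.48.11:6789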
>
>
> The mon.b, mon.c, and mon.d logs refresh crazily, as follows:
> Feb 13 09:37:34 root ceph-mon: 2015-02-13 09:37:34.063628 7f7b64e14700  1 mon.b@0(leader).paxos(paxos active c 11818589..11819234) is_readable now=2015-02-13 09:37:34.063629 lease_expire=2015-02-13 09:37:38.205219 has v0 lc 11819234
> Feb 13 09:37:34 root ceph-mon: 2015-02-13 09:37:34.090647 7f7b64e14700  1 mon.b@0(leader).paxos(paxos active c 11818589..11819234) is_readable now=2015-02-13 09:37:34.090648 lease_expire=2015-02-13 09:37:38.205219 has v0 lc 11819234
> Feb 13 09:37:34 root ceph-mon: 2015-02-13 09:37:34.090661 7f7b64e14700  1 mon.b@0(leader).paxos(paxos active c 11818589..11819234) is_readable now=2015-02-13 09:37:34.090662 lease_expire=2015-02-13 09:37:38.205219 has v0 lc 11819234
> ......
>
>
> and the mon.f log:
>
>
> Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.526676 7f3931dfd7c0  0 ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f), process ceph-mon, pid 30639
> Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.607412 7f3931dfd7c0  0 mon.f does not exist in monmap, will attempt to join an existing cluster
> Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.609838 7f3931dfd7c0  0 starting mon.f rank -1 at 10.122.48.11:6789/0 mon_data /osd/ceph/mon fsid 0dfd2bd5-1896-4712-916b-ec02dcc7b049
> Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.610076 7f3931dfd7c0  1 mon.f@-1(probing) e0 preinit fsid 0dfd2bd5-1896-4712-916b-ec02dcc7b049
> Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.636499 7f392a504700  0 -- 10.122.48.11:6789/0 >> 10.119.16.11:6789/0 pipe(0x7f3934ebfb80 sd=26 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934ea9ce0).accept connect_seq 0 vs existing 0 state wait
> Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.636797 7f392a201700  0 -- 10.122.48.11:6789/0 >> 10.122.0.9:6789/0 pipe(0x7f3934ec0800 sd=29 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934eaa940).accept connect_seq 0 vs existing 0 state wait
> Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.636968 7f392a403700  0 -- 10.122.48.11:6789/0 >> 10.118.32.7:6789/0 pipe(0x7f3934ec0080 sd=27 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934ea9e40).accept connect_seq 0 vs existing 0 state wait
> Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.637037 7f392a302700  0 -- 10.122.48.11:6789/0 >> 10.117.16.17:6789/0 pipe(0x7f3934ebfe00 sd=28 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934eaa260).accept connect_seq 0 vs existing 0 state wait
> Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.638854 7f392c00a700  0 mon.f@-1(probing) e7  my rank is now 4 (was -1)
> Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.639365 7f392c00a700  1 mon.f@4(synchronizing) e7 sync_obtain_latest_monmap
> Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.639494 7f392b008700  0 -- 10.122.48.11:6789/0 >> 10.122.0.9:6789/0 pipe(0x7f3934ec0580 sd=17 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934eaa680).accept connect_seq 2 vs existing 0 state connecting
> Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.639513 7f392b008700  0 -- 10.122.48.11:6789/0 >> 10.122.0.9:6789/0 pipe(0x7f3934ec0580 sd=17 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934eaa680).accept we reset (peer sent cseq 2, 0x7f3934ebf400.cseq = 0), sending RESETSESSION
> ......
> Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.643159 7f392af07700  0 -- 10.122.48.11:6789/0 >> 10.119.16.11:6789/0 pipe(0x7f3934ec1700 sd=28 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934eab2e0).accept connect_seq 0 vs existing 0 state wait
> Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.643273 7f392c00a700  1 mon.f@4(synchronizing) e7 sync_obtain_latest_monmap obtained monmap e7
> Feb 13 09:17:26 root ceph-mon: 2015-02-13 09:17:26.611550 7f392c80b700  0 mon.f@4(synchronizing).data_health(0) update_stats avail 99% total 911815680 used 33132 avail 911782548
> Feb 13 09:17:26 root ceph-mon: 2015-02-13 09:17:26.708961 7f392c00a700  1 mon.f@4(synchronizing) e7 sync_obtain_latest_monmap
> Feb 13 09:17:26 root ceph-mon: 2015-02-13 09:17:26.709063 7f392c00a700  1 mon.f@4(synchronizing) e7 sync_obtain_latest_monmap obtained monmap e7
>
> Can someone help? Thank you!
>
>
> minchen
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
