It sounds a bit like the extra load on mon.e from the synchronization is
preventing it from joining the quorum? If you stop and restart mon.f it
should pick a different mon to pull from, though. Perhaps see if that
makes a different mon drop out? Then at least we'd understand what is
going on... (A sketch of this stop/restart test follows below the quoted
thread.)

sage

On Fri, 13 Feb 2015, minchen wrote:
>
> ceph version is 0.80.4
>
> When I add mon.f to {b,c,d,e}, mon.e is out of quorum, and mon.b, mon.c,
> mon.d are electing in a cycle (a new election restarts after the leader
> wins). So I think the current 4 monitors can exchange messages with each
> other successfully.
>
> In addition, mon.f is stuck in the synchronizing state, getting data from
> mon.e after probing.
>
> When I stop mon.f, mon.e comes back into quorum after a while, and the
> ceph cluster becomes HEALTH_OK.
> But the mon.b, mon.c, mon.d and mon.e logs all refresh with paxos active
> or updating messages many times per second, and the paxos commit seq
> increases rapidly, while the same situation does not occur in a
> ceph-0.80.7 cluster.
>
> If you are still confused, maybe I should reproduce this in our cluster
> and get complete mon logs (see the logging sketch below the quoted
> thread) ...
>
> ------------------ Original ------------------
> From: "sweil" <sweil@xxxxxxxxxx>
> Date: Fri, Feb 13, 2015 10:28 PM
> To: "minchen" <minchen@xxxxxxxxxxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>; "joao" <joao@xxxxxxxxxx>
> Subject: Re: URGENT: add mon failed and ceph monitor refreshes log crazily
>
> What version is this?
>
> It's hard to tell from the logs below, but it looks like there might be a
> connectivity problem? Is it able to exchange messages with the other
> monitors?
>
> Perhaps more importantly, though, if you simply stop the new mon.f, can
> mon.e join? What is in its log?
>
> sage
>
>
> On Fri, 13 Feb 2015, minchen wrote:
>
> > Hi, all developers and users,
> > When I add a new mon to the current mon cluster, it fails with 2 mons
> > out of quorum.
> >
> > There are 5 mons in our ceph cluster:
> > epoch 7
> > fsid 0dfd2bd5-1896-4712-916b-ec02dcc7b049
> > last_changed 2015-02-13 09:11:45.758839
> > created 0.000000
> > 0: 10.117.16.17:6789/0 mon.b
> > 1: 10.118.32.7:6789/0 mon.c
> > 2: 10.119.16.11:6789/0 mon.d
> > 3: 10.122.0.9:6789/0 mon.e
> > 4: 10.122.48.11:6789/0 mon.f
> >
> > mon.f is newly added to the monitor cluster, but starting mon.f caused
> > both mon.e and mon.f to fall out of quorum:
> > HEALTH_WARN 2 mons down, quorum 0,1,2 b,c,d
> > mon.e (rank 3) addr 10.122.0.9:6789/0 is down (out of quorum)
> > mon.f (rank 4) addr 10.122.48.11:6789/0 is down (out of quorum)
> >
> > The mon.b, mon.c, and mon.d logs refresh crazily, as follows:
> > Feb 13 09:37:34 root ceph-mon: 2015-02-13 09:37:34.063628 7f7b64e14700  1 mon.b@0(leader).paxos(paxos active c 11818589..11819234) is_readable now=2015-02-13 09:37:34.063629 lease_expire=2015-02-13 09:37:38.205219 has v0 lc 11819234
> > Feb 13 09:37:34 root ceph-mon: 2015-02-13 09:37:34.090647 7f7b64e14700  1 mon.b@0(leader).paxos(paxos active c 11818589..11819234) is_readable now=2015-02-13 09:37:34.090648 lease_expire=2015-02-13 09:37:38.205219 has v0 lc 11819234
> > Feb 13 09:37:34 root ceph-mon: 2015-02-13 09:37:34.090661 7f7b64e14700  1 mon.b@0(leader).paxos(paxos active c 11818589..11819234) is_readable now=2015-02-13 09:37:34.090662 lease_expire=2015-02-13 09:37:38.205219 has v0 lc 11819234
> > ......
> >
> > and mon.f log:
> >
> > Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.526676 7f3931dfd7c0  0 ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f), process ceph-mon, pid 30639
> > Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.607412 7f3931dfd7c0  0 mon.f does not exist in monmap, will attempt to join an existing cluster
> > Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.609838 7f3931dfd7c0  0 starting mon.f rank -1 at 10.122.48.11:6789/0 mon_data /osd/ceph/mon fsid 0dfd2bd5-1896-4712-916b-ec02dcc7b049
> > Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.610076 7f3931dfd7c0  1 mon.f@-1(probing) e0 preinit fsid 0dfd2bd5-1896-4712-916b-ec02dcc7b049
> > Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.636499 7f392a504700  0 -- 10.122.48.11:6789/0 >> 10.119.16.11:6789/0 pipe(0x7f3934ebfb80 sd=26 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934ea9ce0).accept connect_seq 0 vs existing 0 state wait
> > Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.636797 7f392a201700  0 -- 10.122.48.11:6789/0 >> 10.122.0.9:6789/0 pipe(0x7f3934ec0800 sd=29 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934eaa940).accept connect_seq 0 vs existing 0 state wait
> > Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.636968 7f392a403700  0 -- 10.122.48.11:6789/0 >> 10.118.32.7:6789/0 pipe(0x7f3934ec0080 sd=27 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934ea9e40).accept connect_seq 0 vs existing 0 state wait
> > Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.637037 7f392a302700  0 -- 10.122.48.11:6789/0 >> 10.117.16.17:6789/0 pipe(0x7f3934ebfe00 sd=28 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934eaa260).accept connect_seq 0 vs existing 0 state wait
> > Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.638854 7f392c00a700  0 mon.f@-1(probing) e7 my rank is now 4 (was -1)
> > Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.639365 7f392c00a700  1 mon.f@4(synchronizing) e7 sync_obtain_latest_monmap
> > Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.639494 7f392b008700  0 -- 10.122.48.11:6789/0 >> 10.122.0.9:6789/0 pipe(0x7f3934ec0580 sd=17 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934eaa680).accept connect_seq 2 vs existing 0 state connecting
> > Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.639513 7f392b008700  0 -- 10.122.48.11:6789/0 >> 10.122.0.9:6789/0 pipe(0x7f3934ec0580 sd=17 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934eaa680).accept we reset (peer sent cseq 2, 0x7f3934ebf400.cseq = 0), sending RESETSESSION
> > ......
> > Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.643159 7f392af07700  0 -- 10.122.48.11:6789/0 >> 10.119.16.11:6789/0 pipe(0x7f3934ec1700 sd=28 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934eab2e0).accept connect_seq 0 vs existing 0 state wait
> > Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.643273 7f392c00a700  1 mon.f@4(synchronizing) e7 sync_obtain_latest_monmap obtained monmap e7
> > Feb 13 09:17:26 root ceph-mon: 2015-02-13 09:17:26.611550 7f392c80b700  0 mon.f@4(synchronizing).data_health(0) update_stats avail 99% total 911815680 used 33132 avail 911782548
> > Feb 13 09:17:26 root ceph-mon: 2015-02-13 09:17:26.708961 7f392c00a700  1 mon.f@4(synchronizing) e7 sync_obtain_latest_monmap
> > Feb 13 09:17:26 root ceph-mon: 2015-02-13 09:17:26.709063 7f392c00a700  1 mon.f@4(synchronizing) e7 sync_obtain_latest_monmap obtained monmap e7
> >
> > Can someone help? Thank you!
> >
> > minchen
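
For anyone following along, here is a minimal sketch of the stop/restart
test suggested at the top of the thread. It assumes a Firefly-era (0.80.x)
sysvinit deployment with default admin-socket paths under /var/run/ceph;
the sleep intervals are arbitrary, and on some distros the init command is
"service ceph ..." instead.

    # On the mon.f host: stop the new monitor and see whether mon.e rejoins.
    /etc/init.d/ceph stop mon.f
    sleep 60
    ceph quorum_status          # expect quorum 0,1,2,3 = b,c,d,e again

    # Restart mon.f; it should pick a (possibly different) mon to sync from.
    /etc/init.d/ceph start mon.f
    sleep 10

    # mon.f's view of its own state ("synchronizing" while pulling data);
    # exact field names vary by version.
    ceph --admin-daemon /var/run/ceph/ceph-mon.f.asok mon_status

    # Then watch whether a different monitor drops out this time.
    ceph quorum_status
    ceph health detail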
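
And if a reproduction with complete mon logs is attempted, one way to turn
up monitor verbosity is through each mon's admin socket. A sketch, again
assuming default socket paths (the MON_ID variable is just illustrative):

    # Run on each monitor host, once per local mon (b, c, d, e, f):
    MON_ID=b
    ASOK=/var/run/ceph/ceph-mon.${MON_ID}.asok
    ceph --admin-daemon $ASOK config set debug_mon 10
    ceph --admin-daemon $ASOK config set debug_paxos 10
    ceph --admin-daemon $ASOK config set debug_ms 1

    # Equivalent ceph.conf settings, for a mon that is not running yet:
    #   [mon]
    #       debug mon = 10
    #       debug paxos = 10
    #       debug ms = 1
    # Logs then land in /var/log/ceph/ceph-mon.<id>.log by default.

debug ms = 1 alone is usually enough to see whether the probe/sync traffic
between mon.e and mon.f is actually flowing.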
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com