Hello,

I'm running a ceph-12.2.2 cluster on debian/stretch with three mon servers and am unsuccessfully trying to add a fourth (or a fourth and a fifth) mon. While the new mon stays in state "synchronizing", the existing mons fall out of quorum and endlessly cycle from "peon" to "electing" or "probing" and eventually back to "peon" or "leader". On a small test cluster everything works as expected and the new mons join painlessly, but on my production cluster I always run into this trouble, both with ceph-deploy and with the manual procedure (sketched below). I'm probably missing something fundamental. Can anyone give me a hint?

These are the existing mons:

  my-ceph-mon-3: IP AAA.BBB.CCC.23
  my-ceph-mon-4: IP AAA.BBB.CCC.24
  my-ceph-mon-5: IP AAA.BBB.CCC.25

Trying to add:

  my-ceph-mon-1: IP AAA.BBB.CCC.31
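In case it matters, the manual attempt is nothing exotic, essentially the documented steps run on the new node (the /tmp paths below are just placeholders):

  # prepare the mon data directory for the new monitor
  mkdir -p /var/lib/ceph/mon/ceph-my-ceph-mon-1

  # fetch the mon. keyring and the current monmap from the running cluster
  ceph auth get mon. -o /tmp/mon.keyring
  ceph mon getmap -o /tmp/monmap

  # build the new mon's store and hand it over to the ceph user
  ceph-mon -i my-ceph-mon-1 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
  chown -R ceph:ceph /var/lib/ceph/mon/ceph-my-ceph-mon-1

  # public address for the new mon is set in ceph.conf (otherwise pass --public-addr)
  # start the monitor; it should find its peers and join
  systemctl start ceph-mon@my-ceph-mon-1

The ceph-deploy attempt was simply "ceph-deploy mon add my-ceph-mon-1". In both cases the new mon comes up, goes to "synchronizing", and the endless elections described above start.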
Here is a (hopefully) relevant and representative part of the log on my-ceph-mon-5 while my-ceph-mon-1 tries to join:

2018-01-11 15:16:08.340741 7f69ba8db700 0 mon.my-ceph-mon-5@2(peon).data_health(6128) update_stats avail 57% total 19548 MB, used 8411 MB, avail 11149 MB
2018-01-11 15:16:16.830566 7f69b48cf700 0 -- AAA.BBB.CCC.18:6789/0 >> AAA.BBB.CCC.31:6789/0 conn(0x55d19cac2000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=1 existing_state=STATE_STANDBY
2018-01-11 15:16:16.830582 7f69b48cf700 0 -- AAA.BBB.CCC.18:6789/0 >> AAA.BBB.CCC.31:6789/0 conn(0x55d19cac2000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-01-11 15:16:16.831864 7f69b80d6700 1 mon.my-ceph-mon-5@2(peon) e15 adding peer AAA.BBB.CCC.31:6789/0 to list of hints
2018-01-11 15:16:16.833701 7f69b50d0700 0 -- AAA.BBB.CCC.18:6789/0 >> AAA.BBB.CCC.31:6789/0 conn(0x55d19c8ca000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=1 existing_state=STATE_STANDBY
2018-01-11 15:16:16.833713 7f69b50d0700 0 -- AAA.BBB.CCC.18:6789/0 >> AAA.BBB.CCC.31:6789/0 conn(0x55d19c8ca000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-01-11 15:16:16.834843 7f69b80d6700 1 mon.my-ceph-mon-5@2(peon) e15 adding peer AAA.BBB.CCC.31:6789/0 to list of hints
2018-01-11 15:16:35.907962 7f69ba8db700 1 mon.my-ceph-mon-5@2(peon).paxos(paxos active c 9653210..9653763) lease_timeout -- calling new election
2018-01-11 15:16:35.908589 7f69b80d6700 0 mon.my-ceph-mon-5@2(probing) e15 handle_command mon_command({"prefix": "status"} v 0) v1
2018-01-11 15:16:35.908630 7f69b80d6700 0 log_channel(audit) log [DBG] : from='client.? 172.25.24.15:0/1078983440' entity='client.admin' cmd=[{"prefix": "status"}]: dispatch
2018-01-11 15:16:35.909124 7f69b80d6700 0 log_channel(cluster) log [INF] : mon.my-ceph-mon-5 calling new monitor election
2018-01-11 15:16:35.909284 7f69b80d6700 1 mon.my-ceph-mon-5@2(electing).elector(6128) init, last seen epoch 6128
2018-01-11 15:16:50.132414 7f69ba8db700 1 mon.my-ceph-mon-5@2(electing).elector(6129) init, last seen epoch 6129, mid-election, bumping
2018-01-11 15:16:55.209177 7f69b80d6700 -1 mon.my-ceph-mon-5@2(peon).paxos(paxos recovering c 9653210..9653777) lease_expire from mon.0 AAA.BBB.CCC.23:6789/0 is 0.032801 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
2018-01-11 15:17:09.316472 7f69ba8db700 1 mon.my-ceph-mon-5@2(peon).paxos(paxos updating c 9653210..9653778) lease_timeout -- calling new election
2018-01-11 15:17:09.316597 7f69ba8db700 0 mon.my-ceph-mon-5@2(probing).data_health(6134) update_stats avail 57% total 19548 MB, used 8411 MB, avail 11149 MB
2018-01-11 15:17:09.317414 7f69b80d6700 0 log_channel(cluster) log [INF] : mon.my-ceph-mon-5 calling new monitor election
2018-01-11 15:17:09.317517 7f69b80d6700 1 mon.my-ceph-mon-5@2(electing).elector(6134) init, last seen epoch 6134
2018-01-11 15:17:22.059573 7f69ba8db700 1 mon.my-ceph-mon-5@2(peon).paxos(paxos updating c 9653210..9653779) lease_timeout -- calling new election
2018-01-11 15:17:22.060021 7f69b80d6700 1 mon.my-ceph-mon-5@2(probing).data_health(6138) service_dispatch_op not in quorum -- drop message
2018-01-11 15:17:22.060279 7f69b80d6700 1 mon.my-ceph-mon-5@2(probing).data_health(6138) service_dispatch_op not in quorum -- drop message
2018-01-11 15:17:22.060499 7f69b80d6700 0 log_channel(cluster) log [INF] : mon.my-ceph-mon-5 calling new monitor election
2018-01-11 15:17:22.060612 7f69b80d6700 1 mon.my-ceph-mon-5@2(electing).elector(6138) init, last seen epoch 6138
...

As far as I can see, clock skew is not a problem (tested with "ntpq -p"; see the P.S. below). Any idea what might be going wrong?

Thanks,
Thomas
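P.S.: For completeness, this is roughly how I check clocks and monitor state while my-ceph-mon-1 is trying to join; nothing beyond the standard tools, hostnames as above:

  # NTP peers and offsets on every mon host look sane
  ntpq -p

  # the monitors' own view of clock skew
  ceph time-sync-status

  # quorum / election state via the admin socket on each mon host
  ceph daemon mon.my-ceph-mon-5 mon_status
  ceph daemon mon.my-ceph-mon-1 mon_status   # reports "synchronizing"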