new mon can't join new cluster, probe_timeout / probing

Hello,

[This is Hammer, 0.94.9, since Proxmox is waiting for the new Jewel release due to some relevant fixes.]


This is possibly some network issue, but I can't see any indicator of where to look. mon0 usually stands in quorum alone, and the other mons cannot join. They get the monmap and intend to join, but it just never happens; the mons go from synchronizing to probing, forever. Raising the log level doesn't reveal anything to me.

The cluster network and public network differ, and the mons are supposed to be on the public network.
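For reference, a minimal sketch of how that split looks in ceph.conf; the public side matches the addresses in the logs below, while the cluster subnet and host name are placeholders, not my actual values:

[global]
    fsid = ca404f45-def3-4c22-a83b-7939e3f92514
    public network = 10.75.13.0/24
    cluster network = 10.75.14.0/24    ; placeholder subnet

[mon.1]
    host = node2                       ; placeholder host name
    mon addr = 10.75.13.132:6789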

mon0:
2016-11-23 16:26:16.920691 7f8f193da700  1 mon.0@0(leader) e1  adding peer 10.75.13.132:6789/0 to list of hints
2016-11-23 16:26:18.922057 7f8f193da700  1 mon.0@0(leader) e1  adding peer 10.75.13.132:6789/0 to list of hints
2016-11-23 16:26:20.923695 7f8f193da700  1 mon.0@0(leader) e1  adding peer 10.75.13.132:6789/0 to list of hints
2016-11-23 16:26:22.925172 7f8f193da700  1 mon.0@0(leader) e1  adding peer 10.75.13.132:6789/0 to list of hints
...forever


mon1:
2016-11-23 16:25:14.887453 7fe81a87f880  0 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90), process ceph-mon, pid 8956
2016-11-23 16:25:14.909873 7fe81a87f880  0 mon.1 does not exist in monmap, will attempt to join an existing cluster
2016-11-23 16:25:14.910934 7fe81a87f880  0 using public_addr 10.75.13.132:0/0 -> 10.75.13.132:6789/0
2016-11-23 16:25:14.911012 7fe81a87f880  0 starting mon.1 rank -1 at 10.75.13.132:6789/0 mon_data /var/lib/ceph/mon/ceph-1 fsid ca404f45-def3-4c22-a83b-7939e3f92514
2016-11-23 16:25:14.911406 7fe81a87f880  1 mon.1@-1(probing) e0 preinit fsid ca404f45-def3-4c22-a83b-7939e3f92514
2016-11-23 16:26:14.912255 7fe8137f8700  0 mon.1@-1(synchronizing).data_health(0) update_stats avail 92% total 69923 MB, used 1552 MB, avail 64796 MB
2016-11-23 16:27:14.912613 7fe8137f8700  0 mon.1@-1(probing).data_health(0) update_stats avail 92% total 69923 MB, used 1552 MB, avail 64796 MB
2016-11-23 16:28:14.912868 7fe8137f8700  0 mon.1@-1(probing).data_health(0) update_stats avail 92% total 69923 MB, used 1552 MB, avail 64796 MB
...forever as well.
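For anyone reproducing this: the state of an out-of-quorum mon can also be inspected directly through its admin socket, which keeps answering even while the daemon is stuck probing; a sketch, assuming the default socket path:

ceph --admin-daemon /var/run/ceph/ceph-mon.1.asok mon_status

This should just confirm the rank -1 / probing state visible in the log above.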



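For reference, the extra detail below comes from bumping the mon and messenger debug levels; they can be raised at runtime through the admin socket, roughly like this (assuming the default socket path):

ceph --admin-daemon /var/run/ceph/ceph-mon.0.asok config set debug_mon 20
ceph --admin-daemon /var/run/ceph/ceph-mon.0.asok config set debug_ms 10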
With the log level raised, mon0 shows:
2016-11-23 17:19:11.330786 7f8f20c60700 10 _calc_signature seq 366 front_crc_ = 1411686358 middle_crc = 0 data_crc = 0 sig = 16063324873821844002
2016-11-23 17:19:11.330928 7f8f193da700 20 mon.0@0(leader) e1 have connection
2016-11-23 17:19:11.330937 7f8f193da700 20 mon.0@0(leader) e1 ms_dispatch existing session MonSession: mon.? 10.75.13.132:6789/0 is openallow * for mon.? 10.75.13.132:6789/0
2016-11-23 17:19:11.330947 7f8f193da700 20 mon.0@0(leader) e1  caps allow *
2016-11-23 17:19:11.330953 7f8f193da700 20 is_capable service=mon command= read on cap allow *
2016-11-23 17:19:11.330956 7f8f193da700 20  allow so far , doing grant allow *
2016-11-23 17:19:11.330958 7f8f193da700 20  allow all
2016-11-23 17:19:11.330961 7f8f193da700 10 mon.0@0(leader) e1 handle_probe mon_probe(probe ca404f45-def3-4c22-a83b-7939e3f92514 name 1 new) v6
2016-11-23 17:19:11.330969 7f8f193da700 10 mon.0@0(leader) e1 handle_probe_probe mon.? 10.75.13.132:6789/0mon_probe(probe ca404f45-def3-4c22-a83b-7939e3f92514 name 1 new) v6 features 55169095435288575
2016-11-23 17:19:11.331009 7f8f193da700  1 mon.0@0(leader) e1  adding peer 10.75.13.132:6789/0 to list of hints
2016-11-23 17:19:11.331129 7f8f173d6700 10 _calc_signature seq 442670678 front_crc_ = 1084090475 middle_crc = 0 data_crc = 0 sig = 15627235992780641097
2016-11-23 17:19:11.331164 7f8f173d6700 20 Putting signature in client message(seq # 442670678): sig = 15627235992780641097
2016-11-23 17:19:13.344756 7f8f20c60700 10 _calc_signature seq 367 front_crc_ = 1411686358 middle_crc = 0 data_crc = 0 sig = 10295634500541529978
2016-11-23 17:19:13.344931 7f8f193da700 20 mon.0@0(leader) e1 have connection
2016-11-23 17:19:13.344940 7f8f193da700 20 mon.0@0(leader) e1 ms_dispatch existing session MonSession: mon.? 10.75.13.132:6789/0 is openallow * for mon.? 10.75.13.132:6789/0
2016-11-23 17:19:13.344952 7f8f193da700 20 mon.0@0(leader) e1  caps allow *
2016-11-23 17:19:13.344959 7f8f193da700 20 is_capable service=mon command= read on cap allow *
2016-11-23 17:19:13.344962 7f8f193da700 20  allow so far , doing grant allow *
2016-11-23 17:19:13.344964 7f8f193da700 20  allow all
2016-11-23 17:19:13.344967 7f8f193da700 10 mon.0@0(leader) e1 handle_probe mon_probe(probe ca404f45-def3-4c22-a83b-7939e3f92514 name 1 new) v6
2016-11-23 17:19:13.344975 7f8f193da700 10 mon.0@0(leader) e1 handle_probe_probe mon.? 10.75.13.132:6789/0mon_probe(probe ca404f45-def3-4c22-a83b-7939e3f92514 name 1 new) v6 features 55169095435288575
2016-11-23 17:19:13.345019 7f8f193da700  1 mon.0@0(leader) e1  adding peer 10.75.13.132:6789/0 to list of hints

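If I read the probe handshake right, "adding peer ... to list of hints" means mon0 receives the probe from 10.75.13.132 but that address is not in its monmap, so it only records it as an extra probe peer. The current map itself can be double-checked with:

ceph mon dump

which presumably still shows epoch 1 with only mon.0 in it.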

mon1 sometimes logs lines like:
2016-11-23 17:06:04.241491 7f7c3f855700  0 -- 10.75.13.132:6789/0 >> 10.75.13.131:6789/0 pipe(0x3ae4000 sd=13 :53558 s=2 pgs=106 cs=1 l=0 c=0x3937600).reader missed message?  skipped from seq 0 to 64927996
2016-11-23 17:06:04.241620 7f7c41859700  0 mon.1@1(probing) e1  my rank is now -1 (was 1)
2016-11-23 17:06:04.242622 7f7c3f855700  0 -- 10.75.13.132:6789/0 >> 10.75.13.131:6789/0 pipe(0x3ae4000 sd=22 :6789 s=0 pgs=0 cs=0 l=0 c=0x3938260).accept connect_seq 2 vs existing 0 state connecting
2016-11-23 17:06:04.242633 7f7c3f855700  0 -- 10.75.13.132:6789/0 >> 10.75.13.131:6789/0 pipe(0x3ae4000 sd=22 :6789 s=0 pgs=0 cs=0 l=0 c=0x3938260).accept we reset (peer sent cseq 2, 0x3ae9000.cseq = 0), sending RESETSESSION
2016-11-23 17:06:04.243404 7f7c3f855700  0 -- 10.75.13.132:6789/0 >> 10.75.13.131:6789/0 pipe(0x3ae9000 sd=13 :53560 s=2 pgs=108 cs=1 l=0 c=0x3937e40).reader missed message?  skipped from seq 0 to 442670313

but I can't tell if that's a problem or just business as usual.


I've tried various things, including forcing quorum and resyncing, with no success.
The network is connected through a bridge on a bonded (EtherChannel) interface, but ping works fine. All the nodes run Debian Linux (or a derivative). As far as I can see Ceph shouldn't use multicast, so no multicast-related problems should come into the equation.
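One caveat worth noting: plain ping only proves that small packets get through. To rule out an MTU mismatch somewhere on the bond/bridge path, non-fragmenting pings at full frame size are worth a try; a sketch for a 1500-byte MTU, where the payload is the MTU minus 28 bytes of IP/ICMP headers (so 8972 for 9000-byte jumbo frames):

ping -M do -s 1472 -c 3 10.75.13.131

If these fail while plain ping works, the bond/bridge is eating large frames.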

The whole system has been purged several times, at various levels. It is supposed to work, as the setup is exactly the same as other working configs.
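For completeness, a rough sketch of the standard add-a-monitor steps used for the recreate (temp paths are placeholders):

ceph auth get mon. -o /tmp/mon.keyring
ceph mon getmap -o /tmp/monmap
ceph-mon -i 1 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring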

I'm kind of out of ideas about where to look.

Thanks,
Peter
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


