On Tue, Mar 25, 2014 at 12:53 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
Likewise, mon.1 sends an ack to mon.0 that is never received. I thinkOn Tue, Mar 25, 2014 at 9:24 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
> Okay, last one until I get some guidance. Sorry for the spam, but wanted to
> paint a full picture. Here are debug logs from all three mons, capturing
> what looks like an election sequence to me:
>
> ceph0:
> 2014-03-25 16:17:24.324846 7fa5c53fc700 5 mon.ceph0@0(electing).elector(35)
> start -- can i be leader?
> 2014-03-25 16:17:24.324900 7fa5c53fc700 1 mon.ceph0@0(electing).elector(35)
> init, last seen epoch 35
> 2014-03-25 16:17:24.324913 7fa5c53fc700 1 -- 10.10.30.0:6789/0 --> mon.1
> 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 propose
> 35) v4 -- ?+0 0x263d480
> 2014-03-25 16:17:24.324948 7fa5c53fc700 1 -- 10.10.30.0:6789/0 --> mon.2
> 10.10.30.2:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 propose
> 35) v4 -- ?+0 0x263d6c0
> 2014-03-25 16:17:25.353975 7fa5c4bfb700 1 -- 10.10.30.0:6789/0 <== mon.2
> 10.10.30.2:6789/0 493 ==== election(b3f38955-4321-4850-9ddb-3b09940dc951
> propose 35) v4 ==== 537+0+0 (4036841703 0 0) 0x265fd80 con 0x1df0c60
> 2014-03-25 16:17:25.354042 7fa5c4bfb700 5 mon.ceph0@0(electing).elector(35)
> handle_propose from mon.2
> 2014-03-25 16:17:29.325107 7fa5c53fc700 5 mon.ceph0@0(electing).elector(35)
> election timer expired
>
> ceph1:
> 2014-03-25 16:17:24.325529 7ffe48cc1700 5 mon.ceph1@1(electing).elector(35)
> handle_propose from mon.0
> 2014-03-25 16:17:24.325535 7ffe48cc1700 5 mon.ceph1@1(electing).elector(35)
> defer to 0
> 2014-03-25 16:17:24.325546 7ffe48cc1700 1 -- 10.10.30.1:6789/0 --> mon.0
> 10.10.30.0:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 ack 35)
> v4 -- ?+0 0x1bbfb40
> 2014-03-25 16:17:25.354038 7ffe48cc1700 1 -- 10.10.30.1:6789/0 <== mon.2
> 10.10.30.2:6789/0 489 ==== election(b3f38955-4321-4850-9ddb-3b09940dc951
> propose 35) v4 ==== 537+0+0 (4036841703 0 0) 0x1bbf6c0 con 0x14d9b00
> 2014-03-25 16:17:25.354102 7ffe48cc1700 5 mon.ceph1@1(electing).elector(35)
> handle_propose from mon.2
> 2014-03-25 16:17:25.354113 7ffe48cc1700 5 mon.ceph1@1(electing).elector(35)
> no, we already acked 0
>
> ceph2:
> 2014-03-25 16:17:20.353135 7f80d0013700 5 mon.ceph2@2(electing).elector(35)
> election timer expired
> 2014-03-25 16:17:20.353154 7f80d0013700 5 mon.ceph2@2(electing).elector(35)
> start -- can i be leader?
> 2014-03-25 16:17:20.353225 7f80d0013700 1 mon.ceph2@2(electing).elector(35)
> init, last seen epoch 35
> 2014-03-25 16:17:20.353238 7f80d0013700 1 -- 10.10.30.2:6789/0 --> mon.0
> 10.10.30.0:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 propose
> 35) v4 -- ?+0 0x18e7900
> 2014-03-25 16:17:20.353272 7f80d0013700 1 -- 10.10.30.2:6789/0 --> mon.1
> 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 propose
> 35) v4 -- ?+0 0x18e7d80
> 2014-03-25 16:17:25.353559 7f80d0013700 5 mon.ceph2@2(electing).elector(35)
> election timer expired
> 2014-03-25 16:17:25.353578 7f80d0013700 5 mon.ceph2@2(electing).elector(35)
> start -- can i be leader?
> 2014-03-25 16:17:25.353647 7f80d0013700 1 mon.ceph2@2(electing).elector(35)
> init, last seen epoch 35
> 2014-03-25 16:17:25.353660 7f80d0013700 1 -- 10.10.30.2:6789/0 --> mon.0
> 10.10.30.0:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 propose
> 35) v4 -- ?+0 0x19b7240
> 2014-03-25 16:17:25.353695 7f80d0013700 1 -- 10.10.30.2:6789/0 --> mon.1
> 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 propose
> 35) v4 -- ?+0 0x19b76c0
> 2014-03-25 16:17:30.354040 7f80d0013700 5 mon.ceph2@2(electing).elector(35)
> election timer expired
>
> Oddly, it looks to me like mon.2 (ceph2) never handles/receives the proposal
> from mon.0 (ceph0). But I admit I have no clue how monitor election works.
you've got a busted firewall that's allowing some one-way
communication, but not two-way.
Hmm, I see that is missing as well. There is no hardware firewall in play, and I've check iptables on all nodes -- no firewall rules are in place at this time. Just default ACCEPT rules everywhere.
However, I do see some weirdness here:
ceph0:
# netstat -atln
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 10.10.30.0:6789 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 10.1.30.0:22 10.1.0.1:33129 ESTABLISHED
tcp 0 85128 10.10.30.0:40760 10.10.30.2:6789 ESTABLISHED
tcp 0 0 10.10.30.0:43920 10.10.30.1:6789 ESTABLISHED
tcp6 0 0 :::22 :::* LISTEN
ceph1:
# netstat -atln
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 10.10.30.1:6789 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 79643 10.10.30.1:6789 10.10.30.0:43920 ESTABLISHED
tcp 0 0 10.10.30.1:41876 10.10.30.2:6789 ESTABLISHED
tcp 0 0 10.1.30.1:22 10.1.0.1:38382 ESTABLISHED
tcp6 0 0 :::22 :::* LISTEN
# netstat -atln
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 10.10.30.0:6789 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 10.1.30.0:22 10.1.0.1:33129 ESTABLISHED
tcp 0 85128 10.10.30.0:40760 10.10.30.2:6789 ESTABLISHED
tcp 0 0 10.10.30.0:43920 10.10.30.1:6789 ESTABLISHED
tcp6 0 0 :::22 :::* LISTEN
ceph1:
# netstat -atln
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 10.10.30.1:6789 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 79643 10.10.30.1:6789 10.10.30.0:43920 ESTABLISHED
tcp 0 0 10.10.30.1:41876 10.10.30.2:6789 ESTABLISHED
tcp 0 0 10.1.30.1:22 10.1.0.1:38382 ESTABLISHED
tcp6 0 0 :::22 :::* LISTEN
ceph2:
# netstat -atln
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 10.10.30.2:6789 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 10.10.30.2:6789 10.10.30.1:41876 ESTABLISHED
tcp 0 0 10.1.30.2:22 10.1.0.1:58195 ESTABLISHED
tcp 0 0 10.10.30.2:6789 10.10.30.0:40760 ESTABLISHED
tcp6 0 0 :::22 :::* LISTEN
# netstat -atln
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 10.10.30.2:6789 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 10.10.30.2:6789 10.10.30.1:41876 ESTABLISHED
tcp 0 0 10.1.30.2:22 10.1.0.1:58195 ESTABLISHED
tcp 0 0 10.10.30.2:6789 10.10.30.0:40760 ESTABLISHED
tcp6 0 0 :::22 :::* LISTEN
Running those netstat command multiple times shows the Send-Q backed up on both ceph0 and ceph1. The number stays the same.
So I see that mon.0 is backed up sending to mon.2, and that mon.1 is backed up sending to mon.0. And mon.2 has no backup.
So I see that mon.0 is backed up sending to mon.2, and that mon.1 is backed up sending to mon.0. And mon.2 has no backup.
Still baffled.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com