Re: Monitors stuck in "electing"

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 






On Tue, Mar 25, 2014 at 12:53 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
On Tue, Mar 25, 2014 at 9:24 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
> Okay, last one until I get some guidance.  Sorry for the spam, but wanted to
> paint a full picture.  Here are debug logs from all three mons, capturing
> what looks like an election sequence to me:
>
> ceph0:
> 2014-03-25 16:17:24.324846 7fa5c53fc700  5 mon.ceph0@0(electing).elector(35)
> start -- can i be leader?
> 2014-03-25 16:17:24.324900 7fa5c53fc700  1 mon.ceph0@0(electing).elector(35)
> init, last seen epoch 35
> 2014-03-25 16:17:24.324913 7fa5c53fc700  1 -- 10.10.30.0:6789/0 --> mon.1
> 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 propose
> 35) v4 -- ?+0 0x263d480
> 2014-03-25 16:17:24.324948 7fa5c53fc700  1 -- 10.10.30.0:6789/0 --> mon.2
> 10.10.30.2:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 propose
> 35) v4 -- ?+0 0x263d6c0
> 2014-03-25 16:17:25.353975 7fa5c4bfb700  1 -- 10.10.30.0:6789/0 <== mon.2
> 10.10.30.2:6789/0 493 ==== election(b3f38955-4321-4850-9ddb-3b09940dc951
> propose 35) v4 ==== 537+0+0 (4036841703 0 0) 0x265fd80 con 0x1df0c60
> 2014-03-25 16:17:25.354042 7fa5c4bfb700  5 mon.ceph0@0(electing).elector(35)
> handle_propose from mon.2
> 2014-03-25 16:17:29.325107 7fa5c53fc700  5 mon.ceph0@0(electing).elector(35)
> election timer expired
>
> ceph1:
> 2014-03-25 16:17:24.325529 7ffe48cc1700  5 mon.ceph1@1(electing).elector(35)
> handle_propose from mon.0
> 2014-03-25 16:17:24.325535 7ffe48cc1700  5 mon.ceph1@1(electing).elector(35)
> defer to 0
> 2014-03-25 16:17:24.325546 7ffe48cc1700  1 -- 10.10.30.1:6789/0 --> mon.0
> 10.10.30.0:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 ack 35)
> v4 -- ?+0 0x1bbfb40
> 2014-03-25 16:17:25.354038 7ffe48cc1700  1 -- 10.10.30.1:6789/0 <== mon.2
> 10.10.30.2:6789/0 489 ==== election(b3f38955-4321-4850-9ddb-3b09940dc951
> propose 35) v4 ==== 537+0+0 (4036841703 0 0) 0x1bbf6c0 con 0x14d9b00
> 2014-03-25 16:17:25.354102 7ffe48cc1700  5 mon.ceph1@1(electing).elector(35)
> handle_propose from mon.2
> 2014-03-25 16:17:25.354113 7ffe48cc1700  5 mon.ceph1@1(electing).elector(35)
> no, we already acked 0
>
> ceph2:
> 2014-03-25 16:17:20.353135 7f80d0013700  5 mon.ceph2@2(electing).elector(35)
> election timer expired
> 2014-03-25 16:17:20.353154 7f80d0013700  5 mon.ceph2@2(electing).elector(35)
> start -- can i be leader?
> 2014-03-25 16:17:20.353225 7f80d0013700  1 mon.ceph2@2(electing).elector(35)
> init, last seen epoch 35
> 2014-03-25 16:17:20.353238 7f80d0013700  1 -- 10.10.30.2:6789/0 --> mon.0
> 10.10.30.0:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 propose
> 35) v4 -- ?+0 0x18e7900
> 2014-03-25 16:17:20.353272 7f80d0013700  1 -- 10.10.30.2:6789/0 --> mon.1
> 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 propose
> 35) v4 -- ?+0 0x18e7d80
> 2014-03-25 16:17:25.353559 7f80d0013700  5 mon.ceph2@2(electing).elector(35)
> election timer expired
> 2014-03-25 16:17:25.353578 7f80d0013700  5 mon.ceph2@2(electing).elector(35)
> start -- can i be leader?
> 2014-03-25 16:17:25.353647 7f80d0013700  1 mon.ceph2@2(electing).elector(35)
> init, last seen epoch 35
> 2014-03-25 16:17:25.353660 7f80d0013700  1 -- 10.10.30.2:6789/0 --> mon.0
> 10.10.30.0:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 propose
> 35) v4 -- ?+0 0x19b7240
> 2014-03-25 16:17:25.353695 7f80d0013700  1 -- 10.10.30.2:6789/0 --> mon.1
> 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 propose
> 35) v4 -- ?+0 0x19b76c0
> 2014-03-25 16:17:30.354040 7f80d0013700  5 mon.ceph2@2(electing).elector(35)
> election timer expired
>
> Oddly, it looks to me like mon.2 (ceph2) never handles/receives the proposal
> from mon.0 (ceph0).  But I admit I have no clue how monitor election works.

Likewise, mon.1 sends an ack to mon.0 that is never received. I think
you've got a busted firewall that's allowing some one-way
communication, but not two-way.

Hmm, I see that is missing as well.  There is no hardware firewall in play, and I've check iptables on all nodes -- no firewall rules are in place at this time.  Just default ACCEPT rules everywhere.

However, I do see some weirdness here:

ceph0:
# netstat -atln
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State     
tcp        0      0 10.10.30.0:6789         0.0.0.0:*               LISTEN    
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN    
tcp        0      0 10.1.30.0:22            10.1.0.1:33129          ESTABLISHED
tcp        0  85128 10.10.30.0:40760        10.10.30.2:6789         ESTABLISHED
tcp        0      0 10.10.30.0:43920        10.10.30.1:6789         ESTABLISHED
tcp6       0      0 :::22                   :::*                    LISTEN  


ceph1:
# netstat -atln
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State     
tcp        0      0 10.10.30.1:6789         0.0.0.0:*               LISTEN    
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN    
tcp        0  79643 10.10.30.1:6789         10.10.30.0:43920        ESTABLISHED
tcp        0      0 10.10.30.1:41876        10.10.30.2:6789         ESTABLISHED
tcp        0      0 10.1.30.1:22            10.1.0.1:38382          ESTABLISHED
tcp6       0      0 :::22                   :::*                    LISTEN   
 

ceph2:
# netstat -atln
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State     
tcp        0      0 10.10.30.2:6789         0.0.0.0:*               LISTEN    
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN    
tcp        0      0 10.10.30.2:6789         10.10.30.1:41876        ESTABLISHED
tcp        0      0 10.1.30.2:22            10.1.0.1:58195          ESTABLISHED
tcp        0      0 10.10.30.2:6789         10.10.30.0:40760        ESTABLISHED
tcp6       0      0 :::22                   :::*                    LISTEN


Running those netstat command multiple times shows the Send-Q backed up on both ceph0 and ceph1.  The number stays the same.

So I see that mon.0 is backed up sending to mon.2, and that mon.1 is backed up sending to mon.0.  And mon.2 has no backup.

Still baffled.

-Greg 
Software Engineer #42 @ http://inktank.com | http://ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux