Re: mon crash

I'd bet that's 3495; it looks and sounds really, really similar. A lot
of the devs are at a conference, but if you see Joao on IRC he'd know
for sure.


On 04/15/2013 04:56 PM, Craig Lewis wrote:
> 
> I'm doing a test of Ceph in two colo facilities.  Since it's just a
> test, I only have 2 VMs running, one in each colo.  Both VMs are running
> mon, mds, a single osd, and the RADOS gw.  Cephx is disabled.  I'm
> testing if the latency between the two facilities (~20ms) is low enough
> that I can run a single Ceph cluster in both locations.  If it doesn't
> work out, I'll run two independent Ceph clusters with manual replication.
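> 
> For reference, the monitor-related bits of my ceph.conf look roughly like
> this (the mon addresses are the ones that show up in the logs below; the
> second hostname and which mon runs on which box are approximate):
> 
>     [global]
>         auth cluster required = none
>         auth service required = none
>         auth client required = none
> 
>     [mon.a]
>         host = ceph0
>         mon addr = 192.168.2.62:6789
> 
>     [mon.b]
>         ; hostname of the second VM is a guess
>         host = ceph1
>         mon addr = 192.168.22.62:6789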
> 
> This weekend, the connection between the two locations was degraded. 
> The link had 37% packet loss, for less than a minute.  When the link
> returned to normal, the re-elected mon leader crashed.
> 
> Is this a real bug, or did this happen because I'm only running 2
> nodes?  I'm trying to avoid bringing more nodes into this test.  My VM
> infrastructure is pretty weak, and I'm afraid that more nodes would
> introduce more noise in the test.
> 
> I saw this happen once before (the primary colo had a UPS failure,
> causing a switch reboot).  The same process crashed, with the same stack
> trace.  When that happened, I ran sudo service ceph restart on the
> machine with the crashed mon, and everything started up fine.  I haven't
> restarted anything this time.
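> 
> (Restarting just the monitor would probably have been enough, something like
> 
>     sudo service ceph restart mon.a    # mon id is a guess for that box
> 
> assuming the sysvinit script accepts a daemon id, but I just restarted
> everything.)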
> 
> I tried to recreate the problem by stopping and starting the VPN between
> the two locations, but that didn't trigger the crash.  I have some more
> ideas on how to trigger it; I'll continue trying today.
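> 
> One idea is to emulate the flaky link with netem on one of the VMs instead
> of bouncing the VPN, roughly like this (the interface name is a placeholder
> for the WAN-facing interface on my setup):
> 
>     sudo tc qdisc add dev eth0 root netem delay 20ms loss 37%
>     # leave it in place for a minute or so, then clear it
>     sudo tc qdisc del dev eth0 root netem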
> 
> 
> 
> arnulf@ceph0:~$ lsb_release -a
> Distributor ID:    Ubuntu
> Description:    Ubuntu 12.04.2 LTS
> Release:    12.04
> Codename:    precise
> 
> arnulf@ceph0:~$ uname -a
> Linux ceph0 3.5.0-27-generic #46~precise1-Ubuntu SMP Tue Mar 26 19:33:21
> UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> 
> arnulf@ceph0:~$ cat /etc/apt/sources.list.d/ceph.list
> deb http://ceph.com/debian-testing/ precise main
> 
> arnulf@ceph0:~$ ceph -v
> ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c)
> 
> 
> ceph-mon.log from the non-elected master, mon.b:
> 2013-04-13 07:57:39.445098 7fde958f4700  0 mon.b@1(peon).data_health(20)
> update_stats avail 85% total 17295768 used 1679152 avail 14738024
> 2013-04-13 07:58:35.150603 7fde950f3700  0 log [INF] : mon.b calling new
> monitor election
> 2013-04-13 07:58:35.150876 7fde950f3700  1 mon.b@1(electing).elector(20)
> init, last seen epoch 20
> 2013-04-13 07:58:39.445355 7fde958f4700  0
> mon.b@1(electing).data_health(20) update_stats avail 85% total 17295768
> used 1679152 avail 14738024
> 2013-04-13 07:58:40.192514 7fde958f4700  1 mon.b@1(electing).elector(21)
> init, last seen epoch 21
> 2013-04-13 07:58:43.748907 7fde93dee700  0 -- 192.168.22.62:6789/0 >>
> 192.168.2.62:6789/0 pipe(0x2c56500 sd=25 :6789 s=2 pgs=108 cs=1
> l=0).fault, initiating reconnect
> 2013-04-13 07:58:43.786209 7fde93ff0700  0 -- 192.168.22.62:6789/0 >>
> 192.168.2.62:6789/0 pipe(0x2c56500 sd=8 :6789 s=1 pgs=108 cs=2 l=0).fault
> 2013-04-13 07:59:13.050245 7fde958f4700  1 mon.b@1(probing) e1
> discarding message auth(proto 0 26 bytes epoch 1) v1 and sending client
> elsewhere; we are not in quorum
> 2013-04-13 07:59:13.050277 7fde958f4700  1 mon.b@1(probing) e1
> discarding message auth(proto 0 34 bytes epoch 1) v1 and sending client
> elsewhere; we are not in quorum
> 2013-04-13 07:59:13.050285 7fde958f4700  1 mon.b@1(probing) e1
> discarding message auth(proto 0 26 bytes epoch 1) v1 and sending client
> elsewhere; we are not in quorum
> ...
> 
> ceph-mon.log from the elected master, mon.a:
> 2013-04-13 07:57:41.756844 7f162be82700  0
> mon.a@0(leader).data_health(20) update_stats avail 84% total 17295768
> used 1797312 avail 14619864
> 2013-04-13 07:58:35.210875 7f162b681700  0 log [INF] : mon.a calling new
> monitor election
> 2013-04-13 07:58:35.211081 7f162b681700  1 mon.a@0(electing).elector(20)
> init, last seen epoch 20
> 2013-04-13 07:58:40.270547 7f162be82700  1 mon.a@0(electing).elector(21)
> init, last seen epoch 21
> 2013-04-13 07:58:41.757032 7f162be82700  0
> mon.a@0(electing).data_health(20) update_stats avail 84% total 17295768
> used 1797312 avail 14619864
> 2013-04-13 07:58:43.441306 7f162b681700  0 log [INF] : mon.a@0 won
> leader election with quorum 0,1
> 2013-04-13 07:58:43.560319 7f162b681700  0 log [INF] : pgmap v1684: 632
> pgs: 632 active+clean; 9982 bytes data, 2079 MB used, 100266 MB / 102346
> MB avail; 0B/s rd, 0B/s wr, 0op/s
> 2013-04-13 07:58:43.561722 7f162b681700 -1 mon/PaxosService.cc: In
> function 'void PaxosService::propose_pending()' thread 7f162b681700 time
> 2013-04-13 07:58:43.560456
> mon/PaxosService.cc: 127: FAILED assert(have_pending)
> 
>  ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c)
>  1: (PaxosService::propose_pending()+0x46d) [0x4dee3d]
>  2: (MDSMonitor::tick()+0x1c62) [0x51cdd2]
>  3: (MDSMonitor::on_active()+0x1a) [0x512ada]
>  4: (PaxosService::_active()+0x31d) [0x4e067d]
>  5: (Context::complete(int)+0xa) [0x4b7b4a]
>  6: (finish_contexts(CephContext*, std::list<Context*,
> std::allocator<Context*> >&, int)+0x95) [0x4ba5a5]
>  7: (Paxos::handle_last(MMonPaxos*)+0xbef) [0x4da92f]
>  8: (Paxos::dispatch(PaxosServiceMessage*)+0x26b) [0x4dad8b]
>  9: (Monitor::_ms_dispatch(Message*)+0x149f) [0x4b310f]
>  10: (Monitor::ms_dispatch(Message*)+0x32) [0x4c9d12]
>  11: (DispatchQueue::entry()+0x341) [0x698da1]
>  12: (DispatchQueue::DispatchThread::entry()+0xd) [0x626c5d]
>  13: (()+0x7e9a) [0x7f1630c76e9a]
>  14: (clone()+0x6d) [0x7f162f911cbd]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
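> 
> I can generate that if it helps; I'm assuming something along these lines is
> what's wanted (with the matching debug symbols installed so the output is
> usable):
> 
>     # path is the one from the stock Debian/Ubuntu package
>     objdump -rdS /usr/bin/ceph-mon > ceph-mon.objdump
>     # or to resolve a single frame, e.g. the top of the trace above:
>     addr2line -Cfe /usr/bin/ceph-mon 0x4dee3d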
> 
> 
> I saved the log files from both events, in case you want the "--- begin
> dump of recent events ---" section.
> 
> This looks somewhat similar to http://tracker.ceph.com/issues/3495.
> The first 6 lines of the stack trace are the same functions, but they
> diverge at line 7.
> 
> 
> -- 
> 
> *Craig Lewis*
> Senior Systems Engineer
> Office +1.714.602.1309
> Email clewis@xxxxxxxxxxxxxxxxxx <mailto:clewis@xxxxxxxxxxxxxxxxxx>
> 
> *Central Desktop. Work together in ways you never thought possible.*
> Connect with us   Website <http://www.centraldesktop.com/>  |  Twitter
> <http://www.twitter.com/centraldesktop>  |  Facebook
> <http://www.facebook.com/CentralDesktop>  |  LinkedIn
> <http://www.linkedin.com/groups?gid=147417>  |  Blog
> <http://cdblog.centraldesktop.com/>
> 


-- 
Matthew
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



