I'd bet that's 3495; it looks and sounds really, really similar. A lot of the devs are at a conference, but if you see Joao on IRC he'd know for sure.

On 04/15/2013 04:56 PM, Craig Lewis wrote:
>
> I'm doing a test of Ceph in two colo facilities. Since it's just a
> test, I only have 2 VMs running, one in each colo. Both VMs are running
> mon, mds, a single osd, and the RADOS gw. Cephx is disabled. I'm
> testing whether the latency between the two facilities (~20ms) is low enough
> that I can run a single Ceph cluster across both locations. If it doesn't
> work out, I'll run two independent Ceph clusters with manual replication.
>
> This weekend, the connection between the two locations was degraded.
> The link had 37% packet loss for less than a minute. When the link
> returned to normal, the re-elected mon leader crashed.
>
> Is this a real bug, or did this happen because I'm only running 2
> nodes? I'm trying to avoid bringing more nodes into this test. My VM
> infrastructure is pretty weak, and I'm afraid that more nodes would
> introduce more noise into the test.
>
> I saw this happen once before (the primary colo had a UPS failure,
> causing a switch reboot). The same process crashed with the same stack
> trace. When that happened, I ran sudo service ceph restart on the
> machine with the crashed mon, and everything started up fine. I haven't
> restarted anything this time.
>
> I tried to recreate the problem by stopping and starting the VPN between
> the two locations, but that didn't trigger the crash. I have some more
> ideas on how to trigger it; I'll continue trying today.
>
>
> arnulf@ceph0:~$ lsb_release -a
> Distributor ID: Ubuntu
> Description:    Ubuntu 12.04.2 LTS
> Release:        12.04
> Codename:       precise
>
> arnulf@ceph0:~$ uname -a
> Linux ceph0 3.5.0-27-generic #46~precise1-Ubuntu SMP Tue Mar 26 19:33:21 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>
> arnulf@ceph0:~$ cat /etc/apt/sources.list.d/ceph.list
> deb http://ceph.com/debian-testing/ precise main
>
> arnulf@ceph0:~$ ceph -v
> ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c)
>
>
> ceph-mon.log from the non-elected master, mon.b:
> 2013-04-13 07:57:39.445098 7fde958f4700 0 mon.b@1(peon).data_health(20) update_stats avail 85% total 17295768 used 1679152 avail 14738024
> 2013-04-13 07:58:35.150603 7fde950f3700 0 log [INF] : mon.b calling new monitor election
> 2013-04-13 07:58:35.150876 7fde950f3700 1 mon.b@1(electing).elector(20) init, last seen epoch 20
> 2013-04-13 07:58:39.445355 7fde958f4700 0 mon.b@1(electing).data_health(20) update_stats avail 85% total 17295768 used 1679152 avail 14738024
> 2013-04-13 07:58:40.192514 7fde958f4700 1 mon.b@1(electing).elector(21) init, last seen epoch 21
> 2013-04-13 07:58:43.748907 7fde93dee700 0 -- 192.168.22.62:6789/0 >> 192.168.2.62:6789/0 pipe(0x2c56500 sd=25 :6789 s=2 pgs=108 cs=1 l=0).fault, initiating reconnect
> 2013-04-13 07:58:43.786209 7fde93ff0700 0 -- 192.168.22.62:6789/0 >> 192.168.2.62:6789/0 pipe(0x2c56500 sd=8 :6789 s=1 pgs=108 cs=2 l=0).fault
> 2013-04-13 07:59:13.050245 7fde958f4700 1 mon.b@1(probing) e1 discarding message auth(proto 0 26 bytes epoch 1) v1 and sending client elsewhere; we are not in quorum
> 2013-04-13 07:59:13.050277 7fde958f4700 1 mon.b@1(probing) e1 discarding message auth(proto 0 34 bytes epoch 1) v1 and sending client elsewhere; we are not in quorum
> 2013-04-13 07:59:13.050285 7fde958f4700 1 mon.b@1(probing) e1 discarding message auth(proto 0 26 bytes epoch 1) v1 and sending client elsewhere; we are not in quorum
> ...
>
> ceph-mon.log from the elected master, mon.a:
> 2013-04-13 07:57:41.756844 7f162be82700 0 mon.a@0(leader).data_health(20) update_stats avail 84% total 17295768 used 1797312 avail 14619864
> 2013-04-13 07:58:35.210875 7f162b681700 0 log [INF] : mon.a calling new monitor election
> 2013-04-13 07:58:35.211081 7f162b681700 1 mon.a@0(electing).elector(20) init, last seen epoch 20
> 2013-04-13 07:58:40.270547 7f162be82700 1 mon.a@0(electing).elector(21) init, last seen epoch 21
> 2013-04-13 07:58:41.757032 7f162be82700 0 mon.a@0(electing).data_health(20) update_stats avail 84% total 17295768 used 1797312 avail 14619864
> 2013-04-13 07:58:43.441306 7f162b681700 0 log [INF] : mon.a@0 won leader election with quorum 0,1
> 2013-04-13 07:58:43.560319 7f162b681700 0 log [INF] : pgmap v1684: 632 pgs: 632 active+clean; 9982 bytes data, 2079 MB used, 100266 MB / 102346 MB avail; 0B/s rd, 0B/s wr, 0op/s
> 2013-04-13 07:58:43.561722 7f162b681700 -1 mon/PaxosService.cc: In function 'void PaxosService::propose_pending()' thread 7f162b681700 time 2013-04-13 07:58:43.560456
> mon/PaxosService.cc: 127: FAILED assert(have_pending)
>
> ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c)
> 1: (PaxosService::propose_pending()+0x46d) [0x4dee3d]
> 2: (MDSMonitor::tick()+0x1c62) [0x51cdd2]
> 3: (MDSMonitor::on_active()+0x1a) [0x512ada]
> 4: (PaxosService::_active()+0x31d) [0x4e067d]
> 5: (Context::complete(int)+0xa) [0x4b7b4a]
> 6: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x95) [0x4ba5a5]
> 7: (Paxos::handle_last(MMonPaxos*)+0xbef) [0x4da92f]
> 8: (Paxos::dispatch(PaxosServiceMessage*)+0x26b) [0x4dad8b]
> 9: (Monitor::_ms_dispatch(Message*)+0x149f) [0x4b310f]
> 10: (Monitor::ms_dispatch(Message*)+0x32) [0x4c9d12]
> 11: (DispatchQueue::entry()+0x341) [0x698da1]
> 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x626c5d]
> 13: (()+0x7e9a) [0x7f1630c76e9a]
> 14: (clone()+0x6d) [0x7f162f911cbd]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
> I saved the log files from both events, in case you want the --- begin dump of recent events ---.
>
> This looks somewhat similar to http://tracker.ceph.com/issues/3495.
> The first 6 lines of the stack trace are the same functions, but they
> diverge at line 7.
>
>
> --
>
> *Craig Lewis*
> Senior Systems Engineer
> Office +1.714.602.1309
> Email clewis@xxxxxxxxxxxxxxxxxx <mailto:clewis@xxxxxxxxxxxxxxxxxx>
>
> *Central Desktop. Work together in ways you never thought possible.*
> Connect with us: Website <http://www.centraldesktop.com/> | Twitter <http://www.twitter.com/centraldesktop> | Facebook <http://www.facebook.com/CentralDesktop> | LinkedIn <http://www.linkedin.com/groups?gid=147417> | Blog <http://cdblog.centraldesktop.com/>
>

--
Matthew

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
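
For anyone reproducing the two-site test described above, a minimal monitor configuration sketch could look like the following. It is only an illustration of the reported layout (two monitors, cephx disabled), not the reporter's actual ceph.conf: the addresses are the two monitor endpoints that appear in the pipe fault lines of the log, while the hostnames and the mapping of monitors to VMs are assumptions. Note that with only two monitors a majority is both of them, so any loss of the inter-site link takes the cluster out of quorum and forces a new election when it returns; that accounts for the elections in the logs, though not for the assert itself.

    [global]
        # cephx is disabled in this test setup
        auth cluster required = none
        auth service required = none
        auth client required = none

    [mon.a]
        # hostname assumed from the shell prompt; address inferred from the log
        host = ceph0
        mon addr = 192.168.2.62:6789

    [mon.b]
        # second VM's hostname is hypothetical
        host = ceph1
        mon addr = 192.168.22.62:6789

    # osd, mds, and radosgw sections omitted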
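
The report also mentions trying to recreate the problem by stopping and starting the VPN. Since the original trigger was roughly 37% packet loss for under a minute rather than a clean link drop, degrading the link may come closer to the real event than cutting it. A minimal sketch using tc/netem is below; the interface name eth0 and the one-minute duration are assumptions, so substitute whatever device actually carries the inter-site monitor traffic.

    # Inject ~37% packet loss on the inter-site interface for about a minute,
    # then remove the netem qdisc to restore the link.
    sudo tc qdisc add dev eth0 root netem loss 37%
    sleep 60
    sudo tc qdisc del dev eth0 root netem

netem only shapes outgoing traffic on the interface it is attached to, so to make the loss symmetric, run the same commands on both VMs while the test is in progress.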