mon crash

I'm doing a test of Ceph in two colo facilities.  Since it's just a test, I only have 2 VMs running, one in each colo.  Both VMs are running a mon, an mds, a single osd, and the RADOS gateway.  Cephx is disabled.  I'm testing whether the latency between the two facilities (~20ms) is low enough to run a single Ceph cluster across both locations.  If it doesn't work out, I'll run two independent Ceph clusters with manual replication.
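For reference, the monitor side of the setup looks roughly like the sketch below.  The mon addresses are the ones that appear in the logs further down; the hostnames and exact option layout are approximations, not a copy of my actual ceph.conf:

[global]
    # cephx is disabled for this test
    auth cluster required = none
    auth service required = none
    auth client required = none

[mon.a]
    host = ceph0                   # hostname is a guess for illustration
    mon addr = 192.168.2.62:6789

[mon.b]
    host = ceph1                   # likewise a guess
    mon addr = 192.168.22.62:6789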

This weekend, the connection between the two locations was degraded.  The link had 37% packet loss for less than a minute.  When the link returned to normal, the newly elected mon leader crashed.

Is this a real bug, or did it happen because I'm only running 2 nodes?  I'd like to avoid bringing more nodes into this test; my VM infrastructure is pretty weak, and I'm afraid more nodes would add more noise to the results.

I saw this happen once before (the primary colo had a UPS failure, causing a switch reboot).  The same process crashed with the same stack trace.  That time, I ran sudo service ceph restart on the machine with the crashed mon, and everything started up fine.  I haven't restarted anything this time.

I tried to recreate the problem by stopping and starting the VPN between the two locations, but that didn't trigger the crash.  I have some more ideas on how to trigger it, and I'll keep trying today.
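One of those ideas, sketched below: instead of bouncing the VPN, inject loss directly with tc netem to mimic the degraded link.  This assumes eth0 is the interface carrying the inter-colo traffic; adjust as needed.

# degrade the link: ~37% random packet loss, like the real event
sudo tc qdisc add dev eth0 root netem loss 37%

# hold it for about a minute, matching the outage
sleep 60

# restore the link
sudo tc qdisc del dev eth0 root netem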



arnulf@ceph0:~$ lsb_release -a
Distributor ID:    Ubuntu
Description:    Ubuntu 12.04.2 LTS
Release:    12.04
Codename:    precise

arnulf@ceph0:~$ uname -a
Linux ceph0 3.5.0-27-generic #46~precise1-Ubuntu SMP Tue Mar 26 19:33:21 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

arnulf@ceph0:~$ cat /etc/apt/sources.list.d/ceph.list
deb http://ceph.com/debian-testing/ precise main

arnulf@ceph0:~$ ceph -v
ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c)


ceph-mon.log from mon.b, the monitor that did not win the election (the peon):
2013-04-13 07:57:39.445098 7fde958f4700  0 mon.b@1(peon).data_health(20) update_stats avail 85% total 17295768 used 1679152 avail 14738024
2013-04-13 07:58:35.150603 7fde950f3700  0 log [INF] : mon.b calling new monitor election
2013-04-13 07:58:35.150876 7fde950f3700  1 mon.b@1(electing).elector(20) init, last seen epoch 20
2013-04-13 07:58:39.445355 7fde958f4700  0 mon.b@1(electing).data_health(20) update_stats avail 85% total 17295768 used 1679152 avail 14738024
2013-04-13 07:58:40.192514 7fde958f4700  1 mon.b@1(electing).elector(21) init, last seen epoch 21
2013-04-13 07:58:43.748907 7fde93dee700  0 -- 192.168.22.62:6789/0 >> 192.168.2.62:6789/0 pipe(0x2c56500 sd=25 :6789 s=2 pgs=108 cs=1 l=0).fault, initiating reconnect
2013-04-13 07:58:43.786209 7fde93ff0700  0 -- 192.168.22.62:6789/0 >> 192.168.2.62:6789/0 pipe(0x2c56500 sd=8 :6789 s=1 pgs=108 cs=2 l=0).fault
2013-04-13 07:59:13.050245 7fde958f4700  1 mon.b@1(probing) e1 discarding message auth(proto 0 26 bytes epoch 1) v1 and sending client elsewhere; we are not in quorum
2013-04-13 07:59:13.050277 7fde958f4700  1 mon.b@1(probing) e1 discarding message auth(proto 0 34 bytes epoch 1) v1 and sending client elsewhere; we are not in quorum
2013-04-13 07:59:13.050285 7fde958f4700  1 mon.b@1(probing) e1 discarding message auth(proto 0 26 bytes epoch 1) v1 and sending client elsewhere; we are not in quorum

...

ceph-mon.log from mon.a, the elected leader (the one that crashed):
2013-04-13 07:57:41.756844 7f162be82700  0 mon.a@0(leader).data_health(20) update_stats avail 84% total 17295768 used 1797312 avail 14619864
2013-04-13 07:58:35.210875 7f162b681700  0 log [INF] : mon.a calling new monitor election
2013-04-13 07:58:35.211081 7f162b681700  1 mon.a@0(electing).elector(20) init, last seen epoch 20
2013-04-13 07:58:40.270547 7f162be82700  1 mon.a@0(electing).elector(21) init, last seen epoch 21
2013-04-13 07:58:41.757032 7f162be82700  0 mon.a@0(electing).data_health(20) update_stats avail 84% total 17295768 used 1797312 avail 14619864
2013-04-13 07:58:43.441306 7f162b681700  0 log [INF] : mon.a@0 won leader election with quorum 0,1
2013-04-13 07:58:43.560319 7f162b681700  0 log [INF] : pgmap v1684: 632 pgs: 632 active+clean; 9982 bytes data, 2079 MB used, 100266 MB / 102346 MB avail; 0B/s rd, 0B/s wr, 0op/s
2013-04-13 07:58:43.561722 7f162b681700 -1 mon/PaxosService.cc: In function 'void PaxosService::propose_pending()' thread 7f162b681700 time 2013-04-13 07:58:43.560456
mon/PaxosService.cc: 127: FAILED assert(have_pending)

 ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c)
 1: (PaxosService::propose_pending()+0x46d) [0x4dee3d]
 2: (MDSMonitor::tick()+0x1c62) [0x51cdd2]
 3: (MDSMonitor::on_active()+0x1a) [0x512ada]
 4: (PaxosService::_active()+0x31d) [0x4e067d]
 5: (Context::complete(int)+0xa) [0x4b7b4a]
 6: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x95) [0x4ba5a5]
 7: (Paxos::handle_last(MMonPaxos*)+0xbef) [0x4da92f]
 8: (Paxos::dispatch(PaxosServiceMessage*)+0x26b) [0x4dad8b]
 9: (Monitor::_ms_dispatch(Message*)+0x149f) [0x4b310f]
 10: (Monitor::ms_dispatch(Message*)+0x32) [0x4c9d12]
 11: (DispatchQueue::entry()+0x341) [0x698da1]
 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x626c5d]
 13: (()+0x7e9a) [0x7f1630c76e9a]
 14: (clone()+0x6d) [0x7f162f911cbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


I saved the log files from both events, in case you want the "--- begin dump of recent events ---" sections.
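If the disassembly mentioned in the NOTE would help too, I can generate it; something like the following, assuming the Ubuntu package installs the binary at /usr/bin/ceph-mon and logs to the default /var/log/ceph path:

# per the NOTE in the trace, for interpreting the frame addresses
objdump -rdS /usr/bin/ceph-mon > ceph-mon.objdump

# pull everything from the start of the recent-events dump to the end of the log
sed -n '/--- begin dump of recent events ---/,$p' /var/log/ceph/ceph-mon.a.log > mon.a-recent-events.txt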

This looks somewhat similar to http://tracker.ceph.com/issues/3495.  The first 6 frames of the stack trace are the same functions, but they diverge at frame 7.


--

Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis@xxxxxxxxxxxxxxxxxx




