Thanks for the report; created a ticket and somebody will get on it shortly.
http://tracker.ceph.com/issues/16592
-Greg

On Sun, Jul 3, 2016 at 5:55 PM, Bill Sharer <bsharer@xxxxxxxxxxxxxx> wrote:
> I was working on a rolling upgrade on Gentoo to Jewel 10.2.2 from 10.2.0.
> However, now I can't get a monitor quorum going again, because as soon as
> I get one, the mon which wins the election blows out with an assertion
> failure. Here's my status at the moment:
>
> kroll1  10.2.2  ceph mon.0 and ceph osd.0 (normally my lead mon)
> kroll2  10.2.2  ceph mon 1 and ceph osd 2
> kroll3  10.2.2  ceph osd 1
> kroll4  10.2.2  ceph mon 3 and ceph osd 3
> kroll5  10.2.2  ceph mon 4 and ceph mds 2 (normally my active mds)
> kroll6  10.2.0  ceph mon 5 and ceph mds B (normally standby mds)
>
> I had done a rolling upgrade of everything but kroll6 and had rebooted the
> first three osd and mon servers. mds 2 went down during the Gentoo update
> of kroll5 because of memory scarcity, so mds B was the active mds server.
> After rebooting kroll4 I found that mon 0 had gone down with the assertion
> failure. I ended up stopping all ceph processes; the desktops with client
> mounts were all still up for the moment, but they would basically get
> stuck on locks if I tried to access cephfs.
>
> After trying to restart only the mons, beginning with mon 0, the following
> happened to mon.0 once enough mons were up for a quorum:
>
> 2016-07-03 16:34:26.555728 7fbff22f8480 1 leveldb: Recovering log #2592390
> 2016-07-03 16:34:26.555762 7fbff22f8480 1 leveldb: Level-0 table #2592397: started
> 2016-07-03 16:34:26.558788 7fbff22f8480 1 leveldb: Level-0 table #2592397: 192 bytes OK
> 2016-07-03 16:34:26.562263 7fbff22f8480 1 leveldb: Delete type=3 #2592388
> 2016-07-03 16:34:26.562364 7fbff22f8480 1 leveldb: Delete type=0 #2592390
> 2016-07-03 16:34:26.563126 7fbff22f8480 -1 wrote monmap to /etc/ceph/tmpmonmap
> 2016-07-03 17:09:25.753729 7f8291dff480 0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-mon, pid 20842
> 2016-07-03 17:09:25.762588 7f8291dff480 1 leveldb: Recovering log #2592398
> 2016-07-03 17:09:25.767722 7f8291dff480 1 leveldb: Delete type=0 #2592398
> 2016-07-03 17:09:25.767803 7f8291dff480 1 leveldb: Delete type=3 #2592396
> 2016-07-03 17:09:25.768600 7f8291dff480 0 starting mon.0 rank 0 at 192.168.2.1:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
> 2016-07-03 17:09:25.769066 7f8291dff480 1 mon.0@-1(probing) e10 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
> 2016-07-03 17:09:25.769923 7f8291dff480 1 mon.0@-1(probing).paxosservice(pgmap 17869652..17870289) refresh upgraded, format 0 -> 1
> 2016-07-03 17:09:25.769947 7f8291dff480 1 mon.0@-1(probing).pg v0 on_upgrade discarding in-core PGMap
> 2016-07-03 17:09:25.776148 7f8291dff480 0 mon.0@-1(probing).mds e1532 print_map
> e1532
> enable_multiple, ever_enabled_multiple: 0,0
> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
>
> Filesystem 'cephfs' (0)
> fs_name cephfs
> epoch 1530
> flags 0
> modified 2016-05-19 01:21:31.953710
> tableserver 0
> root 0
> session_timeout 60
> session_autoclose 300
> max_file_size 1099511627776
> last_failure 1478
> last_failure_osd_epoch 26431
> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
> max_mds 1
> in 0
> up {0=1190233}
> failed
> damaged
> stopped
> data_pools 0
> metadata_pool 1
> inline_data disabled
> 1190233: 192.168.2.6:6800/5437 'B' mds.0.1526 up:active seq 103145
>
> Standby daemons:
>
> 1190222: 192.168.2.5:6801/5871 '2' mds.-1.0 up:standby seq 135114
>
> 2016-07-03 17:09:25.776444 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
> 2016-07-03 17:09:25.776450 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
> 2016-07-03 17:09:25.776453 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
> 2016-07-03 17:09:25.776454 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
> 2016-07-03 17:09:25.776696 7f8291dff480 1 mon.0@-1(probing).paxosservice(auth 19251..19344) refresh upgraded, format 0 -> 1
> 2016-07-03 17:09:25.777650 7f8291dff480 0 mon.0@-1(probing) e10 my rank is now 0 (was -1)
> 2016-07-03 17:09:25.778191 7f8281e89700 0 -- 192.168.2.1:6789/0 >> 192.168.2.5:6789/0 pipe(0x556e0ab1d360 sd=15 :0 s=1 pgs=0 cs=0 l=0 c=0x556e0ab1e620).fault
> 2016-07-03 17:09:25.778257 7f828208b700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:6789/0 pipe(0x556e0ab17c80 sd=13 :0 s=1 pgs=0 cs=0 l=0 c=0x556e0ab18f40).fault
> 2016-07-03 17:09:25.778314 7f8281d88700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:6789/0 pipe(0x556e0ab1fe50 sd=16 :0 s=1 pgs=0 cs=0 l=0 c=0x556e0ab21160).fault
> 2016-07-03 17:09:25.778364 7f8281f8a700 0 -- 192.168.2.1:6789/0 >> 192.168.2.4:6789/0 pipe(0x556e0ab1a7e0 sd=14 :0 s=1 pgs=0 cs=0 l=0 c=0x556e0ab1baa0).fault
> 2016-07-03 17:10:25.777442 7f828408f700 0 mon.0@0(probing).data_health(0) update_stats avail 71% total 187 GB, used 44468 MB, avail 134 GB
> 2016-07-03 17:10:26.151363 7f8281486700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853 (socket is 192.168.2.6:37584/0)
> 2016-07-03 17:10:26.543059 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f8264002f40 sd=26 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264004250).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38419/0)
> 2016-07-03 17:10:36.543026 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38437/0)
> 2016-07-03 17:10:46.151145 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853 (socket is 192.168.2.6:37586/0)
> 2016-07-03 17:10:56.543087 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38475/0)
> 2016-07-03 17:11:06.542941 7f8281082700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38494/0)
> 2016-07-03 17:11:25.777920 7f828408f700 0 mon.0@0(probing).data_health(0) update_stats avail 71% total 187 GB, used 44468 MB, avail 134 GB
> 2016-07-03 17:11:36.543002 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38556/0)
> 2016-07-03 17:11:41.382221 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:0/976433073 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.2:0/976433073 (socket is 192.168.2.2:58252/0)
> 2016-07-03 17:11:51.382170 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:0/976433073 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.2:0/976433073 (socket is 192.168.2.2:58253/0)
> 2016-07-03 17:12:06.543010 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38602/0)
> 2016-07-03 17:12:16.543013 7f8280f81700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38618/0)
> 2016-07-03 17:12:25.778424 7f828408f700 0 mon.0@0(probing).data_health(0) update_stats avail 71% total 187 GB, used 44468 MB, avail 134 GB
> 2016-07-03 17:12:36.151275 7f8280f81700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853 (socket is 192.168.2.6:37615/0)
> 2016-07-03 17:12:46.151131 7f8280f81700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853 (socket is 192.168.2.6:37627/0)
> 2016-07-03 17:12:50.102454 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.4:6789/0 pipe(0x7f82640055c0 sd=26 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264006a80).accept connect_seq 0 vs existing 0 state connecting
> 2016-07-03 17:12:50.108055 7f828388e700 0 log_channel(cluster) log [INF] : mon.0 calling new monitor election
> 2016-07-03 17:12:50.108235 7f828388e700 1 mon.0@0(electing).elector(12184) init, last seen epoch 12184
> 2016-07-03 17:12:50.116334 7f828388e700 0 log_channel(cluster) log [INF] : mon.0 calling new monitor election
> 2016-07-03 17:12:50.116402 7f828388e700 1 mon.0@0(electing).elector(12195) init, last seen epoch 12195
> 2016-07-03 17:12:51.382215 7f8280f81700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:0/976433073 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.2:0/976433073 (socket is 192.168.2.2:58259/0)
> 2016-07-03 17:12:55.120599 7f828408f700 0 log_channel(cluster) log [INF] : mon.0@0 won leader election with quorum 0,2,3
> 2016-07-03 17:12:55.122490 7f828408f700 0 log_channel(cluster) log [INF] : HEALTH_ERR; 187 pgs are stuck inactive for more than 300 seconds; 114 pgs degraded; 138 pgs peering; 49 pgs stale; 138 pgs stuck inactive; 49 pgs stuck stale; 114 pgs stuck unclean; 114 pgs undersized; recovery 1019372/17575834 objects degraded (5.800%); too many PGs per OSD (449 > max 300); 1/4 in osds are down; noout flag(s) set; 2 mons down, quorum 0,2,3 0,3,4
> 2016-07-03 17:12:55.193094 7f828388e700 0 log_channel(cluster) log [INF] : monmap e10: 5 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,3=192.168.2.4:6789/0,4=192.168.2.5:6789/0,5=192.168.2.6:6789/0}
> 2016-07-03 17:12:55.193254 7f828388e700 0 log_channel(cluster) log [INF] : pgmap v17870289: 768 pgs: 49 stale+active+clean, 114 active+undersized+degraded, 138 peering, 467 active+clean; 7128 GB data, 16620 GB used, 4824 GB / 22356 GB avail; 1019372/17575834 objects degraded (5.800%)
> 2016-07-03 17:12:55.195553 7f828388e700 -1 mon/MDSMonitor.cc: In function 'bool MDSMonitor::maybe_promote_standby(std::shared_ptr<Filesystem>)' thread 7f828388e700 time 2016-07-03 17:12:55.193360
> mon/MDSMonitor.cc: 2796: FAILED assert(info.state == MDSMap::STATE_STANDBY)
>
> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x556e001f1e12]
> 2: (MDSMonitor::maybe_promote_standby(std::shared_ptr<Filesystem>)+0x97f) [0x556dffede53f]
> 3: (MDSMonitor::tick()+0x3b6) [0x556dffee0866]
> 4: (MDSMonitor::on_active()+0x28) [0x556dffed9038]
> 5: (PaxosService::_active()+0x66a) [0x556dffe5968a]
> 6: (Context::complete(int)+0x9) [0x556dffe249a9]
> 7: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x556dffe2ba7c]
> 8: (Paxos::finish_round()+0xd0) [0x556dffe50460]
> 9: (Paxos::handle_last(std::shared_ptr<MonOpRequest>)+0x103d) [0x556dffe51acd]
> 10: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x38c) [0x556dffe5254c]
> 11: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xd3b) [0x556dffe2245b]
> 12: (Monitor::_ms_dispatch(Message*)+0x581) [0x556dffe22b91]
> 13: (Monitor::ms_dispatch(Message*)+0x23) [0x556dffe41393]
> 14: (DispatchQueue::entry()+0x7ba) [0x556e002e722a]
> 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x556e001d62cd]
> 16: (()+0x74a4) [0x7f8290f904a4]
> 17: (clone()+0x6d) [0x7f828f29298d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> If asked, I'll dump the rest of the log.
>
>
> Bill Sharer
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com