Thanks for the report; created a ticket and somebody will get on it shortly.
http://tracker.ceph.com/issues/16592
-Greg

On Sun, Jul 3, 2016 at 5:55 PM, Bill Sharer <bsharer@xxxxxxxxxxxxxx> wrote:
> I was working on a rolling upgrade on Gentoo to Jewel 10.2.2 from 10.2.0.
> However, now I can't get a monitor quorum going again, because as soon as
> I get one, the mon which wins the election blows out with an assertion
> failure. Here's my status at the moment:
>
> kroll1  10.2.2  ceph mon.0 and ceph osd.0 (normally my lead mon)
> kroll2  10.2.2  ceph mon 1 and ceph osd 2
> kroll3  10.2.2  ceph osd 1
> kroll4  10.2.2  ceph mon 3 and ceph osd 3
> kroll5  10.2.2  ceph mon 4 and ceph mds 2 (normally my active mds)
> kroll6  10.2.0  ceph mon 5 and ceph mds B (normally standby mds)
>
> I had done a rolling upgrade of everything but kroll6 and had rebooted the
> first three osd and mon servers. mds 2 went down during the Gentoo update
> of kroll5 because of memory scarcity, so mds B was the active mds server.
> After rebooting kroll4 I found that mon 0 had gone down with the assertion
> failure. I ended up stopping all ceph processes; the desktops with client
> mounts were all still up for the moment, but they would basically get
> stuck on locks if I tried to access cephfs.
>
> After trying to restart only the mons, beginning with mon 0, the following
> happened to mon.0 once enough mons were up for a quorum:
>
> 2016-07-03 16:34:26.555728 7fbff22f8480 1 leveldb: Recovering log #2592390
> 2016-07-03 16:34:26.555762 7fbff22f8480 1 leveldb: Level-0 table #2592397: started
> 2016-07-03 16:34:26.558788 7fbff22f8480 1 leveldb: Level-0 table #2592397: 192 bytes OK
> 2016-07-03 16:34:26.562263 7fbff22f8480 1 leveldb: Delete type=3 #2592388
> 2016-07-03 16:34:26.562364 7fbff22f8480 1 leveldb: Delete type=0 #2592390
> 2016-07-03 16:34:26.563126 7fbff22f8480 -1 wrote monmap to /etc/ceph/tmpmonmap
> 2016-07-03 17:09:25.753729 7f8291dff480 0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-mon, pid 20842
> 2016-07-03 17:09:25.762588 7f8291dff480 1 leveldb: Recovering log #2592398
> 2016-07-03 17:09:25.767722 7f8291dff480 1 leveldb: Delete type=0 #2592398
> 2016-07-03 17:09:25.767803 7f8291dff480 1 leveldb: Delete type=3 #2592396
> 2016-07-03 17:09:25.768600 7f8291dff480 0 starting mon.0 rank 0 at 192.168.2.1:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
> 2016-07-03 17:09:25.769066 7f8291dff480 1 mon.0@-1(probing) e10 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
> 2016-07-03 17:09:25.769923 7f8291dff480 1 mon.0@-1(probing).paxosservice(pgmap 17869652..17870289) refresh upgraded, format 0 -> 1
> 2016-07-03 17:09:25.769947 7f8291dff480 1 mon.0@-1(probing).pg v0 on_upgrade discarding in-core PGMap
> 2016-07-03 17:09:25.776148 7f8291dff480 0 mon.0@-1(probing).mds e1532 print_map
> e1532
> enable_multiple, ever_enabled_multiple: 0,0
> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
>
> Filesystem 'cephfs' (0)
> fs_name cephfs
> epoch 1530
> flags 0
> modified 2016-05-19 01:21:31.953710
> tableserver 0
> root 0
> session_timeout 60
> session_autoclose 300
> max_file_size 1099511627776
> last_failure 1478
> last_failure_osd_epoch 26431
> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
> max_mds 1
> in 0
> up {0=1190233}
> failed
> damaged
> stopped
> data_pools 0
> metadata_pool 1
> inline_data disabled
> 1190233: 192.168.2.6:6800/5437 'B' mds.0.1526 up:active seq 103145
>
> Standby daemons:
>
> 1190222: 192.168.2.5:6801/5871 '2' mds.-1.0 up:standby seq 135114
>
> 2016-07-03 17:09:25.776444 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
> 2016-07-03 17:09:25.776450 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
> 2016-07-03 17:09:25.776453 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
> 2016-07-03 17:09:25.776454 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
> 2016-07-03 17:09:25.776696 7f8291dff480 1 mon.0@-1(probing).paxosservice(auth 19251..19344) refresh upgraded, format 0 -> 1
> 2016-07-03 17:09:25.777650 7f8291dff480 0 mon.0@-1(probing) e10 my rank is now 0 (was -1)
> 2016-07-03 17:09:25.778191 7f8281e89700 0 -- 192.168.2.1:6789/0 >> 192.168.2.5:6789/0 pipe(0x556e0ab1d360 sd=15 :0 s=1 pgs=0 cs=0 l=0 c=0x556e0ab1e620).fault
> 2016-07-03 17:09:25.778257 7f828208b700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:6789/0 pipe(0x556e0ab17c80 sd=13 :0 s=1 pgs=0 cs=0 l=0 c=0x556e0ab18f40).fault
> 2016-07-03 17:09:25.778314 7f8281d88700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:6789/0 pipe(0x556e0ab1fe50 sd=16 :0 s=1 pgs=0 cs=0 l=0 c=0x556e0ab21160).fault
> 2016-07-03 17:09:25.778364 7f8281f8a700 0 -- 192.168.2.1:6789/0 >> 192.168.2.4:6789/0 pipe(0x556e0ab1a7e0 sd=14 :0 s=1 pgs=0 cs=0 l=0 c=0x556e0ab1baa0).fault
> 2016-07-03 17:10:25.777442 7f828408f700 0 mon.0@0(probing).data_health(0) update_stats avail 71% total 187 GB, used 44468 MB, avail 134 GB
> 2016-07-03 17:10:26.151363 7f8281486700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853 (socket is 192.168.2.6:37584/0)
> 2016-07-03 17:10:26.543059 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f8264002f40 sd=26 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264004250).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38419/0)
> 2016-07-03 17:10:36.543026 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38437/0)
> 2016-07-03 17:10:46.151145 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853 (socket is 192.168.2.6:37586/0)
> 2016-07-03 17:10:56.543087 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38475/0)
> 2016-07-03 17:11:06.542941 7f8281082700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38494/0)
> 2016-07-03 17:11:25.777920 7f828408f700 0 mon.0@0(probing).data_health(0) update_stats avail 71% total 187 GB, used 44468 MB, avail 134 GB
> 2016-07-03 17:11:36.543002 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38556/0)
> 2016-07-03 17:11:41.382221 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:0/976433073 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.2:0/976433073 (socket is 192.168.2.2:58252/0)
> 2016-07-03 17:11:51.382170 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:0/976433073 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.2:0/976433073 (socket is 192.168.2.2:58253/0)
> 2016-07-03 17:12:06.543010 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38602/0)
> 2016-07-03 17:12:16.543013 7f8280f81700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38618/0)
> 2016-07-03 17:12:25.778424 7f828408f700 0 mon.0@0(probing).data_health(0) update_stats avail 71% total 187 GB, used 44468 MB, avail 134 GB
> 2016-07-03 17:12:36.151275 7f8280f81700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853 (socket is 192.168.2.6:37615/0)
> 2016-07-03 17:12:46.151131 7f8280f81700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853 (socket is 192.168.2.6:37627/0)
> 2016-07-03 17:12:50.102454 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.4:6789/0 pipe(0x7f82640055c0 sd=26 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264006a80).accept connect_seq 0 vs existing 0 state connecting
> 2016-07-03 17:12:50.108055 7f828388e700 0 log_channel(cluster) log [INF] : mon.0 calling new monitor election
> 2016-07-03 17:12:50.108235 7f828388e700 1 mon.0@0(electing).elector(12184) init, last seen epoch 12184
> 2016-07-03 17:12:50.116334 7f828388e700 0 log_channel(cluster) log [INF] : mon.0 calling new monitor election
> 2016-07-03 17:12:50.116402 7f828388e700 1 mon.0@0(electing).elector(12195) init, last seen epoch 12195
> 2016-07-03 17:12:51.382215 7f8280f81700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:0/976433073 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.2:0/976433073 (socket is 192.168.2.2:58259/0)
> 2016-07-03 17:12:55.120599 7f828408f700 0 log_channel(cluster) log [INF] : mon.0@0 won leader election with quorum 0,2,3
> 2016-07-03 17:12:55.122490 7f828408f700 0 log_channel(cluster) log [INF] : HEALTH_ERR; 187 pgs are stuck inactive for more than 300 seconds; 114 pgs degraded; 138 pgs peering; 49 pgs stale; 138 pgs stuck inactive; 49 pgs stuck stale; 114 pgs stuck unclean; 114 pgs undersized; recovery 1019372/17575834 objects degraded (5.800%); too many PGs per OSD (449 > max 300); 1/4 in osds are down; noout flag(s) set; 2 mons down, quorum 0,2,3 0,3,4
> 2016-07-03 17:12:55.193094 7f828388e700 0 log_channel(cluster) log [INF] : monmap e10: 5 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,3=192.168.2.4:6789/0,4=192.168.2.5:6789/0,5=192.168.2.6:6789/0}
> 2016-07-03 17:12:55.193254 7f828388e700 0 log_channel(cluster) log [INF] : pgmap v17870289: 768 pgs: 49 stale+active+clean, 114 active+undersized+degraded, 138 peering, 467 active+clean; 7128 GB data, 16620 GB used, 4824 GB / 22356 GB avail; 1019372/17575834 objects degraded (5.800%)
> 2016-07-03 17:12:55.195553 7f828388e700 -1 mon/MDSMonitor.cc: In function 'bool MDSMonitor::maybe_promote_standby(std::shared_ptr<Filesystem>)' thread 7f828388e700 time 2016-07-03 17:12:55.193360
> mon/MDSMonitor.cc: 2796: FAILED assert(info.state == MDSMap::STATE_STANDBY)
>
> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x556e001f1e12]
> 2: (MDSMonitor::maybe_promote_standby(std::shared_ptr<Filesystem>)+0x97f) [0x556dffede53f]
> 3: (MDSMonitor::tick()+0x3b6) [0x556dffee0866]
> 4: (MDSMonitor::on_active()+0x28) [0x556dffed9038]
> 5: (PaxosService::_active()+0x66a) [0x556dffe5968a]
> 6: (Context::complete(int)+0x9) [0x556dffe249a9]
> 7: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x556dffe2ba7c]
> 8: (Paxos::finish_round()+0xd0) [0x556dffe50460]
> 9: (Paxos::handle_last(std::shared_ptr<MonOpRequest>)+0x103d) [0x556dffe51acd]
> 10: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x38c) [0x556dffe5254c]
> 11: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xd3b) [0x556dffe2245b]
> 12: (Monitor::_ms_dispatch(Message*)+0x581) [0x556dffe22b91]
> 13: (Monitor::ms_dispatch(Message*)+0x23) [0x556dffe41393]
> 14: (DispatchQueue::entry()+0x7ba) [0x556e002e722a]
> 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x556e001d62cd]
> 16: (()+0x74a4) [0x7f8290f904a4]
> 17: (clone()+0x6d) [0x7f828f29298d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> If asked, I'll dump the rest of the log.
>
>
> Bill Sharer
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com