I was working on a rolling upgrade on Gentoo from Jewel 10.2.0 to 10.2.2.
Now I can't get a monitor quorum going again: as soon as one forms, the mon
that wins the election blows out with an assertion failure. Here's my status
at the moment:
kroll1  10.2.2  ceph mon.0 and ceph osd.0  (normally my lead mon)
kroll2  10.2.2  ceph mon.1 and ceph osd.2
kroll3  10.2.2  ceph osd.1
kroll4  10.2.2  ceph mon.3 and ceph osd.3
kroll5  10.2.2  ceph mon.4 and ceph mds 2  (normally my active mds)
kroll6  10.2.0  ceph mon.5 and ceph mds B  (normally my standby mds)
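For reference, here's roughly how I've been checking what each box is
actually running (local commands only, so they work even without quorum;
the daemon ids of course differ per host):

    ceph -v                      # version of the locally installed binaries
    ceph daemon mon.0 version    # ask the local mon over its admin socket
    ceph daemon osd.0 version    # same for whichever osd runs on that host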
I had done the rolling upgrade of everything except kroll6 and had rebooted
the first three osd/mon servers. mds 2 went down during the Gentoo update of
kroll5 because of memory scarcity, so mds B became the active mds. After
rebooting kroll4 I found that mon.0 had gone down with the assertion failure.
I ended up stopping all ceph processes; the desktops with client mounts were
still up for the moment, but they would basically hang on locks if I tried to
access cephfs.
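For what it's worth, stopping and starting daemons here just means the usual
init scripts plus the local admin socket to watch mon state, roughly like
this (the OpenRC script names are only how my boxes happen to be laid out,
so treat them as illustrative):

    /etc/init.d/ceph-mon.0 stop       # likewise the ceph-osd/ceph-mds scripts on each box
    /etc/init.d/ceph-mon.0 start      # later, bringing mons back one host at a time
    ceph daemon mon.0 mon_status      # watch the local mon's state via its admin socket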
After restarting only the mons, beginning with mon.0, the following happened
to mon.0 once enough mons were up for a quorum:
2016-07-03 16:34:26.555728 7fbff22f8480 1 leveldb: Recovering log #2592390
2016-07-03 16:34:26.555762 7fbff22f8480 1 leveldb: Level-0 table #2592397: started
2016-07-03 16:34:26.558788 7fbff22f8480 1 leveldb: Level-0 table #2592397: 192 bytes OK
2016-07-03 16:34:26.562263 7fbff22f8480 1 leveldb: Delete type=3 #2592388
2016-07-03 16:34:26.562364 7fbff22f8480 1 leveldb: Delete type=0 #2592390
2016-07-03 16:34:26.563126 7fbff22f8480 -1 wrote monmap to /etc/ceph/tmpmonmap
2016-07-03 17:09:25.753729 7f8291dff480 0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-mon, pid 20842
2016-07-03 17:09:25.762588 7f8291dff480 1 leveldb: Recovering log #2592398
2016-07-03 17:09:25.767722 7f8291dff480 1 leveldb: Delete type=0 #2592398
2016-07-03 17:09:25.767803 7f8291dff480 1 leveldb: Delete type=3 #2592396
2016-07-03 17:09:25.768600 7f8291dff480 0 starting mon.0 rank 0 at 192.168.2.1:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2016-07-03 17:09:25.769066 7f8291dff480 1 mon.0@-1(probing) e10 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2016-07-03 17:09:25.769923 7f8291dff480 1 mon.0@-1(probing).paxosservice(pgmap 17869652..17870289) refresh upgraded, format 0 -> 1
2016-07-03 17:09:25.769947 7f8291dff480 1 mon.0@-1(probing).pg v0 on_upgrade discarding in-core PGMap
2016-07-03 17:09:25.776148 7f8291dff480 0 mon.0@-1(probing).mds e1532 print_map
e1532
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
Filesystem 'cephfs' (0)
fs_name cephfs
epoch 1530
flags 0
modified 2016-05-19 01:21:31.953710
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
last_failure 1478
last_failure_osd_epoch 26431
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
max_mds 1
in 0
up {0=1190233}
failed
damaged
stopped
data_pools 0
metadata_pool 1
inline_data disabled
1190233: 192.168.2.6:6800/5437 'B' mds.0.1526 up:active seq 103145
Standby daemons:
1190222: 192.168.2.5:6801/5871 '2' mds.-1.0 up:standby seq 135114
2016-07-03 17:09:25.776444 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
2016-07-03 17:09:25.776450 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
2016-07-03 17:09:25.776453 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
2016-07-03 17:09:25.776454 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
2016-07-03 17:09:25.776696 7f8291dff480 1 mon.0@-1(probing).paxosservice(auth 19251..19344) refresh upgraded, format 0 -> 1
2016-07-03 17:09:25.777650 7f8291dff480 0 mon.0@-1(probing) e10 my rank is now 0 (was -1)
2016-07-03 17:09:25.778191 7f8281e89700 0 -- 192.168.2.1:6789/0 >> 192.168.2.5:6789/0 pipe(0x556e0ab1d360 sd=15 :0 s=1 pgs=0 cs=0 l=0 c=0x556e0ab1e620).fault
2016-07-03 17:09:25.778257 7f828208b700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:6789/0 pipe(0x556e0ab17c80 sd=13 :0 s=1 pgs=0 cs=0 l=0 c=0x556e0ab18f40).fault
2016-07-03 17:09:25.778314 7f8281d88700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:6789/0 pipe(0x556e0ab1fe50 sd=16 :0 s=1 pgs=0 cs=0 l=0 c=0x556e0ab21160).fault
2016-07-03 17:09:25.778364 7f8281f8a700 0 -- 192.168.2.1:6789/0 >> 192.168.2.4:6789/0 pipe(0x556e0ab1a7e0 sd=14 :0 s=1 pgs=0 cs=0 l=0 c=0x556e0ab1baa0).fault
2016-07-03 17:10:25.777442 7f828408f700 0 mon.0@0(probing).data_health(0) update_stats avail 71% total 187 GB, used 44468 MB, avail 134 GB
2016-07-03 17:10:26.151363 7f8281486700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853 (socket is 192.168.2.6:37584/0)
2016-07-03 17:10:26.543059 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f8264002f40 sd=26 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264004250).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38419/0)
2016-07-03 17:10:36.543026 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38437/0)
2016-07-03 17:10:46.151145 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853 (socket is 192.168.2.6:37586/0)
2016-07-03 17:10:56.543087 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38475/0)
2016-07-03 17:11:06.542941 7f8281082700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38494/0)
2016-07-03 17:11:25.777920 7f828408f700 0 mon.0@0(probing).data_health(0) update_stats avail 71% total 187 GB, used 44468 MB, avail 134 GB
2016-07-03 17:11:36.543002 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38556/0)
2016-07-03 17:11:41.382221 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:0/976433073 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.2:0/976433073 (socket is 192.168.2.2:58252/0)
2016-07-03 17:11:51.382170 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:0/976433073 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.2:0/976433073 (socket is 192.168.2.2:58253/0)
2016-07-03 17:12:06.543010 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38602/0)
2016-07-03 17:12:16.543013 7f8280f81700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38618/0)
2016-07-03 17:12:25.778424 7f828408f700 0 mon.0@0(probing).data_health(0) update_stats avail 71% total 187 GB, used 44468 MB, avail 134 GB
2016-07-03 17:12:36.151275 7f8280f81700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853 (socket is 192.168.2.6:37615/0)
2016-07-03 17:12:46.151131 7f8280f81700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853 (socket is 192.168.2.6:37627/0)
2016-07-03 17:12:50.102454 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.4:6789/0 pipe(0x7f82640055c0 sd=26 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264006a80).accept connect_seq 0 vs existing 0 state connecting
2016-07-03 17:12:50.108055 7f828388e700 0 log_channel(cluster) log [INF] : mon.0 calling new monitor election
2016-07-03 17:12:50.108235 7f828388e700 1 mon.0@0(electing).elector(12184) init, last seen epoch 12184
2016-07-03 17:12:50.116334 7f828388e700 0 log_channel(cluster) log [INF] : mon.0 calling new monitor election
2016-07-03 17:12:50.116402 7f828388e700 1 mon.0@0(electing).elector(12195) init, last seen epoch 12195
2016-07-03 17:12:51.382215 7f8280f81700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:0/976433073 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.2:0/976433073 (socket is 192.168.2.2:58259/0)
2016-07-03 17:12:55.120599 7f828408f700 0 log_channel(cluster) log [INF] : mon.0@0 won leader election with quorum 0,2,3
2016-07-03 17:12:55.122490 7f828408f700 0 log_channel(cluster) log [INF] : HEALTH_ERR; 187 pgs are stuck inactive for more than 300 seconds; 114 pgs degraded; 138 pgs peering; 49 pgs stale; 138 pgs stuck inactive; 49 pgs stuck stale; 114 pgs stuck unclean; 114 pgs undersized; recovery 1019372/17575834 objects degraded (5.800%); too many PGs per OSD (449 > max 300); 1/4 in osds are down; noout flag(s) set; 2 mons down, quorum 0,2,3 0,3,4
2016-07-03 17:12:55.193094 7f828388e700 0 log_channel(cluster) log [INF] : monmap e10: 5 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,3=192.168.2.4:6789/0,4=192.168.2.5:6789/0,5=192.168.2.6:6789/0}
2016-07-03 17:12:55.193254 7f828388e700 0 log_channel(cluster) log [INF] : pgmap v17870289: 768 pgs: 49 stale+active+clean, 114 active+undersized+degraded, 138 peering, 467 active+clean; 7128 GB data, 16620 GB used, 4824 GB / 22356 GB avail; 1019372/17575834 objects degraded (5.800%)
2016-07-03 17:12:55.195553 7f828388e700 -1 mon/MDSMonitor.cc: In function 'bool MDSMonitor::maybe_promote_standby(std::shared_ptr<Filesystem>)' thread 7f828388e700 time 2016-07-03 17:12:55.193360
mon/MDSMonitor.cc: 2796: FAILED assert(info.state == MDSMap::STATE_STANDBY)
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x556e001f1e12]
2: (MDSMonitor::maybe_promote_standby(std::shared_ptr<Filesystem>)+0x97f) [0x556dffede53f]
3: (MDSMonitor::tick()+0x3b6) [0x556dffee0866]
4: (MDSMonitor::on_active()+0x28) [0x556dffed9038]
5: (PaxosService::_active()+0x66a) [0x556dffe5968a]
6: (Context::complete(int)+0x9) [0x556dffe249a9]
7: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x556dffe2ba7c]
8: (Paxos::finish_round()+0xd0) [0x556dffe50460]
9: (Paxos::handle_last(std::shared_ptr<MonOpRequest>)+0x103d) [0x556dffe51acd]
10: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x38c) [0x556dffe5254c]
11: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xd3b) [0x556dffe2245b]
12: (Monitor::_ms_dispatch(Message*)+0x581) [0x556dffe22b91]
13: (Monitor::ms_dispatch(Message*)+0x23) [0x556dffe41393]
14: (DispatchQueue::entry()+0x7ba) [0x556e002e722a]
15: (DispatchQueue::DispatchThread::entry()+0xd) [0x556e001d62cd]
16: (()+0x74a4) [0x7f8290f904a4]
17: (clone()+0x6d) [0x7f828f29298d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
If asked, I'll dump the rest of the log.
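If a more verbose log would help, I can also re-run that mon in the
foreground with debugging turned up, roughly:

    ceph-mon -i 0 -d --debug_mon 20 --debug_ms 1    # foreground, verbose monitor + messenger logging

and capture stderr.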
Bill Sharer