I was working on a rolling upgrade on Gentoo from Jewel 10.2.0 to 10.2.2.
Now I can't get a monitor quorum going again: as soon as one forms, the mon
that wins the election blows out with an assertion failure. Here's my status
at the moment:
kroll1  10.2.2  ceph mon.0 and ceph osd.0  (normally my lead mon)
kroll2  10.2.2  ceph mon.1 and ceph osd.2
kroll3  10.2.2  ceph osd.1
kroll4  10.2.2  ceph mon.3 and ceph osd.3
kroll5  10.2.2  ceph mon.4 and ceph mds 2  (normally my active mds)
kroll6  10.2.0  ceph mon.5 and ceph mds B  (normally my standby mds)
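For reference, here's roughly how I've been checking what each box is
actually running (local commands only, so they work even without quorum;
the daemon ids of course differ per host):

    ceph -v                      # version of the locally installed binaries
    ceph daemon mon.0 version    # ask the local mon over its admin socket
    ceph daemon osd.0 version    # same for whichever osd runs on that host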
I had done the rolling upgrade of everything except kroll6 and had rebooted
the first three osd/mon servers. mds 2 went down during the Gentoo update of
kroll5 because of memory scarcity, so mds B became the active mds. After
rebooting kroll4 I found that mon.0 had gone down with the assertion failure.
I ended up stopping all ceph processes; the desktops with client mounts were
still up for the moment, but they would basically hang on locks if I tried to
access cephfs.
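For what it's worth, stopping and starting daemons here just means the usual
init scripts plus the local admin socket to watch mon state, roughly like
this (the OpenRC script names are only how my boxes happen to be laid out,
so treat them as illustrative):

    /etc/init.d/ceph-mon.0 stop       # likewise the ceph-osd/ceph-mds scripts on each box
    /etc/init.d/ceph-mon.0 start      # later, bringing mons back one host at a time
    ceph daemon mon.0 mon_status      # watch the local mon's state via its admin socket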
After restarting only the mons, beginning with mon.0, the following happened
to mon.0 once enough mons were up for a quorum:
2016-07-03 16:34:26.555728 7fbff22f8480 1 leveldb: Recovering log #2592390
2016-07-03 16:34:26.555762 7fbff22f8480 1 leveldb: Level-0 table #2592397: started
2016-07-03 16:34:26.558788 7fbff22f8480 1 leveldb: Level-0 table #2592397: 192 bytes OK
2016-07-03 16:34:26.562263 7fbff22f8480 1 leveldb: Delete type=3 #2592388
2016-07-03 16:34:26.562364 7fbff22f8480 1 leveldb: Delete type=0 #2592390
2016-07-03 16:34:26.563126 7fbff22f8480 -1 wrote monmap to /etc/ceph/tmpmonmap
2016-07-03 17:09:25.753729 7f8291dff480 0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-mon, pid 20842
2016-07-03 17:09:25.762588 7f8291dff480 1 leveldb: Recovering log #2592398
2016-07-03 17:09:25.767722 7f8291dff480 1 leveldb: Delete type=0 #2592398
2016-07-03 17:09:25.767803 7f8291dff480 1 leveldb: Delete type=3 #2592396
2016-07-03 17:09:25.768600 7f8291dff480 0 starting mon.0 rank 0 at 192.168.2.1:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2016-07-03 17:09:25.769066 7f8291dff480 1 mon.0@-1(probing) e10 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2016-07-03 17:09:25.769923 7f8291dff480 1 mon.0@-1(probing).paxosservice(pgmap 17869652..17870289) refresh upgraded, format 0 -> 1
2016-07-03 17:09:25.769947 7f8291dff480 1 mon.0@-1(probing).pg v0 on_upgrade discarding in-core PGMap
2016-07-03 17:09:25.776148 7f8291dff480 0 mon.0@-1(probing).mds e1532 print_map
e1532
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
Filesystem 'cephfs' (0)
fs_name cephfs
epoch 1530
flags 0
modified 2016-05-19 01:21:31.953710
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
last_failure 1478
last_failure_osd_epoch 26431
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
max_mds 1
in 0
up {0=1190233}
failed
damaged
stopped
data_pools 0
metadata_pool 1
inline_data disabled
1190233: 192.168.2.6:6800/5437 'B' mds.0.1526 up:active seq 103145
Standby daemons:
1190222: 192.168.2.5:6801/5871 '2' mds.-1.0 up:standby seq 135114
2016-07-03 17:09:25.776444 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
2016-07-03 17:09:25.776450 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
2016-07-03 17:09:25.776453 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
2016-07-03 17:09:25.776454 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
2016-07-03 17:09:25.776696 7f8291dff480 1 mon.0@-1(probing).paxosservice(auth 19251..19344) refresh upgraded, format 0 -> 1
2016-07-03 17:09:25.777650 7f8291dff480 0 mon.0@-1(probing) e10 my rank is now 0 (was -1)
2016-07-03 17:09:25.778191 7f8281e89700 0 -- 192.168.2.1:6789/0 >> 192.168.2.5:6789/0 pipe(0x556e0ab1d360 sd=15 :0 s=1 pgs=0 cs=0 l=0 c=0x556e0ab1e620).fault
2016-07-03 17:09:25.778257 7f828208b700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:6789/0 pipe(0x556e0ab17c80 sd=13 :0 s=1 pgs=0 cs=0 l=0 c=0x556e0ab18f40).fault
2016-07-03 17:09:25.778314 7f8281d88700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:6789/0 pipe(0x556e0ab1fe50 sd=16 :0 s=1 pgs=0 cs=0 l=0 c=0x556e0ab21160).fault
2016-07-03 17:09:25.778364 7f8281f8a700 0 -- 192.168.2.1:6789/0 >> 192.168.2.4:6789/0 pipe(0x556e0ab1a7e0 sd=14 :0 s=1 pgs=0 cs=0 l=0 c=0x556e0ab1baa0).fault
2016-07-03 17:10:25.777442 7f828408f700 0 mon.0@0(probing).data_health(0) update_stats avail 71% total 187 GB, used 44468 MB, avail 134 GB
2016-07-03 17:10:26.151363 7f8281486700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853 (socket is 192.168.2.6:37584/0)
2016-07-03 17:10:26.543059 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f8264002f40 sd=26 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264004250).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38419/0)
2016-07-03 17:10:36.543026 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38437/0)
2016-07-03 17:10:46.151145 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853 (socket is 192.168.2.6:37586/0)
2016-07-03 17:10:56.543087 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38475/0)
2016-07-03 17:11:06.542941 7f8281082700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38494/0)
2016-07-03 17:11:25.777920 7f828408f700 0 mon.0@0(probing).data_health(0) update_stats avail 71% total 187 GB, used 44468 MB, avail 134 GB
2016-07-03 17:11:36.543002 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38556/0)
2016-07-03 17:11:41.382221 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:0/976433073 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.2:0/976433073 (socket is 192.168.2.2:58252/0)
2016-07-03 17:11:51.382170 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:0/976433073 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.2:0/976433073 (socket is 192.168.2.2:58253/0)
2016-07-03 17:12:06.543010 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38602/0)
2016-07-03 17:12:16.543013 7f8280f81700 0 -- 192.168.2.1:6789/0 >> 192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026 (socket is 192.168.2.1:38618/0)
2016-07-03 17:12:25.778424 7f828408f700 0 mon.0@0(probing).data_health(0) update_stats avail 71% total 187 GB, used 44468 MB, avail 134 GB
2016-07-03 17:12:36.151275 7f8280f81700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853 (socket is 192.168.2.6:37615/0)
2016-07-03 17:12:46.151131 7f8280f81700 0 -- 192.168.2.1:6789/0 >> 192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853 (socket is 192.168.2.6:37627/0)
2016-07-03 17:12:50.102454 7f8281284700 0 -- 192.168.2.1:6789/0 >> 192.168.2.4:6789/0 pipe(0x7f82640055c0 sd=26 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264006a80).accept connect_seq 0 vs existing 0 state connecting
2016-07-03 17:12:50.108055 7f828388e700 0 log_channel(cluster) log [INF] : mon.0 calling new monitor election
2016-07-03 17:12:50.108235 7f828388e700 1 mon.0@0(electing).elector(12184) init, last seen epoch 12184
2016-07-03 17:12:50.116334 7f828388e700 0 log_channel(cluster) log [INF] : mon.0 calling new monitor election
2016-07-03 17:12:50.116402 7f828388e700 1 mon.0@0(electing).elector(12195) init, last seen epoch 12195
2016-07-03 17:12:51.382215 7f8280f81700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:0/976433073 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.2:0/976433073 (socket is 192.168.2.2:58259/0)
2016-07-03 17:12:55.120599 7f828408f700 0 log_channel(cluster) log [INF] : mon.0@0 won leader election with quorum 0,2,3
2016-07-03 17:12:55.122490 7f828408f700 0 log_channel(cluster) log [INF] : HEALTH_ERR; 187 pgs are stuck inactive for more than 300 seconds; 114 pgs degraded; 138 pgs peering; 49 pgs stale; 138 pgs stuck inactive; 49 pgs stuck stale; 114 pgs stuck unclean; 114 pgs undersized; recovery 1019372/17575834 objects degraded (5.800%); too many PGs per OSD (449 > max 300); 1/4 in osds are down; noout flag(s) set; 2 mons down, quorum 0,2,3 0,3,4
2016-07-03 17:12:55.193094 7f828388e700 0 log_channel(cluster) log [INF] : monmap e10: 5 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,3=192.168.2.4:6789/0,4=192.168.2.5:6789/0,5=192.168.2.6:6789/0}
2016-07-03 17:12:55.193254 7f828388e700 0 log_channel(cluster) log [INF] : pgmap v17870289: 768 pgs: 49 stale+active+clean, 114 active+undersized+degraded, 138 peering, 467 active+clean; 7128 GB data, 16620 GB used, 4824 GB / 22356 GB avail; 1019372/17575834 objects degraded (5.800%)
2016-07-03 17:12:55.195553 7f828388e700 -1 mon/MDSMonitor.cc: In function 'bool MDSMonitor::maybe_promote_standby(std::shared_ptr<Filesystem>)' thread 7f828388e700 time 2016-07-03 17:12:55.193360
mon/MDSMonitor.cc: 2796: FAILED assert(info.state == MDSMap::STATE_STANDBY)
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x556e001f1e12]
2: (MDSMonitor::maybe_promote_standby(std::shared_ptr<Filesystem>)+0x97f) [0x556dffede53f]
3: (MDSMonitor::tick()+0x3b6) [0x556dffee0866]
4: (MDSMonitor::on_active()+0x28) [0x556dffed9038]
5: (PaxosService::_active()+0x66a) [0x556dffe5968a]
6: (Context::complete(int)+0x9) [0x556dffe249a9]
7: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x556dffe2ba7c]
8: (Paxos::finish_round()+0xd0) [0x556dffe50460]
9: (Paxos::handle_last(std::shared_ptr<MonOpRequest>)+0x103d) [0x556dffe51acd]
10: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x38c) [0x556dffe5254c]
11: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xd3b) [0x556dffe2245b]
12: (Monitor::_ms_dispatch(Message*)+0x581) [0x556dffe22b91]
13: (Monitor::ms_dispatch(Message*)+0x23) [0x556dffe41393]
14: (DispatchQueue::entry()+0x7ba) [0x556e002e722a]
15: (DispatchQueue::DispatchThread::entry()+0xd) [0x556e001d62cd]
16: (()+0x74a4) [0x7f8290f904a4]
17: (clone()+0x6d) [0x7f828f29298d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
If asked, I'll dump the rest of the log.
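If a more verbose log would help, I can also re-run that mon in the
foreground with debugging turned up, roughly:

    ceph-mon -i 0 -d --debug_mon 20 --debug_ms 1    # foreground, verbose monitor + messenger logging

and capture stderr.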
Bill Sharer