assertion error trying to start mds server

I've been updating my Gentoo-based cluster, both with new hardware and
with a somewhat postponed software update.  This includes some major
changes: the switch from gcc 4.x to 5.4.0 on the existing hardware, and
gcc 6.4.0 on the new hardware to make better use of AMD Ryzen.  The
existing cluster was on 10.2.2, and I was going to 10.2.7-r1 as an
interim step before moving on to 12.2.0 to begin transitioning to
bluestore on the OSDs.

The Ryzen units are slated to be bluestore-based OSD servers if and when
I get to that point.  Up until the MDS failure, they were simply cephfs
clients.  I had three OSD servers updated to 10.2.7-r1 (one is also a
MON) and had two servers left to update.  Both of these are also MONs
and were acting as a pair of dual active MDS servers running 10.2.2.
Monday morning I found out the hard way that the UPS one of them was on
has a dead battery.  After I fsck'd it and it came back up, I saw the
following assertion error when it tried to start its mds.B server:


==== mdsbeacon(64162/B up:replay seq 3 v4699) v7 ==== 126+0+0 (709014160 0 0) 0x7f6fb4001bc0 con 0x55f94779d8d0
     0> 2017-10-09 11:43:06.935662 7f6fa9ffb700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7f6fa9ffb700 time 2017-10-09 11:43:06.934972
mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x55f93d64a122]
 2: (EImportStart::replay(MDSRank*)+0x9ce) [0x55f93d52a5ce]
 3: (MDLog::_replay_thread()+0x4f4) [0x55f93d4a8e34]
 4: (MDLog::ReplayThread::entry()+0xd) [0x55f93d25bd4d]
 5: (()+0x74a4) [0x7f6fd009b4a4]
 6: (clone()+0x6d) [0x7f6fce5a598d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.B.log



While googling around, I ran into this CERN presentation and first tried
the offline backward scrubbing commands on slide 25:

https://indico.cern.ch/event/531810/contributions/2309925/attachments/1357386/2053998/GoncaloBorges-HEPIX16-v3.pdf
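
For reference, here is a rough sketch of what that offline backward scrub
looks like when scripted.  It assumes the slide's commands are the two
cephfs-data-scan passes (scan_extents, then scan_inodes) run against the
data pool, and that the data pool is named cephfs_data; the helper name
backward_scrub_commands is made up for illustration.  It only prints the
command lines and does not run anything:

```python
import shlex


def backward_scrub_commands(data_pool):
    """Build the argv lists for the two offline scan passes.

    Assumption: the scrub consists of scan_extents followed by
    scan_inodes, both pointed at the CephFS data pool.
    """
    return [
        ["cephfs-data-scan", "scan_extents", data_pool],
        ["cephfs-data-scan", "scan_inodes", data_pool],
    ]


# Dry run: print the commands in order instead of executing them.
for argv in backward_scrub_commands("cephfs_data"):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The passes are order-dependent (extents first, then inodes), and no MDS
should be active while they run.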


Both ran without any messages, so I'm assuming the contents of the
cephfs_data and cephfs_metadata pools are sane.  Still no luck getting
things restarted, so I tried the cephfs-journal-tool journal reset on
slide 23.  That didn't work either.  Just for giggles, I tried setting up
the two Ryzen boxes as new mds.C and mds.D servers, which would run
10.2.7-r1, instead of using mds.A and mds.B (10.2.2).  The D server fails
with the same assert, as follows:


=== 132+0+1979520 (4198351460 0 1611007530) 0x7fffc4000a70 con 0x7fffe0013310
     0> 2017-10-09 13:01:31.571195 7fffd99f5700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7fffd99f5700 time 2017-10-09 13:01:31.570608
mds/journal.cc: 2949: FAILED assert(mds->sessionmap.get_version() == cmapv)

 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x555555b7ebc8]
 2: (EImportStart::replay(MDSRank*)+0x9ea) [0x555555a5674a]
 3: (MDLog::_replay_thread()+0xe51) [0x5555559cef21]
 4: (MDLog::ReplayThread::entry()+0xd) [0x5555557778cd]
 5: (()+0x7364) [0x7ffff7bc5364]
 6: (clone()+0x6d) [0x7ffff6051ccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
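
To double-check that both daemons really die on the same assertion (the
source line numbers differ, 2929 vs 2949), the relevant lines can be
pulled out of each log and compared.  A minimal sketch, assuming the log
lines are unwrapped as in the real log files; the function name
summarize_crash and the two sample strings are mine and only hold short
excerpts of the dumps above:

```python
import re

# Matches e.g.:
#   mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)
ASSERT_RE = re.compile(r"^(\S+): (\d+): FAILED assert\((.*)\)$")
# Matches e.g.:
#   ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
VERSION_RE = re.compile(r"^ceph version (\S+)")


def summarize_crash(log_text):
    """Return (version, source_file, line_number, assert_expr) for the
    first failed assert and first version line found; missing fields
    are None."""
    version = source = line_no = expr = None
    for raw in log_text.splitlines():
        line = raw.strip()
        m = ASSERT_RE.match(line)
        if m and expr is None:
            source, line_no, expr = m.group(1), int(m.group(2)), m.group(3)
        v = VERSION_RE.match(line)
        if v and version is None:
            version = v.group(1)
    return version, source, line_no, expr


# Excerpts copied from the two crash dumps in this message.
MDS_B_LOG = """
mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
"""
MDS_D_LOG = """
mds/journal.cc: 2949: FAILED assert(mds->sessionmap.get_version() == cmapv)

 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
"""

b = summarize_crash(MDS_B_LOG)
d = summarize_crash(MDS_D_LOG)
# Same file and expression on both versions means the same check fails.
print("same assert:", (b[1], b[3]) == (d[1], d[3]))
print("versions:", b[0], "vs", d[0])
```

Comparing the file and expression rather than the line number avoids a
false mismatch when the source shifts between releases.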

