I've been in the process of updating my Gentoo-based cluster, both with new hardware and a somewhat postponed software update. This includes some major changes, among them the switch from gcc 4.x to 5.4.0 on the existing hardware and the use of gcc 6.4.0 to make better use of AMD Ryzen on the new hardware. The existing cluster was on 10.2.2, but I was moving to 10.2.7-r1 as an interim step before going on to 12.2.0 and transitioning the OSDs to bluestore. The Ryzen units are slated to become bluestore-based OSD servers if and when I get to that point; up until the mds failure, they were simply cephfs clients.

I had three OSD servers updated to 10.2.7-r1 (one is also a MON) and two servers left to update. Both of those are also MONs and were acting as a pair of dual active MDS servers running 10.2.2. Monday morning I found out the hard way that the UPS one of them was on has a dead battery. After the machine fsck'd and came back up, I saw the following assertion error when it was trying to start its mds.B server:

 ==== mdsbeacon(64162/B up:replay seq 3 v4699) v7 ==== 126+0+0 (709014160 0 0) 0x7f6fb4001bc0 con 0x55f94779d8d0
     0> 2017-10-09 11:43:06.935662 7f6fa9ffb700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7f6fa9ffb700 time 2017-10-09 11:43:06.934972
 mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)
 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x55f93d64a122]
 2: (EImportStart::replay(MDSRank*)+0x9ce) [0x55f93d52a5ce]
 3: (MDLog::_replay_thread()+0x4f4) [0x55f93d4a8e34]
 4: (MDLog::ReplayThread::entry()+0xd) [0x55f93d25bd4d]
 5: (()+0x74a4) [0x7f6fd009b4a4]
 6: (clone()+0x6d) [0x7f6fce5a598d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/ceph-mds.B.log

When I was googling around, I ran into this Cern presentation and first tried the offline backward scrubbing commands on slide 25:

https://indico.cern.ch/event/531810/contributions/2309925/attachments/1357386/2053998/GoncaloBorges-HEPIX16-v3.pdf

Both ran without any messages, so I'm assuming the contents of the cephfs_data and cephfs_metadata pools are sane. Still no luck getting things restarted, so I tried the cephfs-journal-tool journal reset from slide 23. That didn't work either. Just for giggles, I tried setting up the two Ryzen boxes as new mds.C and mds.D servers running 10.2.7-r1, instead of using mds.A and mds.B (10.2.2).
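For what it's worth, the journal reset on slide 23 is only one step of the disaster-recovery sequence in the CephFS documentation; run in isolation it can leave the session table out of step with the journal, which is exactly what the sessionmap assert below is checking. A sketch of the fuller sequence, assuming jewel-era tooling (the run() wrapper is my own addition and only prints each command; swap it for real execution only with every MDS daemon stopped):

```shell
# Dry-run sketch of the documented CephFS journal recovery sequence
# (jewel-era cephfs-journal-tool / cephfs-table-tool). run() only
# echoes each command so the plan can be reviewed first.
run() { echo "+ $*"; }

recover_plan() {
    # 1. Export the journal first so the raw events survive a reset.
    run cephfs-journal-tool journal export /root/mds-journal.backup
    # 2. Flush recoverable dentries from the journal into the metadata pool.
    run cephfs-journal-tool event recover_dentries summary
    # 3. Erase the journal, discarding events that cannot be replayed
    #    (such as the EImportStart that trips the assert).
    run cephfs-journal-tool journal reset
    # 4. Reset the session table so it agrees with the now-empty journal;
    #    this is the piece the sessionmap assert says is out of sync.
    run cephfs-table-tool all reset session
}

recover_plan
```

The session-table reset in step 4 is the part the slides don't cover, and skipping it is one known way to keep hitting assert(mds->sessionmap.get_version() == cmapv) after a journal reset.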
The D server fails with the same assert, as follows:

 === 132+0+1979520 (4198351460 0 1611007530) 0x7fffc4000a70 con 0x7fffe0013310
     0> 2017-10-09 13:01:31.571195 7fffd99f5700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7fffd99f5700 time 2017-10-09 13:01:31.570608
 mds/journal.cc: 2949: FAILED assert(mds->sessionmap.get_version() == cmapv)
 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x555555b7ebc8]
 2: (EImportStart::replay(MDSRank*)+0x9ea) [0x555555a5674a]
 3: (MDLog::_replay_thread()+0xe51) [0x5555559cef21]
 4: (MDLog::ReplayThread::entry()+0xd) [0x5555557778cd]
 5: (()+0x7364) [0x7ffff7bc5364]
 6: (clone()+0x6d) [0x7ffff6051ccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com