On Fri, Jul 27, 2018 at 4:47 PM Guillaume Lefranc <guillaume@xxxxxxxxxxxx> wrote:
>
> Hi,
>
> I am trying to repair a failed cluster with multiple MDS, but the failed MDS crashes on restart and won't stay up. I could not find a bug report for that specific failure. Here are the logs:
>
> -9> 2018-07-27 10:40:45.591137 7f239ae9a700 5 mds.lift-2 handle_mds_map epoch 3562 from mds.2
> -8> 2018-07-27 10:40:45.591138 7f239ae9a700 5 mds.lift-2 old map epoch 3562 <= 3562, discarding
> -7> 2018-07-27 10:40:45.593404 7f239d66d700 5 -- 10.0.10.106:6800/3953731285 >> 10.0.10.6:6801/3462988481 conn(0x55cdc9cb9000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=233 cs=1 l=0). rx mds.2 seq 4 0x55cdc9a85800 cache_rejoin strong v1
> -6> 2018-07-27 10:40:45.593430 7f239d66d700 5 -- 10.0.10.106:6800/3953731285 >> 10.0.10.6:6801/3462988481 conn(0x55cdc9cb9000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=233 cs=1 l=0). rx mds.2 seq 5 0x55cdc9a85e00 cache_rejoin ack v1
> -5> 2018-07-27 10:40:45.593470 7f239ae9a700 1 -- 10.0.10.106:6800/3953731285 <== mds.2 10.0.10.6:6801/3462988481 4 ==== cache_rejoin strong v1 ==== 549811+0+0 (3780490909 0 0) 0x55cdc9a85800 con 0x55cdc9cb9000
> -4> 2018-07-27 10:40:45.611787 7f239ae9a700 1 -- 10.0.10.106:6800/3953731285 <== mds.2 10.0.10.6:6801/3462988481 5 ==== cache_rejoin ack v1 ==== 84+0+0 (1576247674 0 0) 0x55cdc9a85e00 con 0x55cdc9cb9000
> -3> 2018-07-27 10:40:45.659162 7f239ce6c700 5 -- 10.0.10.106:6800/3953731285 >> 10.0.10.7:6804/2014675692 conn(0x55cdc9cb7800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=291 cs=1 l=0). rx mds.1 seq 2 0x55cdc9a84600 cache_rejoin strong v1
> -2> 2018-07-27 10:40:45.659244 7f239ce6c700 5 -- 10.0.10.106:6800/3953731285 >> 10.0.10.7:6804/2014675692 conn(0x55cdc9cb7800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=291 cs=1 l=0). rx mds.1 seq 3 0x55cdc9a84c00 cache_rejoin ack v1
> -1> 2018-07-27 10:40:45.659285 7f239ae9a700 1 -- 10.0.10.106:6800/3953731285 <== mds.1 10.0.10.7:6804/2014675692 2 ==== cache_rejoin strong v1 ==== 1435225+0+0 (4071039122 0 0) 0x55cdc9a84600 con 0x55cdc9cb7800
> 0> 2018-07-27 10:40:45.664103 7f239ae9a700 -1 /build/ceph-12.2.7/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin*)' thread 7f239ae9a700 time 2018-07-27 10:40:45.659346
> /build/ceph-12.2.7/src/mds/MDCache.cc: 4632: FAILED assert(in)
>
> ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55cdc05b0ac2]
> 2: (MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin*)+0x1b2) [0x55cdc0360662]
> 3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x1eb) [0x55cdc036424b]
> 4: (MDCache::dispatch(Message*)+0xa5) [0x55cdc03727b5]
> 5: (MDSRank::handle_deferrable_message(Message*)+0x66c) [0x55cdc023b55c]
> 6: (MDSRank::_dispatch(Message*, bool)+0x20b) [0x55cdc024a57b]
> 7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55cdc024b385]
> 8: (MDSDaemon::ms_dispatch(Message*)+0x1db) [0x55cdc02336bb]
> 9: (DispatchQueue::entry()+0xeda) [0x55cdc090238a]
> 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55cdc06439fd]
> 11: (()+0x7494) [0x7f23a09ec494]
> 12: (clone()+0x3f) [0x7f239fa64aff]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>

try restarting all mds

> Please advise.
> Thanks
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com