Hi,
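I am trying to repair a failed cluster with multiple MDS daemons (ceph 12.2.7, luminous), but the failed MDS crashes on restart and won't stay up. It hits `FAILED assert(in)` in MDCache::handle_cache_rejoin_strong right after receiving a cache_rejoin strong message from mds.1. I could not find a bug report for this specific failure. Here are the logs: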
-9> 2018-07-27 10:40:45.591137 7f239ae9a700 5 mds.lift-2 handle_mds_map epoch 3562 from mds.2
-8> 2018-07-27 10:40:45.591138 7f239ae9a700 5 mds.lift-2 old map epoch 3562 <= 3562, discarding
-7> 2018-07-27 10:40:45.593404 7f239d66d700 5 -- 10.0.10.106:6800/3953731285 >> 10.0.10.6:6801/3462988481 conn(0x55cdc9cb9000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=233 cs=1 l=0). rx mds.2 seq 4 0x55cdc9a85800 cache_rejoin strong v1
-6> 2018-07-27 10:40:45.593430 7f239d66d700 5 -- 10.0.10.106:6800/3953731285 >> 10.0.10.6:6801/3462988481 conn(0x55cdc9cb9000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=233 cs=1 l=0). rx mds.2 seq 5 0x55cdc9a85e00 cache_rejoin ack v1
-5> 2018-07-27 10:40:45.593470 7f239ae9a700 1 -- 10.0.10.106:6800/3953731285 <== mds.2 10.0.10.6:6801/3462988481 4 ==== cache_rejoin strong v1 ==== 549811+0+0 (3780490909 0 0) 0x55cdc9a85800 con 0x55cdc9cb9000
-4> 2018-07-27 10:40:45.611787 7f239ae9a700 1 -- 10.0.10.106:6800/3953731285 <== mds.2 10.0.10.6:6801/3462988481 5 ==== cache_rejoin ack v1 ==== 84+0+0 (1576247674 0 0) 0x55cdc9a85e00 con 0x55cdc9cb9000
-3> 2018-07-27 10:40:45.659162 7f239ce6c700 5 -- 10.0.10.106:6800/3953731285 >> 10.0.10.7:6804/2014675692 conn(0x55cdc9cb7800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=291 cs=1 l=0). rx mds.1 seq 2 0x55cdc9a84600 cache_rejoin strong v1
-2> 2018-07-27 10:40:45.659244 7f239ce6c700 5 -- 10.0.10.106:6800/3953731285 >> 10.0.10.7:6804/2014675692 conn(0x55cdc9cb7800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=291 cs=1 l=0). rx mds.1 seq 3 0x55cdc9a84c00 cache_rejoin ack v1
-1> 2018-07-27 10:40:45.659285 7f239ae9a700 1 -- 10.0.10.106:6800/3953731285 <== mds.1 10.0.10.7:6804/2014675692 2 ==== cache_rejoin strong v1 ==== 1435225+0+0 (4071039122 0 0) 0x55cdc9a84600 con 0x55cdc9cb7800
0> 2018-07-27 10:40:45.664103 7f239ae9a700 -1 /build/ceph-12.2.7/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin*)' thread 7f239ae9a700 time 2018-07-27 10:40:45.659346
/build/ceph-12.2.7/src/mds/MDCache.cc: 4632: FAILED assert(in)
ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55cdc05b0ac2]
2: (MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin*)+0x1b2) [0x55cdc0360662]
3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x1eb) [0x55cdc036424b]
4: (MDCache::dispatch(Message*)+0xa5) [0x55cdc03727b5]
5: (MDSRank::handle_deferrable_message(Message*)+0x66c) [0x55cdc023b55c]
6: (MDSRank::_dispatch(Message*, bool)+0x20b) [0x55cdc024a57b]
7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55cdc024b385]
8: (MDSDaemon::ms_dispatch(Message*)+0x1db) [0x55cdc02336bb]
9: (DispatchQueue::entry()+0xeda) [0x55cdc090238a]
10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55cdc06439fd]
11: (()+0x7494) [0x7f23a09ec494]
12: (clone()+0x3f) [0x7f239fa64aff]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Please advise.
Thanks