Issue with Rejoining MDS

Hi, 

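I am trying to repair a failed CephFS cluster that runs multiple MDS daemons (Ceph 12.2.7, luminous). The failed MDS crashes on restart and will not stay up: it hits an assertion in MDCache::handle_cache_rejoin_strong while processing cache_rejoin messages from the other MDS ranks. I could not find an existing bug report for this specific failure. Here are the logs: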

    -9> 2018-07-27 10:40:45.591137 7f239ae9a700  5 mds.lift-2 handle_mds_map epoch 3562 from mds.2
    -8> 2018-07-27 10:40:45.591138 7f239ae9a700  5 mds.lift-2  old map epoch 3562 <= 3562, discarding
    -7> 2018-07-27 10:40:45.593404 7f239d66d700  5 -- 10.0.10.106:6800/3953731285 >> 10.0.10.6:6801/3462988481 conn(0x55cdc9cb9000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=233 cs=1 l=0). rx mds.2 seq 4 0x55cdc9a85800 cache_rejoin strong v1
    -6> 2018-07-27 10:40:45.593430 7f239d66d700  5 -- 10.0.10.106:6800/3953731285 >> 10.0.10.6:6801/3462988481 conn(0x55cdc9cb9000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=233 cs=1 l=0). rx mds.2 seq 5 0x55cdc9a85e00 cache_rejoin ack v1
    -5> 2018-07-27 10:40:45.593470 7f239ae9a700  1 -- 10.0.10.106:6800/3953731285 <== mds.2 10.0.10.6:6801/3462988481 4 ==== cache_rejoin strong v1 ==== 549811+0+0 (3780490909 0 0) 0x55cdc9a85800 con 0x55cdc9cb9000
    -4> 2018-07-27 10:40:45.611787 7f239ae9a700  1 -- 10.0.10.106:6800/3953731285 <== mds.2 10.0.10.6:6801/3462988481 5 ==== cache_rejoin ack v1 ==== 84+0+0 (1576247674 0 0) 0x55cdc9a85e00 con 0x55cdc9cb9000
    -3> 2018-07-27 10:40:45.659162 7f239ce6c700  5 -- 10.0.10.106:6800/3953731285 >> 10.0.10.7:6804/2014675692 conn(0x55cdc9cb7800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=291 cs=1 l=0). rx mds.1 seq 2 0x55cdc9a84600 cache_rejoin strong v1
    -2> 2018-07-27 10:40:45.659244 7f239ce6c700  5 -- 10.0.10.106:6800/3953731285 >> 10.0.10.7:6804/2014675692 conn(0x55cdc9cb7800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=291 cs=1 l=0). rx mds.1 seq 3 0x55cdc9a84c00 cache_rejoin ack v1
    -1> 2018-07-27 10:40:45.659285 7f239ae9a700  1 -- 10.0.10.106:6800/3953731285 <== mds.1 10.0.10.7:6804/2014675692 2 ==== cache_rejoin strong v1 ==== 1435225+0+0 (4071039122 0 0) 0x55cdc9a84600 con 0x55cdc9cb7800
     0> 2018-07-27 10:40:45.664103 7f239ae9a700 -1 /build/ceph-12.2.7/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin*)' thread 7f239ae9a700 time 2018-07-27 10:40:45.659346
/build/ceph-12.2.7/src/mds/MDCache.cc: 4632: FAILED assert(in)

 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55cdc05b0ac2]
 2: (MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin*)+0x1b2) [0x55cdc0360662]
 3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x1eb) [0x55cdc036424b]
 4: (MDCache::dispatch(Message*)+0xa5) [0x55cdc03727b5]
 5: (MDSRank::handle_deferrable_message(Message*)+0x66c) [0x55cdc023b55c]
 6: (MDSRank::_dispatch(Message*, bool)+0x20b) [0x55cdc024a57b]
 7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55cdc024b385]
 8: (MDSDaemon::ms_dispatch(Message*)+0x1db) [0x55cdc02336bb]
 9: (DispatchQueue::entry()+0xeda) [0x55cdc090238a]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55cdc06439fd]
 11: (()+0x7494) [0x7f23a09ec494]
 12: (clone()+0x3f) [0x7f239fa64aff]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
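
To make a trace like the one above easier to compare against existing bug reports, here is a minimal Python sketch that pulls the failed assertion and the symbol names out of an MDS log excerpt. It is not part of Ceph. The function name `summarize_crash`, the two regular expressions, and the `SAMPLE` excerpt are my own assumptions, based only on the line formats visible in this log.

```python
import re

# Assumed format, taken from this log:
#   /build/ceph-12.2.7/src/mds/MDCache.cc: 4632: FAILED assert(in)
ASSERT_RE = re.compile(
    r"^(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<cond>.*)\)\s*$"
)
# Assumed format, taken from this log:
#   " 2: (MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin*)+0x1b2) [0x55cdc0360662]"
FRAME_RE = re.compile(
    r"^\s*(?P<idx>\d+): \((?P<sym>.*)\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]\s*$"
)


def summarize_crash(log_text):
    """Return ((file, line, condition) or None, [(frame_index, symbol), ...])."""
    assertion = None
    frames = []
    for line in log_text.splitlines():
        m = ASSERT_RE.match(line)
        if m:
            assertion = (m.group("file"), int(m.group("line")), m.group("cond"))
            continue
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group("idx")), m.group("sym")))
    return assertion, frames


# Short excerpt copied from the log above, used only as illustrative input.
SAMPLE = """\
/build/ceph-12.2.7/src/mds/MDCache.cc: 4632: FAILED assert(in)

 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55cdc05b0ac2]
 2: (MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin*)+0x1b2) [0x55cdc0360662]
 3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x1eb) [0x55cdc036424b]
 11: (()+0x7494) [0x7f23a09ec494]
"""

if __name__ == "__main__":
    assertion, frames = summarize_crash(SAMPLE)
    print("assertion:", assertion)
    for idx, sym in frames:
        print(idx, sym)
```

The file, line number, and first non-assert frame together make a reasonable search key for a bug tracker; here that would be MDCache.cc, 4632, and MDCache::handle_cache_rejoin_strong.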

Please advise.
Thanks
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
