Re: mds lost very frequently

On Thu, Dec 13, 2018 at 2:55 AM Sang, Oliver <oliver.sang@xxxxxxxxx> wrote:
>
> We are using Luminous; we have seven Ceph nodes and have set them all up as MDS.
>
> Recently the MDS daemons have been failing very frequently, and once only one MDS is left, CephFS degrades to the point of being unusable.
>
>
>
> Checking the MDS log on one of the Ceph nodes, I found the following:
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>
> /build/ceph-12.2.8/src/mds/Locker.cc: 5076: FAILED assert(lock->get_state() == LOCK_PRE_SCAN)
>
>
>
> ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
>
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x564400e50e42]
>
> 2: (Locker::file_recover(ScatterLock*)+0x208) [0x564400c6ae18]
>
> 3: (MDCache::start_files_to_recover()+0xb3) [0x564400b98af3]
>
> 4: (MDSRank::clientreplay_start()+0x1f7) [0x564400ae04c7]
>
> 5: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x25c0) [0x564400aefd40]
>
> 6: (MDSDaemon::handle_mds_map(MMDSMap*)+0x154d) [0x564400ace3bd]
>
> 7: (MDSDaemon::handle_core_message(Message*)+0x7f3) [0x564400ad1273]
>
> 8: (MDSDaemon::ms_dispatch(Message*)+0x1c3) [0x564400ad15a3]
>
> 9: (DispatchQueue::entry()+0xeda) [0x5644011a547a]
>
> 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x564400ee3fcd]
>
> 11: (()+0x7494) [0x7f7a2b106494]
>
> 12: (clone()+0x3f) [0x7f7a2a17eaff]
>
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>
>
>
> The full log is also attached. Could you please help us? Thanks!
>
>

Please try the patch below if you can compile Ceph from source. If
you can't compile Ceph, or the issue still happens, please set
debug_mds = 10 for the standby MDS (and change debug_mds back to 0
after the MDS becomes active).
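
My reading of the backtrace is that the contexts kicked by
finish_contexts() can change the state of file locks that
start_files_to_recover() expects to still be in LOCK_PRE_SCAN, which
then trips the assert in Locker::file_recover(); the patch simply runs
the recovery scan before kicking the waiters.

For raising the log level, a minimal sketch (the daemon name
"mds.node1" is just an example; substitute your standby's name, or set
"debug mds = 10" in the [mds] section of ceph.conf and restart the
daemon):

  # on the host running the standby MDS:
  ceph daemon mds.node1 config set debug_mds 10
  # once it has become active and you have captured the log:
  ceph daemon mds.node1 config set debug_mds 0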

Regards
Yan, Zheng

diff --git a/src/mds/MDSRank.cc b/src/mds/MDSRank.cc
index 1e8b024b8a..d1150578f1 100644
--- a/src/mds/MDSRank.cc
+++ b/src/mds/MDSRank.cc
@@ -1454,8 +1454,8 @@ void MDSRank::rejoin_done()
 void MDSRank::clientreplay_start()
 {
   dout(1) << "clientreplay_start" << dendl;
-  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   mdcache->start_files_to_recover();
+  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   queue_one_replay();
 }

@@ -1487,8 +1487,8 @@ void MDSRank::active_start()

   mdcache->clean_open_file_lists();
   mdcache->export_remaining_imported_caps();
-  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   mdcache->start_files_to_recover();
+  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters

   mdcache->reissue_all_caps();
   mdcache->activate_stray_manager();
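
In case it helps, a rough sketch of applying the patch to a 12.2.8
source tree (the patch filename here is hypothetical; save the diff
above to a file first):

  cd ceph                              # a v12.2.8 source checkout
  git apply mds-recover-order.patch    # the diff above
  ./do_cmake.sh
  cd build
  make -j$(nproc) ceph-mds             # rebuild just the MDS binary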



>
> BR
>
> Oliver
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


