Re: mds lost very frequently


 



Thanks a lot, Yan Zheng!

Regarding "set debug_mds = 10 for standby mds (change debug_mds to 0 after mds becomes active)":
Could you please explain the purpose? Is it just to collect debug logs, or does it actually help prevent the MDS from being lost?

Regarding the patch itself: sorry, we didn't compile from source. However, may I ask whether it will be included in a future v12 release? Thanks.

BR
Oliver

-----Original Message-----
From: Yan, Zheng [mailto:ukernel@xxxxxxxxx] 
Sent: Thursday, December 13, 2018 3:44 PM
To: Sang, Oliver <oliver.sang@xxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  mds lost very frequently

On Thu, Dec 13, 2018 at 2:55 AM Sang, Oliver <oliver.sang@xxxxxxxxx> wrote:
>
> We are using luminous; we have seven ceph nodes and set them all up as MDS.
>
> Recently the MDS daemons have been getting lost very frequently, and when there is only one MDS left, the cephfs degrades to unusable.
>
>
>
> Checked the mds log in one ceph node, I found below
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>
> /build/ceph-12.2.8/src/mds/Locker.cc: 5076: FAILED 
> assert(lock->get_state() == LOCK_PRE_SCAN)
>
>
>
> ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
> luminous (stable)
>
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x102) [0x564400e50e42]
>
> 2: (Locker::file_recover(ScatterLock*)+0x208) [0x564400c6ae18]
>
> 3: (MDCache::start_files_to_recover()+0xb3) [0x564400b98af3]
>
> 4: (MDSRank::clientreplay_start()+0x1f7) [0x564400ae04c7]
>
> 5: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x25c0) 
> [0x564400aefd40]
>
> 6: (MDSDaemon::handle_mds_map(MMDSMap*)+0x154d) [0x564400ace3bd]
>
> 7: (MDSDaemon::handle_core_message(Message*)+0x7f3) [0x564400ad1273]
>
> 8: (MDSDaemon::ms_dispatch(Message*)+0x1c3) [0x564400ad15a3]
>
> 9: (DispatchQueue::entry()+0xeda) [0x5644011a547a]
>
> 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x564400ee3fcd]
>
> 11: (()+0x7494) [0x7f7a2b106494]
>
> 12: (clone()+0x3f) [0x7f7a2a17eaff]
>
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>
>
>
> The full log is also attached. Could you please help us? Thanks!
>
>

Please try the patch below if you can compile ceph from source.  If you can't compile ceph, or the issue still happens, please set debug_mds = 10 for the standby mds (change debug_mds back to 0 after the mds becomes active).
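(For reference, one way to apply that setting persistently is via ceph.conf on the standby node; this is a minimal sketch using the standard [mds] section, and the value can also be changed at runtime with `ceph tell mds.<id> injectargs '--debug_mds 10'`.)

```ini
[mds]
# verbose MDS debug logging while the daemon is standby;
# set back to 0 once the mds becomes active
debug_mds = 10
```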

Regards
Yan, Zheng

diff --git a/src/mds/MDSRank.cc b/src/mds/MDSRank.cc
index 1e8b024b8a..d1150578f1 100644
--- a/src/mds/MDSRank.cc
+++ b/src/mds/MDSRank.cc
@@ -1454,8 +1454,8 @@ void MDSRank::rejoin_done()
 void MDSRank::clientreplay_start()
 {
   dout(1) << "clientreplay_start" << dendl;
-  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   mdcache->start_files_to_recover();
+  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   queue_one_replay();
 }

@@ -1487,8 +1487,8 @@ void MDSRank::active_start()

   mdcache->clean_open_file_lists();
   mdcache->export_remaining_imported_caps();
-  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   mdcache->start_files_to_recover();
+  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters

   mdcache->reissue_all_caps();
   mdcache->activate_stray_manager();



>
> BR
>
> Oliver
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


