Re: mds lost very frequently

Stefan Kooman <stefan@xxxxxx> · Wed, 24 Feb 2021 21:59:17 +0100

On 2/6/20 6:04 PM, Stefan Kooman wrote:
Hi,

After setting:

ceph config set mds mds_recall_max_caps 10000

(5000 before change)

and

ceph config set mds mds_recall_max_decay_rate 1.0

(2.5 before change)

And then:

ceph tell 'mds.*' injectargs '--mds_recall_max_caps 10000'
ceph tell 'mds.*' injectargs '--mds_recall_max_decay_rate 1.0'

our up:active MDS stopped responding and the standby-replay stepped in
... and hit an assert (same as in this thread):

2020-02-06 16:42:16.712 7ff76a528700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2020-02-06 16:42:17.616 7ff76ff1b700  0 mds.beacon.mds2  MDS is no longer laggy
2020-02-06 16:42:20.348 7ff76d716700 -1 /build/ceph-13.2.8/src/mds/Locker.cc: In function 'void Locker::file_recover(ScatterLock*)' thread 7ff76d716700 time 2020-02-06 16:42:20.351124
/build/ceph-13.2.8/src/mds/Locker.cc: 5307: FAILED assert(lock->get_state() == LOCK_PRE_SCAN)

  ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7ff7759939de]
  2: (()+0x287b67) [0x7ff775993b67]
  3: (()+0x28a9ea) [0x5585eb2b79ea]
  4: (MDCache::start_files_to_recover()+0xbb) [0x5585eb1f897b]
  5: (MDSRank::active_start()+0x135) [0x5585eb146be5]
  6: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x4e5) [0x5585eb151ea5]
  7: (MDSDaemon::handle_mds_map(MMDSMap*)+0xca8) [0x5585eb134608]
  8: (MDSDaemon::handle_core_message(Message*)+0x6c) [0x5585eb138bbc]
  9: (MDSDaemon::ms_dispatch(Message*)+0xbb) [0x5585eb13929b]
  10: (DispatchQueue::entry()+0xb92) [0x7ff775a56e52]
  11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff775af3e2d]
  12: (()+0x76db) [0x7ff7752846db]
  13: (clone()+0x3f) [0x7ff77446a88f]

2020-02-06 16:42:20.348 7ff76d716700 -1 *** Caught signal (Aborted) **
  in thread 7ff76d716700 thread_name:ms_dispatch

  ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
  1: (()+0x12890) [0x7ff77528f890]
  2: (gsignal()+0xc7) [0x7ff774387e97]
  3: (abort()+0x141) [0x7ff774389801]
  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7ff775993ae6]
  5: (()+0x287b67) [0x7ff775993b67]
  6: (()+0x28a9ea) [0x5585eb2b79ea]
  7: (MDCache::start_files_to_recover()+0xbb) [0x5585eb1f897b]
  8: (MDSRank::active_start()+0x135) [0x5585eb146be5]
  9: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x4e5) [0x5585eb151ea5]
  10: (MDSDaemon::handle_mds_map(MMDSMap*)+0xca8) [0x5585eb134608]
  11: (MDSDaemon::handle_core_message(Message*)+0x6c) [0x5585eb138bbc]
  12: (MDSDaemon::ms_dispatch(Message*)+0xbb) [0x5585eb13929b]
  13: (DispatchQueue::entry()+0xb92) [0x7ff775a56e52]
  14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff775af3e2d]
  15: (()+0x76db) [0x7ff7752846db]
  16: (clone()+0x3f) [0x7ff77446a88f]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
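
To check whether other MDS daemons hit the same assert, something like the sketch below could be used. The function name find_mds_asserts and the example log path are assumptions, not anything from the thread. The pattern also matches the "FAILED ceph_assert" spelling that newer releases print.

```shell
# Hypothetical helper: list failed asserts in the given MDS log files.
# Matches both "FAILED assert(" (as in the mimic trace above) and
# "FAILED ceph_assert(" (the spelling used by newer releases).
find_mds_asserts() {
    # -H prints the file name, -n the line number of each match.
    grep -EHn 'FAILED (ceph_)?assert' "$@"
}

# Example (the log location is an assumption, adjust to your deployment):
#   find_mds_asserts /var/log/ceph/ceph-mds.*.log
```

Each match is printed as file:line:text, so the surrounding backtrace can be looked up in the log by line number.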



Quoting Yan, Zheng (ukernel@xxxxxxxxx):

Please try the patch below if you can compile ceph from source. If you
can't compile ceph or the issue still happens, please set debug_mds =
10 for the standby mds (change debug_mds back to 0 after the mds becomes active).
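
A minimal sketch of how the debug level could be toggled at runtime, using the same "ceph tell ... injectargs" form as earlier in this mail. The helper name set_mds_debug and the daemon name placeholder are assumptions. Values set with injectargs are runtime-only and do not survive a daemon restart.

```shell
# Hypothetical helper: set debug_mds on one MDS daemon at runtime.
# $1 = daemon name (for example the standby's name), $2 = debug level.
set_mds_debug() {
    ceph tell "mds.$1" injectargs "--debug_mds $2"
}

# Raise logging on the standby before the failover:
#   set_mds_debug <standby-name> 10
# Lower it again once that MDS has become active:
#   set_mds_debug <standby-name> 0
```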

Regards
Yan, Zheng

diff --git a/src/mds/MDSRank.cc b/src/mds/MDSRank.cc
index 1e8b024b8a..d1150578f1 100644
--- a/src/mds/MDSRank.cc
+++ b/src/mds/MDSRank.cc
@@ -1454,8 +1454,8 @@ void MDSRank::rejoin_done()
 void MDSRank::clientreplay_start()
 {
   dout(1) << "clientreplay_start" << dendl;
-  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   mdcache->start_files_to_recover();
+  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   queue_one_replay();
 }

@@ -1487,8 +1487,8 @@ void MDSRank::active_start()

   mdcache->clean_open_file_lists();
   mdcache->export_remaining_imported_caps();
-  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   mdcache->start_files_to_recover();
+  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters

   mdcache->reissue_all_caps();
   mdcache->activate_stray_manager();

AFAICT this patch has never been tested and never committed. Do you still think
this might fix the issue? Any hints on how we might reproduce this issue, i.e.
failing the active mds and hitting this specific recovery scenario?

We will happily apply this patch and do testing to check if it really fixes the
issue.
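
For the "failing active mds" part, a test cluster could force the failover with something like the sketch below. The helper name force_mds_failover is an assumption. It only triggers the takeover by a standby and does not by itself guarantee that files are in the recovery state seen in the trace, so it is a starting point rather than a known reproducer.

```shell
# Hypothetical helper: fail one MDS so that a standby takes over its rank.
# $1 = MDS name or rank. Intended for a test cluster, not production.
force_mds_failover() {
    ceph mds fail "$1"
    # Print the resulting MDS states to see which daemon took over.
    ceph mds stat
}

# Example: force_mds_failover <active-mds-name>
```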

Apparently there is a tracker for this
(https://tracker.ceph.com/issues/48096) and a backport for octopus:
https://tracker.ceph.com/issues/48096. The nautilus backport is on its way:
https://tracker.ceph.com/issues/48095

FYI,

Gr. Stefan