Re: mds lost very frequently

Hi,

After setting:

ceph config set mds mds_recall_max_caps 10000

(5000 before change)

and 

ceph config set mds mds_recall_max_decay_rate 1.0

(2.5 before change)

and applying the same values to the running daemons with:

ceph tell 'mds.*' injectargs '--mds_recall_max_caps 10000'
ceph tell 'mds.*' injectargs '--mds_recall_max_decay_rate 1.0'
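
As a side note, to double-check what a running daemon actually picked
up (rather than what is stored in the mon config database), querying
the admin socket on the MDS host should work; "mds2" below is just our
daemon name, substitute your own:

ceph daemon mds.mds2 config get mds_recall_max_caps
ceph daemon mds.mds2 config get mds_recall_max_decay_rate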

Shortly after this, our up:active MDS stopped responding and the
standby-replay stepped in ... and hit an assert (the same one as
earlier in this thread):

2020-02-06 16:42:16.712 7ff76a528700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2020-02-06 16:42:17.616 7ff76ff1b700  0 mds.beacon.mds2  MDS is no longer laggy
2020-02-06 16:42:20.348 7ff76d716700 -1 /build/ceph-13.2.8/src/mds/Locker.cc: In function 'void Locker::file_recover(ScatterLock*)' thread 7ff76d716700 time 2020-02-06 16:42:20.351124
/build/ceph-13.2.8/src/mds/Locker.cc: 5307: FAILED assert(lock->get_state() == LOCK_PRE_SCAN)

 ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7ff7759939de]
 2: (()+0x287b67) [0x7ff775993b67]
 3: (()+0x28a9ea) [0x5585eb2b79ea]
 4: (MDCache::start_files_to_recover()+0xbb) [0x5585eb1f897b]
 5: (MDSRank::active_start()+0x135) [0x5585eb146be5]
 6: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x4e5) [0x5585eb151ea5]
 7: (MDSDaemon::handle_mds_map(MMDSMap*)+0xca8) [0x5585eb134608]
 8: (MDSDaemon::handle_core_message(Message*)+0x6c) [0x5585eb138bbc]
 9: (MDSDaemon::ms_dispatch(Message*)+0xbb) [0x5585eb13929b]
 10: (DispatchQueue::entry()+0xb92) [0x7ff775a56e52]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff775af3e2d]
 12: (()+0x76db) [0x7ff7752846db]
 13: (clone()+0x3f) [0x7ff77446a88f]

2020-02-06 16:42:20.348 7ff76d716700 -1 *** Caught signal (Aborted) **
 in thread 7ff76d716700 thread_name:ms_dispatch

 ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
 1: (()+0x12890) [0x7ff77528f890]
 2: (gsignal()+0xc7) [0x7ff774387e97]
 3: (abort()+0x141) [0x7ff774389801]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7ff775993ae6]
 5: (()+0x287b67) [0x7ff775993b67]
 6: (()+0x28a9ea) [0x5585eb2b79ea]
 7: (MDCache::start_files_to_recover()+0xbb) [0x5585eb1f897b]
 8: (MDSRank::active_start()+0x135) [0x5585eb146be5]
 9: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x4e5) [0x5585eb151ea5]
 10: (MDSDaemon::handle_mds_map(MMDSMap*)+0xca8) [0x5585eb134608]
 11: (MDSDaemon::handle_core_message(Message*)+0x6c) [0x5585eb138bbc]
 12: (MDSDaemon::ms_dispatch(Message*)+0xbb) [0x5585eb13929b]
 13: (DispatchQueue::entry()+0xb92) [0x7ff775a56e52]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff775af3e2d]
 15: (()+0x76db) [0x7ff7752846db]
 16: (clone()+0x3f) [0x7ff77446a88f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.



Quoting Yan, Zheng (ukernel@xxxxxxxxx):

> Please try below patch if you can compile ceph from source.  If you
> can't compile ceph or the issue still happens, please set  debug_mds =
> 10 for standby mds (change debug_mds to 0 after mds becomes active).
> 
> Regards
> Yan, Zheng
> 
> diff --git a/src/mds/MDSRank.cc b/src/mds/MDSRank.cc
> index 1e8b024b8a..d1150578f1 100644
> --- a/src/mds/MDSRank.cc
> +++ b/src/mds/MDSRank.cc
> @@ -1454,8 +1454,8 @@ void MDSRank::rejoin_done()
>  void MDSRank::clientreplay_start()
>  {
>    dout(1) << "clientreplay_start" << dendl;
> -  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
>    mdcache->start_files_to_recover();
> +  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
>    queue_one_replay();
>  }
> 
> @@ -1487,8 +1487,8 @@ void MDSRank::active_start()
> 
>    mdcache->clean_open_file_lists();
>    mdcache->export_remaining_imported_caps();
> -  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
>    mdcache->start_files_to_recover();
> +  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
> 
>    mdcache->reissue_all_caps();
>    mdcache->activate_stray_manager();

AFAICT this patch has never been tested and never committed. Do you
still think it might fix the issue? And do you have any hints on how we
might reproduce it, i.e. failing the active MDS and hitting this
specific recovery scenario?

We will happily apply this patch and run tests to check whether it
really fixes the issue.
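
If it helps, the rough reproduction attempt we have in mind looks like
this (the daemon names are placeholders; whether a plain failover under
client load is enough to hit the assert is exactly what we are unsure
about):

# raise MDS debugging on the standby(-replay), as suggested above
ceph tell mds.<standby-name> injectargs '--debug_mds 10'

# generate some client I/O so there are open files/caps to recover,
# then fail the active rank so the standby-replay has to take over
ceph mds fail <active-mds-rank-or-name>

# once the new MDS reaches up:active, lower the debug level again
ceph tell mds.<new-active-name> injectargs '--debug_mds 0'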

Gr. Stefan

P.S. For my understanding: the MDS should never stop responding merely
because these parameters are changed, right?



-- 
| BIT BV  https://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


