On 2/6/20 6:04 PM, Stefan Kooman wrote:
Hi,
After setting:

  ceph config set mds mds_recall_max_caps 10000        (was 5000)
  ceph config set mds mds_recall_max_decay_rate 1.0    (was 2.5)

and injecting the new values into the running daemons with:

  ceph tell 'mds.*' injectargs '--mds_recall_max_caps 10000'
  ceph tell 'mds.*' injectargs '--mds_recall_max_decay_rate 1.0'

our up:active MDS stopped responding and the standby-replay MDS stepped in
... and hit an assert (the same one as in this thread):
2020-02-06 16:42:16.712 7ff76a528700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2020-02-06 16:42:17.616 7ff76ff1b700 0 mds.beacon.mds2 MDS is no longer laggy
2020-02-06 16:42:20.348 7ff76d716700 -1 /build/ceph-13.2.8/src/mds/Locker.cc: In function 'void Locker::file_recover(ScatterLock*)' thread 7ff76d716700 time 2020-02-06 16:42:20.351124
/build/ceph-13.2.8/src/mds/Locker.cc: 5307: FAILED assert(lock->get_state() == LOCK_PRE_SCAN)
ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7ff7759939de]
2: (()+0x287b67) [0x7ff775993b67]
3: (()+0x28a9ea) [0x5585eb2b79ea]
4: (MDCache::start_files_to_recover()+0xbb) [0x5585eb1f897b]
5: (MDSRank::active_start()+0x135) [0x5585eb146be5]
6: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x4e5) [0x5585eb151ea5]
7: (MDSDaemon::handle_mds_map(MMDSMap*)+0xca8) [0x5585eb134608]
8: (MDSDaemon::handle_core_message(Message*)+0x6c) [0x5585eb138bbc]
9: (MDSDaemon::ms_dispatch(Message*)+0xbb) [0x5585eb13929b]
10: (DispatchQueue::entry()+0xb92) [0x7ff775a56e52]
11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff775af3e2d]
12: (()+0x76db) [0x7ff7752846db]
13: (clone()+0x3f) [0x7ff77446a88f]
2020-02-06 16:42:20.348 7ff76d716700 -1 *** Caught signal (Aborted) **
in thread 7ff76d716700 thread_name:ms_dispatch
ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
1: (()+0x12890) [0x7ff77528f890]
2: (gsignal()+0xc7) [0x7ff774387e97]
3: (abort()+0x141) [0x7ff774389801]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7ff775993ae6]
5: (()+0x287b67) [0x7ff775993b67]
6: (()+0x28a9ea) [0x5585eb2b79ea]
7: (MDCache::start_files_to_recover()+0xbb) [0x5585eb1f897b]
8: (MDSRank::active_start()+0x135) [0x5585eb146be5]
9: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x4e5) [0x5585eb151ea5]
10: (MDSDaemon::handle_mds_map(MMDSMap*)+0xca8) [0x5585eb134608]
11: (MDSDaemon::handle_core_message(Message*)+0x6c) [0x5585eb138bbc]
12: (MDSDaemon::ms_dispatch(Message*)+0xbb) [0x5585eb13929b]
13: (DispatchQueue::entry()+0xb92) [0x7ff775a56e52]
14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff775af3e2d]
15: (()+0x76db) [0x7ff7752846db]
16: (clone()+0x3f) [0x7ff77446a88f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Quoting Yan, Zheng (ukernel@xxxxxxxxx):
Please try the patch below if you can compile Ceph from source. If you
can't compile Ceph, or the issue still happens, please set debug_mds = 10
for the standby MDS (and change debug_mds back to 0 after the MDS becomes active).
Regards
Yan, Zheng
diff --git a/src/mds/MDSRank.cc b/src/mds/MDSRank.cc
index 1e8b024b8a..d1150578f1 100644
--- a/src/mds/MDSRank.cc
+++ b/src/mds/MDSRank.cc
@@ -1454,8 +1454,8 @@ void MDSRank::rejoin_done()
 void MDSRank::clientreplay_start()
 {
   dout(1) << "clientreplay_start" << dendl;
-  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   mdcache->start_files_to_recover();
+  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   queue_one_replay();
 }
@@ -1487,8 +1487,8 @@ void MDSRank::active_start()
   mdcache->clean_open_file_lists();
   mdcache->export_remaining_imported_caps();
-  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   mdcache->start_files_to_recover();
+  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   mdcache->reissue_all_caps();
   mdcache->activate_stray_manager();
AFAICT this patch has never been tested and never committed. Do you still think
it might fix the issue? Any hints on how we could reproduce it, i.e. failing the
active MDS and hitting this specific recovery scenario?
We will happily apply the patch and test whether it really fixes the issue.
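
For our own understanding we tried to model why the reordering matters.
Below is a minimal, self-contained toy model (plain C++, not Ceph code). It
assumes -- and this is only our reading of the patch -- that contexts kicked
via waiting_for_replay can move a file lock out of LOCK_PRE_SCAN before
start_files_to_recover() gets to call file_recover(), which would then trip
the same assert we hit:

// Toy model of the ordering problem (not Ceph code). Assumption: replay
// waiters can change lock state before start_files_to_recover() runs.
#include <cassert>
#include <functional>
#include <iostream>
#include <vector>

enum class LockState { SYNC, PRE_SCAN, SCAN };

struct ScatterLock { LockState state = LockState::SYNC; };
struct Inode { ScatterLock filelock; };

// Mirrors the FAILED assert in Locker::file_recover().
void file_recover(ScatterLock *lock) {
  assert(lock->state == LockState::PRE_SCAN);
  lock->state = LockState::SCAN;
}

int main() {
  // identify_files_to_recover(): locks are put into PRE_SCAN and queued.
  std::vector<Inode> recover_q(2);
  for (auto &in : recover_q)
    in.filelock.state = LockState::PRE_SCAN;

  // Hypothetical replay waiter whose side effect moves a lock out of
  // PRE_SCAN (stand-in for cap/lock processing done by kicked contexts).
  std::vector<std::function<void()>> waiting_for_replay = {
      [&] { recover_q[0].filelock.state = LockState::SYNC; }};

  bool buggy_order = true;  // true: pre-patch order, aborts on the assert
  if (buggy_order) {
    for (auto &w : waiting_for_replay) w();                   // kick waiters first
    for (auto &in : recover_q) file_recover(&in.filelock);    // assert fires here
  } else {
    for (auto &in : recover_q) file_recover(&in.filelock);    // recover first
    for (auto &w : waiting_for_replay) w();                   // then kick waiters
  }
  std::cout << "done" << std::endl;
}

If that reading is correct, running start_files_to_recover() before kicking
the waiters (as the patch does) keeps the PRE_SCAN invariant intact.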
Apparently there is a tracker for this issue
(https://tracker.ceph.com/issues/48096) as well as a backport for octopus
(https://tracker.ceph.com/issues/48096), and a nautilus backport is on its
way (https://tracker.ceph.com/issues/48095).
FYI,
Gr. Stefan