My CephFS filesystem recently went through a long recovery after losing some PGs and OSDs. It finally came back to "HEALTH_OK" for a while, but then the MDS servers started crashing, and I cannot get any of the 3 MDS servers to stay up now.
They all fail with this error in the logs:
-313> 2019-07-11 17:42:39.820 7f612c147700 1 -- 10.10.30.116:6800/543707238 --> 10.10.30.115:6801/81746 -- mgrreport(unknown.ic2mon02 +0-0 packed 1374) v6 -- 0x2ed1c00 con 0
-313> 2019-07-11 17:42:39.820 7f612b946700 -1 /build/ceph-13.2.6/src/mds/MDCache.cc: In function 'void MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*, CDentry*, snapid_t, CInode**, CDentry::linkage_t*)' thread 7f612b946700 time 2019-07-11 17:42:39.820872
/build/ceph-13.2.6/src/mds/MDCache.cc: 1680: FAILED assert(follows >= realm->get_newest_seq())
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f61367b997e]
2: (()+0x2fab07) [0x7f61367b9b07]
3: (MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*, CDentry*, snapid_t, CInode**, CDentry::linkage_t*)+0xd3f) [0x5f821f]
4: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*, snapid_t)+0xc0) [0x5f8450]
5: (MDCache::predirty_journal_parents(boost::intrusive_ptr<MutationImpl>, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x4b1) [0x5f9141]
6: (Locker::scatter_writebehind(ScatterLock*)+0x465) [0x64a615]
7: (Locker::simple_sync(SimpleLock*, bool*)+0x176) [0x64e506]
8: (Locker::scatter_nudge(ScatterLock*, MDSInternalContextBase*, bool)+0x3dd) [0x652f6d]
9: (Locker::scatter_tick()+0x1e4) [0x6535a4]
10: (Locker::tick()+0x9) [0x6538b9]
11: (MDSRankDispatcher::tick()+0x1e9) [0x4f00d9]
12: (FunctionContext::finish(int)+0x2c) [0x4d52dc]
13: (Context::complete(int)+0x9) [0x4d31d9]
14: (SafeTimer::timer_thread()+0x18b) [0x7f61367b620b]
15: (SafeTimerThread::entry()+0xd) [0x7f61367b786d]
16: (()+0x76ba) [0x7f61360356ba]
17: (clone()+0x6d) [0x7f613585e41d]
-313> 2019-07-11 17:42:39.820 7f612b946700 -1 *** Caught signal (Aborted) **
in thread 7f612b946700 thread_name:safe_timer
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (()+0x11390) [0x7f613603f390]
2: (gsignal()+0x38) [0x7f613578c428]
3: (abort()+0x16a) [0x7f613578e02a]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7f61367b9a86]
5: (()+0x2fab07) [0x7f61367b9b07]
6: (MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*, CDentry*, snapid_t, CInode**, CDentry::linkage_t*)+0xd3f) [0x5f821f]
7: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*, snapid_t)+0xc0) [0x5f8450]
8: (MDCache::predirty_journal_parents(boost::intrusive_ptr<MutationImpl>, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x4b1) [0x5f9141]
9: (Locker::scatter_writebehind(ScatterLock*)+0x465) [0x64a615]
10: (Locker::simple_sync(SimpleLock*, bool*)+0x176) [0x64e506]
11: (Locker::scatter_nudge(ScatterLock*, MDSInternalContextBase*, bool)+0x3dd) [0x652f6d]
12: (Locker::scatter_tick()+0x1e4) [0x6535a4]
13: (Locker::tick()+0x9) [0x6538b9]
14: (MDSRankDispatcher::tick()+0x1e9) [0x4f00d9]
15: (FunctionContext::finish(int)+0x2c) [0x4d52dc]
16: (Context::complete(int)+0x9) [0x4d31d9]
17: (SafeTimer::timer_thread()+0x18b) [0x7f61367b620b]
18: (SafeTimerThread::entry()+0xd) [0x7f61367b786d]
19: (()+0x76ba) [0x7f61360356ba]
20: (clone()+0x6d) [0x7f613585e41d]