We have run into what looks like bug 36094
(https://tracker.ceph.com/issues/36094) on our 13.2.6 cluster, and
unfortunately one of our ranks (rank 1) now won't start: it comes up
for a few seconds before the assigned MDS crashes again with the log
entries below. It would appear that the OpenFileTable has somehow become
corrupted, but it's not clear from any of the Ceph tool documentation
whether there is any way of clearing it.
Before we resort to deleting and recreating the cluster, are there any
further recovery steps we can perform?
Thanks.
2019-08-27 16:10:50.775 7f2c94581700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/mds/OpenFileTable.cc:
In function 'void OpenFileTable::commit(MDSInternalContextBase*,
uint64_t, int)' thread 7f2c94581700 time 2019-08-27 16:10:50.774858
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/mds/OpenFileTable.cc:
473: FAILED assert(omap_num_objs <= MAX_OBJECTS)
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x14b) [0x7f2ca064636b]
2: (()+0x26e4f7) [0x7f2ca06464f7]
3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long,
int)+0x1b35) [0x557afbe49265]
4: (MDLog::trim(int)+0x5a6) [0x557afbe36a86]
5: (MDSRankDispatcher::tick()+0x24b) [0x557afbbcd97b]
6: (FunctionContext::finish(int)+0x2c) [0x557afbbb326c]
7: (Context::complete(int)+0x9) [0x557afbbb0ef9]
8: (SafeTimer::timer_thread()+0x18b) [0x7f2ca0642c3b]
9: (SafeTimerThread::entry()+0xd) [0x7f2ca06441fd]
10: (()+0x7dd5) [0x7f2c9e284dd5]
11: (clone()+0x6d) [0x7f2c9d36202d]
2019-08-27 16:10:50.777 7f2c94581700 -1 *** Caught signal (Aborted) **
in thread 7f2c94581700 thread_name:safe_timer
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic
(stable)
1: (()+0xf5d0) [0x7f2c9e28c5d0]
2: (gsignal()+0x37) [0x7f2c9d29a2c7]
3: (abort()+0x148) [0x7f2c9d29b9b8]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x248) [0x7f2ca0646468]
5: (()+0x26e4f7) [0x7f2ca06464f7]
6: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long,
int)+0x1b35) [0x557afbe49265]
7: (MDLog::trim(int)+0x5a6) [0x557afbe36a86]
8: (MDSRankDispatcher::tick()+0x24b) [0x557afbbcd97b]
9: (FunctionContext::finish(int)+0x2c) [0x557afbbb326c]
10: (Context::complete(int)+0x9) [0x557afbbb0ef9]
11: (SafeTimer::timer_thread()+0x18b) [0x7f2ca0642c3b]
12: (SafeTimerThread::entry()+0xd) [0x7f2ca06441fd]
13: (()+0x7dd5) [0x7f2c9e284dd5]
14: (clone()+0x6d) [0x7f2c9d36202d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.