Recovery from "FAILED assert(omap_num_objs <= MAX_OBJECTS)"

We have run into what looks like bug 36094 (https://tracker.ceph.com/issues/36094) on our 13.2.6 cluster, and one of our ranks (rank 1) now won't start: it comes up for a few seconds, then the assigned MDS crashes again with the log entries below. It appears that the OpenFileTable has somehow become corrupted, but it is not clear from the Ceph tool documentation whether there is any way to clear it.

Before we resort to deleting and recreating the cluster, are there any further recovery steps we can perform?

Thanks.

2019-08-27 16:10:50.775 7f2c94581700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 7f2c94581700 time 2019-08-27 16:10:50.774858
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/mds/OpenFileTable.cc: 473: FAILED assert(omap_num_objs <= MAX_OBJECTS)

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x7f2ca064636b]
 2: (()+0x26e4f7) [0x7f2ca06464f7]
 3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b35) [0x557afbe49265]
 4: (MDLog::trim(int)+0x5a6) [0x557afbe36a86]
 5: (MDSRankDispatcher::tick()+0x24b) [0x557afbbcd97b]
 6: (FunctionContext::finish(int)+0x2c) [0x557afbbb326c]
 7: (Context::complete(int)+0x9) [0x557afbbb0ef9]
 8: (SafeTimer::timer_thread()+0x18b) [0x7f2ca0642c3b]
 9: (SafeTimerThread::entry()+0xd) [0x7f2ca06441fd]
 10: (()+0x7dd5) [0x7f2c9e284dd5]
 11: (clone()+0x6d) [0x7f2c9d36202d]

2019-08-27 16:10:50.777 7f2c94581700 -1 *** Caught signal (Aborted) **
 in thread 7f2c94581700 thread_name:safe_timer

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (()+0xf5d0) [0x7f2c9e28c5d0]
 2: (gsignal()+0x37) [0x7f2c9d29a2c7]
 3: (abort()+0x148) [0x7f2c9d29b9b8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x248) [0x7f2ca0646468]
 5: (()+0x26e4f7) [0x7f2ca06464f7]
 6: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b35) [0x557afbe49265]
 7: (MDLog::trim(int)+0x5a6) [0x557afbe36a86]
 8: (MDSRankDispatcher::tick()+0x24b) [0x557afbbcd97b]
 9: (FunctionContext::finish(int)+0x2c) [0x557afbbb326c]
 10: (Context::complete(int)+0x9) [0x557afbbb0ef9]
 11: (SafeTimer::timer_thread()+0x18b) [0x7f2ca0642c3b]
 12: (SafeTimerThread::entry()+0xd) [0x7f2ca06441fd]
 13: (()+0x7dd5) [0x7f2c9e284dd5]
 14: (clone()+0x6d) [0x7f2c9d36202d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

