Dear list,

Today our active MDS crashed with an assert:

2019-10-19 08:14:50.645 7f7906cb7700 -1 /build/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 7f7906cb7700 time 2019-10-19 08:14:50.648559
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: 473: FAILED assert(omap_num_objs <= MAX_OBJECTS)

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f7911b2897e]
 2: (()+0x2fab07) [0x7f7911b28b07]
 3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b27) [0x7703f7]
 4: (MDLog::trim(int)+0x5a6) [0x75dcd6]
 5: (MDSRankDispatcher::tick()+0x24b) [0x4f013b]
 6: (FunctionContext::finish(int)+0x2c) [0x4d52dc]
 7: (Context::complete(int)+0x9) [0x4d31d9]
 8: (SafeTimer::timer_thread()+0x18b) [0x7f7911b2520b]
 9: (SafeTimerThread::entry()+0xd) [0x7f7911b2686d]
 10: (()+0x76ba) [0x7f79113a76ba]
 11: (clone()+0x6d) [0x7f7910bd041d]

2019-10-19 08:14:50.649 7f7906cb7700 -1 *** Caught signal (Aborted) **
 in thread 7f7906cb7700 thread_name:safe_timer

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (()+0x11390) [0x7f79113b1390]
 2: (gsignal()+0x38) [0x7f7910afe428]
 3: (abort()+0x16a) [0x7f7910b0002a]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7f7911b28a86]
 5: (()+0x2fab07) [0x7f7911b28b07]
 6: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b27) [0x7703f7]
 7: (MDLog::trim(int)+0x5a6) [0x75dcd6]
 8: (MDSRankDispatcher::tick()+0x24b) [0x4f013b]
 9: (FunctionContext::finish(int)+0x2c) [0x4d52dc]
 10: (Context::complete(int)+0x9) [0x4d31d9]
 11: (SafeTimer::timer_thread()+0x18b) [0x7f7911b2520b]
 12: (SafeTimerThread::entry()+0xd) [0x7f7911b2686d]
 13: (()+0x76ba) [0x7f79113a76ba]
 14: (clone()+0x6d) [0x7f7910bd041d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Apparently this is bug 36094 (https://tracker.ceph.com/issues/36094).

Our active MDS had mds_cache_memory_limit=150G and ~ 27 M caps handed out to 78 clients, a few of them holding many millions of caps. This resulted in a laggy MDS ... another failover ... until the MDS was finally able to cope with the load.

We adjusted mds_cache_memory_limit to 32G right after that and activated the new limit:

ceph tell mds.* config set mds_cache_memory_limit 34359738368

We double-checked that it was set correctly and monitored memory usage; that all went fine. Around ~ 6 M caps were in use (2 clients held 5/6 of those). After ~ 5 hours the same assert was hit. Fortunately the failover was much faster now ... but then the newly active MDS hit the same assert again, triggering another failover ... the other MDS took over and failed as well ... the first took over again and CephFS was healthy once more.

The bug report does not hint at how to prevent this situation. Recently Zoë O'Connell hit the same issue on a Mimic 13.2.6 system:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036702.html

I wonder whether this situation is more likely to be hit on Mimic 13.2.6 than on other releases.

Any hints / help to prevent this from happening?

Thanks,

Stefan

-- 
| BIT BV  https://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                    +31 318 648 688 / info@xxxxxx
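
P.S. For anyone who wants to see which clients hold the most caps, something along these lines should work (a rough sketch: it assumes you run it on the host of the active MDS, that the admin socket is reachable, and that jq is installed; "mds-a" is a placeholder for your MDS name):

  # per-client cap counts, largest sessions first
  ceph daemon mds.mds-a session ls | \
    jq -r 'sort_by(.num_caps) | reverse | .[] | "\(.num_caps)\t\(.client_metadata.hostname // .id)"'

  # total caps across all sessions
  ceph daemon mds.mds-a session ls | jq '[.[].num_caps] | add'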