Hi,

Quoting Stefan Kooman (stefan@xxxxxx):
> > please apply following patch, thanks.
> >
> > diff --git a/src/mds/OpenFileTable.cc b/src/mds/OpenFileTable.cc
> > index c0f72d581d..2ca737470d 100644
> > --- a/src/mds/OpenFileTable.cc
> > +++ b/src/mds/OpenFileTable.cc
> > @@ -470,7 +470,11 @@ void OpenFileTable::commit(MDSInternalContextBase *c, uint64_t log_seq, int op_p
> >      }
> >      if (omap_idx < 0) {
> >        ++omap_num_objs;
> > -      assert(omap_num_objs <= MAX_OBJECTS);
> > +      if (omap_num_objs > MAX_OBJECTS) {
> > +        dout(1) << "omap_num_objs " << omap_num_objs << dendl;
> > +        dout(1) << "anchor_map size " << anchor_map.size() << dendl;
> > +        assert(omap_num_objs <= MAX_OBJECTS);
> > +      }
> >        omap_num_items.resize(omap_num_objs);
> >        omap_updates.resize(omap_num_objs);
> >        omap_updates.back().clear = true;
>
> It took a while but an MDS server with this debug patch is now live (and
> up:active).

... and it crashed again (and again) ... until we stopped the mds and
deleted the mds0_openfiles.0 from the metadata pool.
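For anyone following along without the Ceph tree at hand, the guard that the quoted patch adds can be sketched standalone roughly as below. This is only an illustration, not the actual MDS code: `kMaxObjects` and `check_object_limit` are hypothetical names, the value 1024 is an assumption that is merely consistent with the assert firing at omap_num_objs 1025 in the debug output in this message, and the function returns the diagnostic text instead of calling dout() and assert().

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical stand-in for MAX_OBJECTS in OpenFileTable.cc. The value is an
// assumption; it is only consistent with the assert firing at 1025 objects.
constexpr unsigned kMaxObjects = 1024;

// Mirrors the shape of the debug guard in the quoted patch: when the object
// count exceeds the limit, report both counters before the assert would fire.
// Returns an empty string while the count is within the limit; otherwise it
// returns the two diagnostic lines that the patch prints.
std::string check_object_limit(unsigned omap_num_objs,
                               std::size_t anchor_map_size) {
  if (omap_num_objs <= kMaxObjects) {
    return {};
  }
  std::ostringstream out;
  out << "omap_num_objs " << omap_num_objs << "\n"
      << "anchor_map size " << anchor_map_size << "\n";
  return out.str();
}
```

Calling `check_object_limit(1025, 4417650)` reproduces the two counter lines from the log, while any count up to 1024 yields an empty string.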
Here is the (debug) output:

2019-12-04 06:25:01.578 7f6200248700 -1 received signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw (PID: 3491) UID: 0
2019-12-04 20:19:58.043 7f61fc859700  0 mds.0.openfiles omap_num_objs 1025
2019-12-04 20:19:58.043 7f61fc859700  0 mds.0.openfiles anchor_map size 4417650
2019-12-04 20:19:58.043 7f61fc859700 -1 /build/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 7f61fc859700 time 2019-12-04 20:19:58.045875
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: 476: FAILED assert(omap_num_objs <= MAX_OBJECTS)

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f6207d01b5e]
 2: (()+0x2c4cb7) [0x7f6207d01cb7]
 3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1c5f) [0x55e38662566f]
 4: (MDLog::trim(int)+0x5a6) [0x55e386614666]
 5: (MDSRankDispatcher::tick()+0x24b) [0x55e3863a637b]
 6: (FunctionContext::finish(int)+0x2c) [0x55e38638b51c]
 7: (Context::complete(int)+0x9) [0x55e3863894b9]
 8: (SafeTimer::timer_thread()+0xf9) [0x7f6207cfe329]
 9: (SafeTimerThread::entry()+0xd) [0x7f6207cffa3d]
 10: (()+0x76db) [0x7f62075b56db]
 11: (clone()+0x3f) [0x7f620679b88f]

2019-12-04 20:19:58.043 7f61fc859700 -1 *** Caught signal (Aborted) **
 in thread 7f61fc859700 thread_name:safe_timer

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (()+0x12890) [0x7f62075c0890]
 2: (gsignal()+0xc7) [0x7f62066b8e97]
 3: (abort()+0x141) [0x7f62066ba801]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x25f) [0x7f6207d01c6f]
 5: (()+0x2c4cb7) [0x7f6207d01cb7]
 6: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1c5f) [0x55e38662566f]
 7: (MDLog::trim(int)+0x5a6) [0x55e386614666]
 8: (MDSRankDispatcher::tick()+0x24b) [0x55e3863a637b]
 9: (FunctionContext::finish(int)+0x2c) [0x55e38638b51c]
 10: (Context::complete(int)+0x9) [0x55e3863894b9]
 11: (SafeTimer::timer_thread()+0xf9) [0x7f6207cfe329]
 12: (SafeTimerThread::entry()+0xd) [0x7f6207cffa3d]
 13: (()+0x76db) [0x7f62075b56db]
 14: (clone()+0x3f) [0x7f620679b88f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

A specific workload that *might* have triggered this: recursively deleting
a long list of files and directories (~ 7 million in total) with 5 "rm"
processes in parallel ...

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx