Re: MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

Hi,

Quoting Stefan Kooman (stefan@xxxxxx):

> > please apply following patch, thanks.
> > 
> > diff --git a/src/mds/OpenFileTable.cc b/src/mds/OpenFileTable.cc
> > index c0f72d581d..2ca737470d 100644
> > --- a/src/mds/OpenFileTable.cc
> > +++ b/src/mds/OpenFileTable.cc
> > @@ -470,7 +470,11 @@ void OpenFileTable::commit(MDSInternalContextBase *c, uint64_t log_seq, int op_p
> >   }
> >   if (omap_idx < 0) {
> >     ++omap_num_objs;
> > -   assert(omap_num_objs <= MAX_OBJECTS);
> > +   if (omap_num_objs > MAX_OBJECTS) {
> > +     dout(1) << "omap_num_objs " << omap_num_objs << dendl;
> > +     dout(1) << "anchor_map size " << anchor_map.size() << dendl;
> > +     assert(omap_num_objs <= MAX_OBJECTS);
> > +   }
> >     omap_num_items.resize(omap_num_objs);
> >     omap_updates.resize(omap_num_objs);
> >     omap_updates.back().clear = true;
> 
> It took a while but an MDS server with this debug patch is now live (and
> up:active).

... and it crashed again (and again), until we stopped the MDS and
deleted the mds0_openfiles.0 object from the metadata pool.

Here is the (debug) output:

2019-12-04 06:25:01.578 7f6200248700 -1 received  signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw  (PID: 3491) UID: 0
2019-12-04 20:19:58.043 7f61fc859700  0 mds.0.openfiles omap_num_objs 1025
2019-12-04 20:19:58.043 7f61fc859700  0 mds.0.openfiles anchor_map size 4417650
2019-12-04 20:19:58.043 7f61fc859700 -1 /build/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 7f61fc859700 time 2019-12-04 20:19:58.045875
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: 476: FAILED assert(omap_num_objs <= MAX_OBJECTS)

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f6207d01b5e]
 2: (()+0x2c4cb7) [0x7f6207d01cb7]
 3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1c5f) [0x55e38662566f]
 4: (MDLog::trim(int)+0x5a6) [0x55e386614666]
 5: (MDSRankDispatcher::tick()+0x24b) [0x55e3863a637b]
 6: (FunctionContext::finish(int)+0x2c) [0x55e38638b51c]
 7: (Context::complete(int)+0x9) [0x55e3863894b9]
 8: (SafeTimer::timer_thread()+0xf9) [0x7f6207cfe329]
 9: (SafeTimerThread::entry()+0xd) [0x7f6207cffa3d]
 10: (()+0x76db) [0x7f62075b56db]
 11: (clone()+0x3f) [0x7f620679b88f]

2019-12-04 20:19:58.043 7f61fc859700 -1 *** Caught signal (Aborted) **
 in thread 7f61fc859700 thread_name:safe_timer

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (()+0x12890) [0x7f62075c0890]
 2: (gsignal()+0xc7) [0x7f62066b8e97]
 3: (abort()+0x141) [0x7f62066ba801]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x25f) [0x7f6207d01c6f]
 5: (()+0x2c4cb7) [0x7f6207d01cb7]
 6: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1c5f) [0x55e38662566f]
 7: (MDLog::trim(int)+0x5a6) [0x55e386614666]
 8: (MDSRankDispatcher::tick()+0x24b) [0x55e3863a637b]
 9: (FunctionContext::finish(int)+0x2c) [0x55e38638b51c]
 10: (Context::complete(int)+0x9) [0x55e3863894b9]
 11: (SafeTimer::timer_thread()+0xf9) [0x7f6207cfe329]
 12: (SafeTimerThread::entry()+0xd) [0x7f6207cffa3d]
 13: (()+0x76db) [0x7f62075b56db]
 14: (clone()+0x3f) [0x7f620679b88f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

A specific workload that *might* have triggered this: recursively deleting a long
list of files and directories (~ 7 million in total) with 5 "rm" processes
in parallel ...

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx