On Fri, Dec 14, 2018 at 12:05 PM Sang, Oliver <oliver.sang@xxxxxxxxx> wrote:
>
> Thanks a lot, Yan Zheng!
>
> I enabled only two MDS - node1 (active) and node2. Then I modified the ceph.conf of node2 to have:
> debug_mds = 10/10
>
> At 08:35:28 I observed the degradation: node1 was no longer an MDS, and node2 changed to active.
> To my surprise, more than 20 million log lines were generated within two minutes, and the mds log file grew to over 6 GB.
> Since that is too large to send as an attachment here, I copied the first 8000 lines of the log from after node2 became active; see the attachment.
> Is that enough? And could you suggest further steps? Thanks!
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> 2018-12-14 08:35:28.686043 7fbe3453a700 1 mds.lkp-ceph-node2 Updating MDS map to version 70345 from mon.0
> ....
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

There is no crash in this log. I need a 'debug_mds=10' log from the time the mds actually crashes.

Yan, Zheng
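Capturing exactly that window from a multi-gigabyte file is easier with line offsets than with just the first N lines. A rough, untested sketch - the log path and the grep pattern are assumptions taken from the messages in this thread, so adjust both to the actual deployment:

    #!/bin/bash
    # Extract a bounded window around the MDS failover instead of
    # attaching a multi-GB log. Path and pattern are assumptions.
    LOG=/var/log/ceph/ceph-mds.lkp-ceph-node2.log

    # Line number where the standby picked up the new MDS map.
    START=$(grep -n 'Updating MDS map to version 70345' "$LOG" | head -n1 | cut -d: -f1)

    # Keep some context before the event and a generous window after it.
    FROM=$(( START > 1000 ? START - 1000 : 1 ))
    TO=$(( START + 20000 ))

    sed -n "${FROM},${TO}p" "$LOG" | gzip > mds-crash-window.log.gz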
> BR
> Oliver
>
> -----Original Message-----
> From: Yan, Zheng [mailto:ukernel@xxxxxxxxx]
> Sent: Thursday, December 13, 2018 9:38 PM
> To: Sang, Oliver <oliver.sang@xxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx; Li, Philip <philip.li@xxxxxxxxx>
> Subject: Re: mds lost very frequently
>
> On Thu, Dec 13, 2018 at 9:25 PM Sang, Oliver <oliver.sang@xxxxxxxxx> wrote:
> >
> > Thanks a lot, Yan Zheng!
> >
> > Regarding "set debug_mds = 10 for standby mds (change debug_mds to 0 after mds becomes active)":
> > could you please explain the purpose? Is it just to collect a debug log, or does it actually have the side effect of preventing the mds from getting lost?
> >
> > Regarding the patch itself - sorry, we didn't compile from source.
> > However, may I ask whether it will be included in a future v12 release? Thanks
> >
>
> The crash happens while the mds recovers. I want to collect a debug log covering that period.
>
> Regards
> Yan, Zheng
>
> > BR
> > Oliver
> >
> > -----Original Message-----
> > From: Yan, Zheng [mailto:ukernel@xxxxxxxxx]
> > Sent: Thursday, December 13, 2018 3:44 PM
> > To: Sang, Oliver <oliver.sang@xxxxxxxxx>
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: mds lost very frequently
> >
> > On Thu, Dec 13, 2018 at 2:55 AM Sang, Oliver <oliver.sang@xxxxxxxxx> wrote:
> > >
> > > We are using luminous; we have seven ceph nodes and set them all up as MDS.
> > >
> > > Recently the MDS have been getting lost very frequently, and when there is only one MDS left, the cephfs just degrades to unusable.
> > >
> > > Checking the mds log on one ceph node, I found the following:
> > >
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > /build/ceph-12.2.8/src/mds/Locker.cc: 5076: FAILED assert(lock->get_state() == LOCK_PRE_SCAN)
> > >
> > > ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
> > > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x564400e50e42]
> > > 2: (Locker::file_recover(ScatterLock*)+0x208) [0x564400c6ae18]
> > > 3: (MDCache::start_files_to_recover()+0xb3) [0x564400b98af3]
> > > 4: (MDSRank::clientreplay_start()+0x1f7) [0x564400ae04c7]
> > > 5: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x25c0) [0x564400aefd40]
> > > 6: (MDSDaemon::handle_mds_map(MMDSMap*)+0x154d) [0x564400ace3bd]
> > > 7: (MDSDaemon::handle_core_message(Message*)+0x7f3) [0x564400ad1273]
> > > 8: (MDSDaemon::ms_dispatch(Message*)+0x1c3) [0x564400ad15a3]
> > > 9: (DispatchQueue::entry()+0xeda) [0x5644011a547a]
> > > 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x564400ee3fcd]
> > > 11: (()+0x7494) [0x7f7a2b106494]
> > > 12: (clone()+0x3f) [0x7f7a2a17eaff]
> > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > >
> > > The full log is also attached. Could you please help us? Thanks!
> >
> > Please try the patch below if you can compile ceph from source. If you can't compile ceph, or the issue still happens, please set debug_mds = 10 for the standby mds (and change debug_mds back to 0 after the mds becomes active).
> >
> > Regards
> > Yan, Zheng
> >
> > diff --git a/src/mds/MDSRank.cc b/src/mds/MDSRank.cc
> > index 1e8b024b8a..d1150578f1 100644
> > --- a/src/mds/MDSRank.cc
> > +++ b/src/mds/MDSRank.cc
> > @@ -1454,8 +1454,8 @@ void MDSRank::rejoin_done()
> >  void MDSRank::clientreplay_start()
> >  {
> >    dout(1) << "clientreplay_start" << dendl;
> > -  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
> >    mdcache->start_files_to_recover();
> > +  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
> >    queue_one_replay();
> >  }
> >
> > @@ -1487,8 +1487,8 @@ void MDSRank::active_start()
> >
> >    mdcache->clean_open_file_lists();
> >    mdcache->export_remaining_imported_caps();
> > -  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
> >    mdcache->start_files_to_recover();
> > +  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
> >
> >    mdcache->reissue_all_caps();
> >    mdcache->activate_stray_manager();
> > >
> > > BR
> > > Oliver

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
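To spell out Yan's logging suggestion concretely: the debug level can be set in ceph.conf on the standby node (as Oliver does above) or changed at runtime. A sketch using standard ceph admin commands - the daemon id 'lkp-ceph-node2' is inferred from the log above and the systemd unit name is an assumption, so verify both against your own setup:

    # Persistent: in /etc/ceph/ceph.conf on the standby node, then restart
    # that daemon so it picks the setting up:
    #   [mds]
    #   debug_mds = 10/10
    systemctl restart ceph-mds@lkp-ceph-node2

    # Runtime alternative: via the admin socket on the node hosting the
    # daemon (works whether or not it currently holds a rank):
    ceph daemon mds.lkp-ceph-node2 config set debug_mds 10/10

    # After the standby has become active and the window is captured, drop
    # the level again to stop the log growth Oliver describes:
    ceph daemon mds.lkp-ceph-node2 config set debug_mds 0/0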