Re: [Ceph-community] Recover ceph-mon from 'Segfault on startup'

On Mon, 6 Aug 2018, Brayan Perera wrote:
> Hi Sage,
> 
> Please find the attachment.

You need to wait until these builds complete,
https://shaman.ceph.com/repos/ceph/wip-rmap-thing/520b28b2e461e2fdb317b1566f48ae844c08cb43/,
then install the packages on one of the mon hosts and run the command to see
the new debug output.  (Or apply the patch locally.)
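
For example, on one of the mon hosts (a rough sketch, assuming a
systemd-managed mon on a yum-based distro; the shaman repo URL layout and
distro path below are assumptions, so adjust them for your environment):

  # point yum at the test build from shaman (URL layout is an assumption)
  curl -L -o /etc/yum.repos.d/ceph-wip-rmap-thing.repo \
    https://shaman.ceph.com/api/repos/ceph/wip-rmap-thing/520b28b2e461e2fdb317b1566f48ae844c08cb43/centos/7/repo
  yum install -y ceph-mon                      # pulls the patched build
  systemctl stop ceph-mon@$(hostname)          # stop the packaged unit first
  ceph-mon -i $(hostname) -d 2>&1 | tee output # capture the new debug output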

Thanks!
sage

> 
> On Mon, Aug 6, 2018 at 7:54 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > On Mon, 6 Aug 2018, Brayan Perera wrote:
> >> Hi Sage,
> >>
> >> This happened on a production Ceph cluster. The ceph-mons suddenly went
> >> offline.
> >>
> >> When we checked the logs while restarting the ceph-mon services, we got the
> >> above error. The complete log entry is attached as 'ceph_mon_start_error.txt'.
> >>
> >>
> >> The issue started on one of the test pools.
> >>
> >>
> >> The pool has size set to 1, but the associated crush rule has min_size set
> >> to 2.  Can this cause the issue?
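
(With the mons down, the pool and rule settings can still be cross-checked
offline from a copy of the mon store; a sketch, assuming ceph-monstore-tool's
`get osdmap` syntax and the store path quoted above; exact option spellings
may differ by version:)

  # extract the latest osdmap from the (stopped) mon's store
  ceph-monstore-tool /var/lib/ceph/mon/ceph-cephmon get osdmap -- --out /tmp/osdmap
  osdmaptool /tmp/osdmap --print                         # pool lines show size/min_size
  osdmaptool /tmp/osdmap --export-crush /tmp/crush.bin   # pull out the crush map
  crushtool -d /tmp/crush.bin -o /tmp/crush.txt          # decompile; check the rule's min_size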
> >>
> >>
> >> As a workaround we tried to change the crush rule and inject the crush
> >> map, but the crush map is still not getting changed.
> >>
> >>
> >> /usr/bin/ceph-monstore-tool /var/lib/ceph/mon/ceph-cephmon rewrite-crush --
> >> --crush ./osdmap_crush.updated_compiled
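
(For reference, the usual offline cycle around rewrite-crush looks roughly
like the following; file names are placeholders, the exact subcommand and
option spellings may differ by version, and per the note below it is best not
to apply any of this until the crash is understood:)

  # with all mons stopped, and ideally against a backed-up copy of the store:
  ceph-monstore-tool /var/lib/ceph/mon/ceph-cephmon get crushmap -- --out /tmp/crush.bin
  crushtool -d /tmp/crush.bin -o /tmp/crush.txt    # decompile, then edit the rule
  crushtool -c /tmp/crush.txt -o /tmp/crush.new    # recompile the edited map
  ceph-monstore-tool /var/lib/ceph/mon/ceph-cephmon rewrite-crush -- --crush /tmp/crush.new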
> >>
> >>
> >> Do we have an option to change the pool size in the osdmap directly inside
> >> ceph-mon?
> >
> > I would not mess with the monitor's osdmap until we understand what is
> > wrong.  The sequence you describe shouldn't have caused a crash (and a
> > quick attempt to reproduce this failed).
> >
> > I pushed a branch, https://github.com/ceph/ceph-ci/commit/wip-rmap-thing,
> > with some debug output.  This will build at shaman.ceph.com shortly.  Do
> > you mind running this on one of your mons with
> >
> >  ceph-mon -i `hostname` -d 2>&1 | tee output
> >
> > and capturing the output?  It looks like an out-of-bounds access on
> > acting_rmap but it's not clear to me how that can happen.
> >
> > Thanks!
> > sage
> >
> >
> >>
> >> Thanks & Regards,
> >> Brayan
> >>
> >>
> >>
> >>
> >> On Mon, Aug 6, 2018 at 6:50 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >>
> >> > Moving to ceph-devel.
> >> >
> >> > On Mon, 6 Aug 2018, Brayan Perera wrote:
> >> > > Dear All,
> >> > >
> >> > > We have encountered an issue on the ceph mons, where all the ceph-mon
> >> > > instances went offline with the following error.
> >> > >
> >> > >
> >> > > =======================
> >> > >      0> 2018-08-06 18:20:56.373266 7f9e4b98d700 -1 *** Caught signal
> >> > > (Segmentation fault) **
> >> > >  in thread 7f9e4b98d700 thread_name:cpu_tp
> >> > >
> >> > >  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous
> >> > > (stable)
> >> > >  1: (()+0x8f51b1) [0x560c971ab1b1]
> >> > >  2: (()+0xf6d0) [0x7f9e541106d0]
> >> > >  3: (void std::vector<pg_t,
> >> > > mempool::pool_allocator<(mempool::pool_index_t)16, pg_t>
> >> > > >::_M_emplace_back_aux<pg_t const&>(pg_t const&)+0x69) [0x560c96fb5fd9]
> >> > >  4: (OSDMapMapping::_build_rmap(OSDMap const&)+0x209) [0x560c96fb5389]
> >> > >  5: (OSDMapMapping::_finish(OSDMap const&)+0x11) [0x560c96fb53c1]
> >> > >  6: (ParallelPGMapper::Job::finish_one()+0x82) [0x560c96fb4332]
> >> > >  7: (ParallelPGMapper::WQ::_process(ParallelPGMapper::Item*,
> >> > > ThreadPool::TPHandle&)+0x7f) [0x560c96fb43ff]
> >> > >  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa8e) [0x560c96eaf59e]
> >> > >  9: (ThreadPool::WorkThread::entry()+0x10) [0x560c96eb0480]
> >> > >  10: (()+0x7e25) [0x7f9e54108e25]
> >> > >  11: (clone()+0x6d) [0x7f9e512febad]
> >> > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> >> > > to interpret this.
> >> > > ============================
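
(Side note: with the debuginfo for this exact 12.2.5 build installed, the
in-binary offsets from the trace, e.g. the (()+0x8f51b1) in frame 1, can
usually be resolved to a function and source line; a sketch, assuming the
RHEL/CentOS-style ceph-debuginfo package name:)

  yum install -y ceph-debuginfo                 # package name is an assumption
  addr2line -Cfie /usr/bin/ceph-mon 0x8f51b1    # resolve frame 1's offset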
> >> >
> >> > I haven't seen this one before.  I take it it happens reliably when you
> >> > try to start the mon?  Can you reproduce the bug and generate a mon log
> >> > file by setting 'debug mon = 20'?
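
(Concretely, that is a ceph.conf change plus a foreground run; a sketch,
reusing the host layout quoted elsewhere in the thread:)

  # /etc/ceph/ceph.conf on the mon host
  [mon]
      debug mon = 20

  # run the mon in the foreground and keep the output
  ceph-mon -i $(hostname) -d 2>&1 | tee /tmp/ceph-mon.debug.log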
> >> >
> >> > Thanks!
> >> > sage
> >> >
> >> > >
> >> > >
> >> > > Before this, we had changed one of the test pools' size from 3 to 1 for
> >> > > testing purposes.
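
(That change was presumably made with something along these lines; the pool
name here is a placeholder:)

  ceph osd pool set test-pool size 1    # previously 3; leaves a single replica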
> >> > >
> >> > > Is there a way to recover from this without losing the data?
> >> > >
> >> > >
> >> > > Thanks in advance.
> >> > >
> >> > > Regards,
> >> > > Brayan
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Best Regards,
> >> > > Brayan Perera
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Best Regards,
> >> Brayan Perera
> >>
> 
> 
> 
> -- 
> Best Regards,
> Brayan Perera
> 


