On Mon, 6 Aug 2018, Brayan Perera wrote:
> Hi Sage,
>
> This happened on a production ceph cluster. The ceph mons suddenly
> went offline.
>
> When we checked the logs while restarting the ceph-mon services we got
> the above error. The complete log entry is attached as
> 'ceph_mon_start_error.txt'.
>
> The issue started on one of the test pools. ()
>
> The pool has size 1, but the associated crush rule has min size set
> to 2. Can this cause the issue?
>
> As a workaround we were trying to change the crush rule and inject the
> crush map, but the crush map is still not getting changed.
>
> /usr/bin/ceph-monstore-tool /var/lib/ceph/mon/ceph-cephmon rewrite-crush --
> --crush ./osdmap_crush.updated_compiled
>
> Do we have an option to change the pool size in the osdmap directly
> inside the ceph-mon?

I would not mess with the monitor's osdmap until we understand what is
wrong.  The sequence you describe shouldn't have caused a crash (and a
quick attempt to reproduce this failed).

I pushed a branch, https://github.com/ceph/ceph-ci/commit/wip-rmap-thing,
with some debug output.  This will build at shaman.ceph.com shortly.
Do you mind running this on one of your mons with

 ceph-mon -i `hostname` -d 2>&1 | tee output

and capturing the output?  It looks like an out-of-bounds access on
acting_rmap, but it's not clear to me how that can happen.

Thanks!
sage
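
A side note on the 'debug mon = 20' log requested in the earlier reply
quoted below: assuming the usual option spelling is accepted, the level
can be set either in ceph.conf on the affected mon host or passed
straight to the foreground run above, for example:

    [mon]
        debug mon = 20

or

    ceph-mon -i `hostname` -d --debug_mon 20 2>&1 | tee output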
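
For illustration only (this is a minimal sketch with made-up names, not
the actual OSDMapMapping::_build_rmap code): a reverse map built from
acting sets in the following way indexes past the end of its per-OSD
vectors whenever an acting set carries an OSD id the map was not sized
for, and the resulting emplace_back on garbage memory would segfault
much like the backtrace quoted below.

    // sketch.cc: hypothetical osd -> pg reverse-map build, not Ceph code
    #include <cstdint>
    #include <iostream>
    #include <vector>

    using pg_id  = uint32_t;   // stand-in for pg_t
    using osd_id = int32_t;

    int main() {
      // Hypothetical forward map: for each pg, its acting set of OSDs.
      std::vector<std::vector<osd_id>> acting = {
        {0, 1},   // pg 0
        {1, 2},   // pg 1
        {7, 0},   // pg 2: osd 7 does not exist in this 3-OSD map
      };

      const std::size_t num_osds = 3;
      std::vector<std::vector<pg_id>> acting_rmap(num_osds);

      for (pg_id pg = 0; pg < acting.size(); ++pg) {
        for (osd_id osd : acting[pg]) {
          // Without this bounds check, acting_rmap[7] reads past the end
          // of the vector and the emplace_back scribbles on whatever is
          // there, which is the kind of crash the backtrace shows.
          if (osd < 0 || static_cast<std::size_t>(osd) >= acting_rmap.size()) {
            std::cerr << "pg " << pg << " references unknown osd " << osd << "\n";
            continue;
          }
          acting_rmap[osd].emplace_back(pg);
        }
      }
      return 0;
    }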
>
> Thanks & Regards,
> Brayan
>
> On Mon, Aug 6, 2018 at 6:50 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> > Moving to ceph-devel.
> >
> > On Mon, 6 Aug 2018, Brayan Perera wrote:
> > > Dear All,
> > >
> > > We have encountered an issue on the ceph mons, where all the
> > > ceph-mon instances went offline with the following error.
> > >
> > > =======================
> > >     0> 2018-08-06 18:20:56.373266 7f9e4b98d700 -1 *** Caught signal
> > > (Segmentation fault) **
> > > in thread 7f9e4b98d700 thread_name:cpu_tp
> > >
> > > ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a)
> > > luminous (stable)
> > > 1: (()+0x8f51b1) [0x560c971ab1b1]
> > > 2: (()+0xf6d0) [0x7f9e541106d0]
> > > 3: (void std::vector<pg_t,
> > > mempool::pool_allocator<(mempool::pool_index_t)16, pg_t>
> > > >::_M_emplace_back_aux<pg_t const&>(pg_t const&)+0x69) [0x560c96fb5fd9]
> > > 4: (OSDMapMapping::_build_rmap(OSDMap const&)+0x209) [0x560c96fb5389]
> > > 5: (OSDMapMapping::_finish(OSDMap const&)+0x11) [0x560c96fb53c1]
> > > 6: (ParallelPGMapper::Job::finish_one()+0x82) [0x560c96fb4332]
> > > 7: (ParallelPGMapper::WQ::_process(ParallelPGMapper::Item*,
> > > ThreadPool::TPHandle&)+0x7f) [0x560c96fb43ff]
> > > 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa8e) [0x560c96eaf59e]
> > > 9: (ThreadPool::WorkThread::entry()+0x10) [0x560c96eb0480]
> > > 10: (()+0x7e25) [0x7f9e54108e25]
> > > 11: (clone()+0x6d) [0x7f9e512febad]
> > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> > > needed to interpret this.
> > > ============================
> >
> > I haven't seen this one before.  I take it it happens reliably when
> > you try to start the mon?  Can you reproduce the bug and generate a
> > mon log file by setting 'debug mon = 20'?
> >
> > Thanks!
> > sage
> >
> > > Before this we had changed one of the test pools' size from 3 to 1
> > > for testing purposes.
> > >
> > > Is there a way to recover from this without losing the data?
> > >
> > > Thanks in advance.
> > >
> > > Regards,
> > > Brayan
> > >
> > > --
> > > Best Regards,
> > > Brayan Perera
>
> --
> Best Regards,
> Brayan Perera