On Mon, 6 Aug 2018, Brayan Perera wrote:
> Hi Sage,
>
> This happened on a production ceph cluster. The ceph mons suddenly
> went offline.
>
> When we checked the logs while restarting the ceph-mon services we got
> the above error. The complete log entry is attached as
> 'ceph_mon_start_error.txt'.
>
> The issue started on one of the test pools. ()
>
> The pool has size 1, but the associated crush rule has min size set
> to 2. Can this cause the issue?
>
> As a workaround we were trying to change the crush rule and inject the
> crush map, but the crush map is still not getting changed.
>
> /usr/bin/ceph-monstore-tool /var/lib/ceph/mon/ceph-cephmon rewrite-crush --
> --crush ./osdmap_crush.updated_compiled
>
> Do we have an option to change the pool size in the osdmap directly
> inside the ceph-mon?

I would not mess with the monitor's osdmap until we understand what is
wrong.  The sequence you describe shouldn't have caused a crash (and a
quick attempt to reproduce this failed).

I pushed a branch, https://github.com/ceph/ceph-ci/commit/wip-rmap-thing,
with some debug output.  This will build at shaman.ceph.com shortly.
Do you mind running this on one of your mons with

 ceph-mon -i `hostname` -d 2>&1 | tee output

and capturing the output?  It looks like an out-of-bounds access on
acting_rmap, but it's not clear to me how that can happen.

Thanks!
sage
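
A side note on the 'debug mon = 20' log requested in the earlier reply
quoted below: assuming the usual option spelling is accepted, the level
can be set either in ceph.conf on the affected mon host or passed
straight to the foreground run above, for example:

    [mon]
        debug mon = 20

or

    ceph-mon -i `hostname` -d --debug_mon 20 2>&1 | tee output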
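
For illustration only (this is a minimal sketch with made-up names, not
the actual OSDMapMapping::_build_rmap code): a reverse map built from
acting sets in the following way indexes past the end of its per-OSD
vectors whenever an acting set carries an OSD id the map was not sized
for, and the resulting emplace_back on garbage memory would segfault
much like the backtrace quoted below.

    // sketch.cc: hypothetical osd -> pg reverse-map build, not Ceph code
    #include <cstdint>
    #include <iostream>
    #include <vector>

    using pg_id  = uint32_t;   // stand-in for pg_t
    using osd_id = int32_t;

    int main() {
      // Hypothetical forward map: for each pg, its acting set of OSDs.
      std::vector<std::vector<osd_id>> acting = {
        {0, 1},   // pg 0
        {1, 2},   // pg 1
        {7, 0},   // pg 2: osd 7 does not exist in this 3-OSD map
      };

      const std::size_t num_osds = 3;
      std::vector<std::vector<pg_id>> acting_rmap(num_osds);

      for (pg_id pg = 0; pg < acting.size(); ++pg) {
        for (osd_id osd : acting[pg]) {
          // Without this bounds check, acting_rmap[7] reads past the end
          // of the vector and the emplace_back scribbles on whatever is
          // there, which is the kind of crash the backtrace shows.
          if (osd < 0 || static_cast<std::size_t>(osd) >= acting_rmap.size()) {
            std::cerr << "pg " << pg << " references unknown osd " << osd << "\n";
            continue;
          }
          acting_rmap[osd].emplace_back(pg);
        }
      }
      return 0;
    }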
>
> Thanks & Regards,
> Brayan
>
> On Mon, Aug 6, 2018 at 6:50 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> > Moving to ceph-devel.
> >
> > On Mon, 6 Aug 2018, Brayan Perera wrote:
> > > Dear All,
> > >
> > > We have encountered an issue on the ceph mons, where all the
> > > ceph-mon instances went offline with the following error.
> > >
> > > =======================
> > >     0> 2018-08-06 18:20:56.373266 7f9e4b98d700 -1 *** Caught signal
> > > (Segmentation fault) **
> > > in thread 7f9e4b98d700 thread_name:cpu_tp
> > >
> > > ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a)
> > > luminous (stable)
> > > 1: (()+0x8f51b1) [0x560c971ab1b1]
> > > 2: (()+0xf6d0) [0x7f9e541106d0]
> > > 3: (void std::vector<pg_t,
> > > mempool::pool_allocator<(mempool::pool_index_t)16, pg_t>
> > > >::_M_emplace_back_aux<pg_t const&>(pg_t const&)+0x69) [0x560c96fb5fd9]
> > > 4: (OSDMapMapping::_build_rmap(OSDMap const&)+0x209) [0x560c96fb5389]
> > > 5: (OSDMapMapping::_finish(OSDMap const&)+0x11) [0x560c96fb53c1]
> > > 6: (ParallelPGMapper::Job::finish_one()+0x82) [0x560c96fb4332]
> > > 7: (ParallelPGMapper::WQ::_process(ParallelPGMapper::Item*,
> > > ThreadPool::TPHandle&)+0x7f) [0x560c96fb43ff]
> > > 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa8e) [0x560c96eaf59e]
> > > 9: (ThreadPool::WorkThread::entry()+0x10) [0x560c96eb0480]
> > > 10: (()+0x7e25) [0x7f9e54108e25]
> > > 11: (clone()+0x6d) [0x7f9e512febad]
> > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> > > needed to interpret this.
> > > ============================
> >
> > I haven't seen this one before.  I take it it happens reliably when
> > you try to start the mon?  Can you reproduce the bug and generate a
> > mon log file by setting 'debug mon = 20'?
> >
> > Thanks!
> > sage
> >
> > > Before this we had changed one of the test pools' size from 3 to 1
> > > for testing purposes.
> > >
> > > Is there a way to recover from this without losing the data?
> > >
> > > Thanks in advance.
> > >
> > > Regards,
> > > Brayan
> > >
> > > --
> > > Best Regards,
> > > Brayan Perera
>
> --
> Best Regards,
> Brayan Perera