On Mon, 6 Aug 2018, Brayan Perera wrote:
> Hi Sage,
> 
> Please find the attachment.

You need to wait until these builds complete,
https://shaman.ceph.com/repos/ceph/wip-rmap-thing/520b28b2e461e2fdb317b1566f48ae844c08cb43/,
then install on one of the mon hosts, and then run the command in order
to see the new debug output.  (Or apply the patch locally.)

Thanks!
sage

> 
> On Mon, Aug 6, 2018 at 7:54 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > On Mon, 6 Aug 2018, Brayan Perera wrote:
> >> Hi Sage,
> >>
> >> This happened on a production ceph cluster.  The ceph mons suddenly
> >> went offline.
> >>
> >> When we checked the logs while restarting the ceph-mon services we
> >> got the above error.  The complete log entry is attached as
> >> 'ceph_mon_start_error.txt'.
> >>
> >> The issue started on one of the test pools. ()
> >>
> >> The pool has size 1, but the associated crush rule has min_size set
> >> to 2.  Can this cause the issue?
> >>
> >> As a workaround we were trying to change the crush rule and inject
> >> the crush map, but the crush map is still not getting changed:
> >>
> >> /usr/bin/ceph-monstore-tool /var/lib/ceph/mon/ceph-cephmon rewrite-crush --
> >> --crush ./osdmap_crush.updated_compiled
> >>
> >> Do we have an option to change the pool size in the osdmap directly
> >> inside ceph-mon?
> >
> > I would not mess with the monitor's osdmap until we understand what is
> > wrong.  The sequence you describe shouldn't have caused a crash (and a
> > quick attempt to reproduce this failed).
> >
> > I pushed a branch, https://github.com/ceph/ceph-ci/commit/wip-rmap-thing,
> > with some debug output.  This will build at shaman.ceph.com shortly.  Do
> > you mind running this on one of your mons with
> >
> >   ceph-mon -i `hostname` -d 2>&1 | tee output
> >
> > and capturing the output?  It looks like an out-of-bounds access on
> > acting_rmap but it's not clear to me how that can happen.
> >
> > Thanks!
> > sage
> >
> >> Thanks & Regards,
> >> Brayan
> >>
> >> On Mon, Aug 6, 2018 at 6:50 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >>
> >> > Moving to ceph-devel.
> >> >
> >> > On Mon, 6 Aug 2018, Brayan Perera wrote:
> >> > > Dear All,
> >> > >
> >> > > We have encountered an issue on the ceph mons, where all the
> >> > > ceph-mon instances went offline with the following error.
> >> > >
> >> > > =======================
> >> > >  0> 2018-08-06 18:20:56.373266 7f9e4b98d700 -1 *** Caught signal
> >> > > (Segmentation fault) **
> >> > >  in thread 7f9e4b98d700 thread_name:cpu_tp
> >> > >
> >> > >  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a)
> >> > > luminous (stable)
> >> > >  1: (()+0x8f51b1) [0x560c971ab1b1]
> >> > >  2: (()+0xf6d0) [0x7f9e541106d0]
> >> > >  3: (void std::vector<pg_t,
> >> > > mempool::pool_allocator<(mempool::pool_index_t)16, pg_t>
> >> > > >::_M_emplace_back_aux<pg_t const&>(pg_t const&)+0x69) [0x560c96fb5fd9]
> >> > >  4: (OSDMapMapping::_build_rmap(OSDMap const&)+0x209) [0x560c96fb5389]
> >> > >  5: (OSDMapMapping::_finish(OSDMap const&)+0x11) [0x560c96fb53c1]
> >> > >  6: (ParallelPGMapper::Job::finish_one()+0x82) [0x560c96fb4332]
> >> > >  7: (ParallelPGMapper::WQ::_process(ParallelPGMapper::Item*,
> >> > > ThreadPool::TPHandle&)+0x7f) [0x560c96fb43ff]
> >> > >  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa8e) [0x560c96eaf59e]
> >> > >  9: (ThreadPool::WorkThread::entry()+0x10) [0x560c96eb0480]
> >> > >  10: (()+0x7e25) [0x7f9e54108e25]
> >> > >  11: (clone()+0x6d) [0x7f9e512febad]
> >> > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> >> > > needed to interpret this.
> >> > > ============================
> >> >
> >> > I haven't seen this one before.  I take it it happens reliably when
> >> > you try to start the mon?  Can you reproduce the bug and generate a
> >> > mon log file by setting 'debug mon = 20'?
> >> >
> >> > Thanks!
> >> > sage
> >> >
> >> > > Before this we had changed one of the test pools' size from 3 to 1
> >> > > for testing purposes.
> >> > >
> >> > > Is there a way to recover this without losing the data?
> >> > >
> >> > > Thanks in advance.
> >> > >
> >> > > Regards,
> >> > > Brayan
> >> > >
> >> > > --
> >> > > Best Regards,
> >> > > Brayan Perera
> >>
> >> --
> >> Best Regards,
> >> Brayan Perera
>
> --
> Best Regards,
> Brayan Perera
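
For reference, the crush-rule workaround discussed above normally goes
through a decompile/edit/recompile cycle before the rewrite-crush step.
A minimal sketch, assuming the mon store path and crush file name from
this thread (./osdmap.bin is just a placeholder, and the exact
ceph-monstore-tool option syntax may differ between releases); note
Sage's advice above is to leave the mon's osdmap alone until the crash
is understood:

    # with the mon stopped, extract the osdmap from the mon store and
    # pull its crush map out of it
    ceph-monstore-tool /var/lib/ceph/mon/ceph-cephmon get osdmap -- --out ./osdmap.bin
    osdmaptool ./osdmap.bin --export-crush ./osdmap_crush

    # decompile to text, edit the rule (e.g. its min_size), recompile
    crushtool -d ./osdmap_crush -o ./osdmap_crush.txt
    # ... edit ./osdmap_crush.txt ...
    crushtool -c ./osdmap_crush.txt -o ./osdmap_crush.updated_compiled

    # inject the recompiled crush map back into the mon store
    ceph-monstore-tool /var/lib/ceph/mon/ceph-cephmon rewrite-crush -- \
        --crush ./osdmap_crush.updated_compiled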