Great to hear. Thanks for the help diagnosing the issue. The full fix is here: https://github.com/ceph/ceph/pull/23449 and will hopefully make it into the next round of mimic and luminous releases.

Thanks!
sage

On Tue, 7 Aug 2018, Brayan Perera wrote:
> Hi Sage,
>
> The fix is working and we were able to start the ceph-mon services. The ceph cluster is now up and running.
>
> Thanks a lot for the help.
>
> Regards,
> Brayan
>
> On Mon, Aug 6, 2018 at 10:44 PM, Brayan Perera <brayan.perera@xxxxxxxxx> wrote:
> > Hi Sage,
> >
> > Thanks a lot for your help. Will update you once we apply the fix.
> >
> > Thanks & Regards,
> > Brayan
> >
> > On Mon, Aug 6, 2018 at 10:39 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> >> On Mon, 6 Aug 2018, Brayan Perera wrote:
> >>> Hi Sage,
> >>>
> >>> This is from the osdmap which I have extracted from one of the ceph-mons:
> >>>
> >>> ===================
> >>> pool 111 'moc_test' replicated size 1 min_size 1 crush_rule 2 object_hash rjenkins pg_num 16 pgp_num 16 last_change 6337 flags hashpspool stripe_width 0
> >>>
> >>> pg_temp 111.1 [50,33,25]
> >>> pg_temp 111.2 [50,33,40]
> >>> pg_temp 111.3 [45,40,6]
> >>> pg_temp 111.4 [45,0,33]
> >>> pg_temp 111.7 [6,32,19]
> >>> pg_temp 111.8 [22,14,58]
> >>> pg_temp 111.9 [40,9,50]
> >>> pg_temp 111.a [15,14,22]
> >>> pg_temp 111.b [50,6,22]
> >>> pg_temp 111.c [59,37,19]
> >>> pg_temp 111.d [29,0,52]
> >>> pg_temp 111.f [33,32,6]
> >>> ===================
> >>
> >> Yep, that's the bug. I've opened http://tracker.ceph.com/issues/26866 and pushed a temporary workaround to that same branch that should get your mons started. Builds will take an hour-ish again!
> >>
> >> Thanks-
> >> sage
> >>
> >>> Thanks & Regards,
> >>> Brayan
> >>>
> >>> On Mon, Aug 6, 2018 at 10:21 PM, Brayan Perera <brayan.perera@xxxxxxxxx> wrote:
> >>> > Hi Sage,
> >>> >
> >>> > No, this is a regular pool.
> >>> >
> >>> > This pool is a test pool. The other pools contain production data.
> >>> >
> >>> > Is there a way to get rid of this pool as a workaround?
> >>> >
> >>> > Thanks & Regards,
> >>> > Brayan
> >>> >
> >>> > On Mon, Aug 6, 2018 at 10:20 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> >>> >> Is that an EC pool?
> >>> >>
> >>> >> On Mon, 6 Aug 2018, Brayan Perera wrote:
> >>> >>
> >>> >>> Hi Sage,
> >>> >>>
> >>> >>> Please find the attachment. Sorry, I could not use the ceph-post-file option.
> >>> >>>
> >>> >>> Looks like one of the pg mappings is invalid: 111.4 osd.-1
> >>> >>>
> >>> >>> Is there a fix for this issue?
> >>> >>>
> >>> >>> Really appreciate your help on this.
> >>> >>>
> >>> >>> Thanks & Regards,
> >>> >>> Brayan
> >>> >>>
> >>> >>> On Mon, Aug 6, 2018 at 8:46 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> >>> >>> > no problem, just wanted to make sure we weren't waiting for a build that did exist :)
> >>> >>> >
> >>> >>> > On Mon, 6 Aug 2018, Brayan Perera wrote:
> >>> >>> >
> >>> >>> >> We are on CentOS 7. Sorry, I forgot to mention that.
> >>> >>> >>
> >>> >>> >> Regards,
> >>> >>> >> Brayan
> >>> >>> >>
> >>> >>> >> On Mon, Aug 6, 2018 at 8:43 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> >>> >>> >> > are you on centos7 or ubuntu xenial? those are the two distros being built...
> >>> >>> >> >
> >>> >>> >> > On Mon, 6 Aug 2018, Brayan Perera wrote:
> >>> >>> >> >
> >>> >>> >> >> Hi Sage,
> >>> >>> >> >>
> >>> >>> >> >> Thanks for the change. Will test once the build is available.
> >>> >>> >> >> Is only the ceph-mon rpm enough, or do we have to use the whole repository?
> >>> >>> >> >>
> >>> >>> >> >> Thanks & Regards,
> >>> >>> >> >> Brayan
> >>> >>> >> >>
> >>> >>> >> >> On Mon, Aug 6, 2018 at 8:36 PM, Sage Weil <sage@xxxxxxxxxx> wrote:
> >>> >>> >> >> > Also, feel free to use 'ceph-post-file <file>' instead of an attachment (it will restrict access to the file to ceph developers w/ access to the shared test lab).
> >>> >>> >> >> >
> >>> >>> >> >> > sage
> >>> >>> >> >> >
> >>> >>> >> >> > On Mon, 6 Aug 2018, Sage Weil wrote:
> >>> >>> >> >> >
> >>> >>> >> >> >> On Mon, 6 Aug 2018, Brayan Perera wrote:
> >>> >>> >> >> >> > Hi Sage,
> >>> >>> >> >> >> >
> >>> >>> >> >> >> > Please find the attachment.
> >>> >>> >> >> >>
> >>> >>> >> >> >> You need to wait until these builds complete,
> >>> >>> >> >> >> https://shaman.ceph.com/repos/ceph/wip-rmap-thing/520b28b2e461e2fdb317b1566f48ae844c08cb43/,
> >>> >>> >> >> >> then install on one of the mon hosts, and then run the command in order to see the new debug output. (Or apply the patch locally.)
> >>> >>> >> >> >>
> >>> >>> >> >> >> Thanks!
> >>> >>> >> >> >> sage
> >>> >>> >> >> >>
> >>> >>> >> >> >> >
> >>> >>> >> >> >> > On Mon, Aug 6, 2018 at 7:54 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >>> >>> >> >> >> > > On Mon, 6 Aug 2018, Brayan Perera wrote:
> >>> >>> >> >> >> > >> Hi Sage,
> >>> >>> >> >> >> > >>
> >>> >>> >> >> >> > >> This happened on a production ceph cluster. The ceph mons suddenly went offline.
> >>> >>> >> >> >> > >>
> >>> >>> >> >> >> > >> When we checked the logs while restarting the ceph-mon services, we got the above error. The complete log entry is attached as 'ceph_mon_start_error.txt'.
> >>> >>> >> >> >> > >>
> >>> >>> >> >> >> > >> The issue started on one of the test pools. ()
> >>> >>> >> >> >> > >>
> >>> >>> >> >> >> > >> The pool has size set to 1, but the associated crush rule has min size set to 2. Can this cause the issue?
> >>> >>> >> >> >> > >>
> >>> >>> >> >> >> > >> As a workaround we were trying to change the crush rule and inject the crush map, but the crush map is still not getting changed:
> >>> >>> >> >> >> > >>
> >>> >>> >> >> >> > >> /usr/bin/ceph-monstore-tool /var/lib/ceph/mon/ceph-cephmon rewrite-crush -- --crush ./osdmap_crush.updated_compiled
> >>> >>> >> >> >> > >>
> >>> >>> >> >> >> > >> Do we have an option to change the pool size in the osdmap directly inside ceph-mon?
> >>> >>> >> >> >> > >
> >>> >>> >> >> >> > > I would not mess with the monitor's osdmap until we understand what is wrong. The sequence you describe shouldn't have caused a crash (and a quick attempt to reproduce this failed).
> >>> >>> >> >> >> > >
> >>> >>> >> >> >> > > I pushed a branch, https://github.com/ceph/ceph-ci/commit/wip-rmap-thing, with some debug output. This will build at shaman.ceph.com shortly. Do you mind running this on one of your mons with
> >>> >>> >> >> >> > >
> >>> >>> >> >> >> > > ceph-mon -i `hostname` -d 2>&1 | tee output
> >>> >>> >> >> >> > >
> >>> >>> >> >> >> > > and capturing the output?
> >>> >>> >> >> >> > > It looks like an out-of-bounds access on acting_rmap, but it's not clear to me how that can happen.
> >>> >>> >> >> >> > >
> >>> >>> >> >> >> > > Thanks!
> >>> >>> >> >> >> > > sage
> >>> >>> >> >> >> > >
> >>> >>> >> >> >> > >>
> >>> >>> >> >> >> > >> Thanks & Regards,
> >>> >>> >> >> >> > >> Brayan
> >>> >>> >> >> >> > >>
> >>> >>> >> >> >> > >> On Mon, Aug 6, 2018 at 6:50 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >>> >>> >> >> >> > >> >
> >>> >>> >> >> >> > >> > Moving to ceph-devel.
> >>> >>> >> >> >> > >> >
> >>> >>> >> >> >> > >> > On Mon, 6 Aug 2018, Brayan Perera wrote:
> >>> >>> >> >> >> > >> > > Dear All,
> >>> >>> >> >> >> > >> > >
> >>> >>> >> >> >> > >> > > We have encountered an issue on the ceph mons, where all the ceph-mon instances went offline with the following error.
> >>> >>> >> >> >> > >> > >
> >>> >>> >> >> >> > >> > > =======================
> >>> >>> >> >> >> > >> > >     0> 2018-08-06 18:20:56.373266 7f9e4b98d700 -1 *** Caught signal (Segmentation fault) **
> >>> >>> >> >> >> > >> > >  in thread 7f9e4b98d700 thread_name:cpu_tp
> >>> >>> >> >> >> > >> > >
> >>> >>> >> >> >> > >> > >  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
> >>> >>> >> >> >> > >> > >  1: (()+0x8f51b1) [0x560c971ab1b1]
> >>> >>> >> >> >> > >> > >  2: (()+0xf6d0) [0x7f9e541106d0]
> >>> >>> >> >> >> > >> > >  3: (void std::vector<pg_t, mempool::pool_allocator<(mempool::pool_index_t)16, pg_t> >::_M_emplace_back_aux<pg_t const&>(pg_t const&)+0x69) [0x560c96fb5fd9]
> >>> >>> >> >> >> > >> > >  4: (OSDMapMapping::_build_rmap(OSDMap const&)+0x209) [0x560c96fb5389]
> >>> >>> >> >> >> > >> > >  5: (OSDMapMapping::_finish(OSDMap const&)+0x11) [0x560c96fb53c1]
> >>> >>> >> >> >> > >> > >  6: (ParallelPGMapper::Job::finish_one()+0x82) [0x560c96fb4332]
> >>> >>> >> >> >> > >> > >  7: (ParallelPGMapper::WQ::_process(ParallelPGMapper::Item*, ThreadPool::TPHandle&)+0x7f) [0x560c96fb43ff]
> >>> >>> >> >> >> > >> > >  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa8e) [0x560c96eaf59e]
> >>> >>> >> >> >> > >> > >  9: (ThreadPool::WorkThread::entry()+0x10) [0x560c96eb0480]
> >>> >>> >> >> >> > >> > >  10: (()+0x7e25) [0x7f9e54108e25]
> >>> >>> >> >> >> > >> > >  11: (clone()+0x6d) [0x7f9e512febad]
> >>> >>> >> >> >> > >> > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >>> >>> >> >> >> > >> > > ============================
> >>> >>> >> >> >> > >> >
> >>> >>> >> >> >> > >> > I haven't seen this one before. I take it it happens reliably when you try to start the OSD? Can you reproduce the bug and generate a mon log file by setting 'debug mon = 20'?
> >>> >>> >> >> >> > >> >
> >>> >>> >> >> >> > >> > Thanks!
> >>> >>> >> >> >> > >> > sage
> >>> >>> >> >> >> > >> >
> >>> >>> >> >> >> > >> > >
> >>> >>> >> >> >> > >> > > Before this we had changed one of the test pools' size from 3 to 1 for testing purposes.
> >>> >>> >> >> >> > >> > >
> >>> >>> >> >> >> > >> > > Is there a way to recover this without losing the data?
> >>> >>> >> >> >> > >> > >
> >>> >>> >> >> >> > >> > > Thanks in advance.
> >>> >>> >> >> >> > >> > >
> >>> >>> >> >> >> > >> > > Regards,
> >>> >>> >> >> >> > >> > > Brayan
> >>> >>> >> >> >> > >> > >
> >>> >>> >> >> >> > >> > > --
> >>> >>> >> >> >> > >> > > Best Regards,
> >>> >>> >> >> >> > >> > > Brayan Perera
> >>> >>> >> >> >> > >>
> >>> >>> >> >> >> > >> --
> >>> >>> >> >> >> > >> Best Regards,
> >>> >>> >> >> >> > >> Brayan Perera
> >>> >>> >> >> >> >
> >>> >>> >> >> >> > --
> >>> >>> >> >> >> > Best Regards,
> >>> >>> >> >> >> > Brayan Perera
> >>> >>> >> >>
> >>> >>> >> >> --
> >>> >>> >> >> Best Regards,
> >>> >>> >> >> Brayan Perera
> >>> >>> >>
> >>> >>> >> --
> >>> >>> >> Best Regards,
> >>> >>> >> Brayan Perera
> >>> >>>
> >>> >>> --
> >>> >>> Best Regards,
> >>> >>> Brayan Perera
> >>> >
> >>> > --
> >>> > Best Regards,
> >>> > Brayan Perera
> >>>
> >>> --
> >>> Best Regards,
> >>> Brayan Perera
> >
> > --
> > Best Regards,
> > Brayan Perera
>
> --
> Best Regards,
> Brayan Perera
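For anyone hitting the same crash, here is a minimal sketch of the failure mode discussed in the thread: pool 111 was shrunk to size 1 while leftover pg_temp entries still carried three-OSD acting sets, and the attached dump reportedly showed an invalid id (111.4 osd.-1) in the computed mapping. Building a reverse map (OSD id -> PGs) from such an acting set without a bounds check would index past the end of the per-OSD vector, which is consistent with the segfault under OSDMapMapping::_build_rmap. The snippet is illustrative only, not the actual Ceph code; its structure, names, and the guard are assumptions, with the pg/OSD values taken from the thread.

// Illustrative sketch only -- not the actual OSDMapMapping::_build_rmap code.
// It mimics building a reverse map (osd id -> pgs) from per-pg acting sets,
// and shows the kind of guard needed when an id such as -1 sneaks in.
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

struct pg_id {
  int64_t pool;
  uint32_t seed;
};

int main() {
  const int max_osd = 64;  // assumed cluster max_osd, for the sketch only

  // pg -> acting set, loosely modelled on pool 111 from the thread: the pool
  // was shrunk to size 1 but stale pg_temp entries still list three osds,
  // and 111.4 was reported with an invalid osd id of -1.
  std::map<uint32_t, std::vector<int>> acting = {
      {0x1, {50, 33, 25}},
      {0x4, {45, -1, 33}},  // hypothetical bad entry for illustration
  };

  std::vector<std::vector<pg_id>> acting_rmap(max_osd);

  for (const auto& [seed, osds] : acting) {
    for (int osd : osds) {
      // Unguarded version (what the backtrace suggests):
      //   acting_rmap[osd].push_back({111, seed});  // osd == -1 -> out of bounds
      // Guarded version, the kind of check a temporary workaround has to make:
      if (osd < 0 || osd >= max_osd) {
        std::cerr << "skipping invalid osd " << osd << " for pg 111."
                  << std::hex << seed << std::dec << "\n";
        continue;
      }
      acting_rmap[osd].push_back({111, seed});
    }
  }
  std::cout << "reverse map built for " << acting.size() << " pgs\n";
  return 0;
}

Compiled as C++17, this prints a warning for the invalid id instead of crashing; the authoritative change is the fix referenced at the top of the thread, https://github.com/ceph/ceph/pull/23449.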