On Tue, Sep 4, 2018 at 12:38 AM Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
>
> On 03/09/2018 16:04, Willem Jan Withagen wrote:
> > Hi,
> >
> > For about two weeks, master has been crashing on one of my new test
> > servers, and I can't seem to figure out why.
> > It even looks like it crashes before ceph-mon starts to fork off
> > threads, when I'm using vstart.sh to get a test cluster.
> >
> > Log dump is below, but the last message stems from:
> >
> > ./src/mon/OSDMonitor.cc:859
> > creating_pgs_t
> > OSDMonitor::update_pending_pgs() {
> > ........
> > ........
> >   dout(10) << __func__ << " queue remaining: "
> >            << pending_creatings.queue.size()
> >            << " pools" << dendl;
> >   dout(10) << __func__
> >            << " " << (pending_creatings.pgs.size() - total)
> >            << "/" << pending_creatings.pgs.size()
> >            << " pgs added from queued pools" << dendl;
> >   return pending_creatings;
> > }
> >
> > Continuing from there gives:
> > Thread 1 received signal SIGSEGV, Segmentation fault.
> > 0x0000000803a2587b in OSDMap::check_health (this=0x7fffffff3c68,
> >     checks=0x7fffffff3778)
> >     at /home/wjw/wip/src/osd/OSDMap.cc:4999
> > 4999        float fsr = g_conf()->osd_failsafe_full_ratio;
> >
> > Which is on the lines:
> > 4995      // OSD_OUT_OF_ORDER_FULL
> > 4996      {
> > 4997        // An osd could configure failsafe ratio, to something different
> > 4998        // but for now assume it is the same here.
> > 4999        float fsr = g_conf()->osd_failsafe_full_ratio;
> > 5000        if (fsr > 1.0) fsr /= 100;
> > 5001        float fr = get_full_ratio();
> > 5002        float br = get_backfillfull_ratio();
> > 5003        float nr = get_nearfull_ratio();
> >
> > And in the end the debugger tells me it crashed like:
> > #0  0x0000000803a2587b in OSDMap::check_health (this=0x7fffffff3c68,
> >     checks=0x7fffffff3778)
> >     at /home/wjw/wip/src/osd/OSDMap.cc:4999
> > #1  0x0000000000de2c71 in OSDMonitor::encode_pending (this=0x2e24000,
> >     t=...)
> >     at /home/wjw/wip/src/mon/OSDMonitor.cc:1459
> > #2  0x00000000010399cf in PaxosService::propose_pending (this=0x2e24000)
> >     at /home/wjw/wip/src/mon/PaxosService.cc:213
> > #3  0x000000000103b8b2 in PaxosService::_active (this=0x2e24000)
> >     at /home/wjw/wip/src/mon/PaxosService.cc:334
> > #4  0x000000000103af96 in PaxosService::election_finished (this=0x2e24000)
> >     at /home/wjw/wip/src/mon/PaxosService.cc:290
> > #5  0x0000000000bc5969 in Monitor::_finish_svc_election (this=0x2e11000)
> >     at /home/wjw/wip/src/mon/Monitor.cc:1960
> > #6  0x0000000000bc4af7 in Monitor::win_election (this=0x2e11000,
> >     epoch=7, active=..., features=4611087854031142911,
> >     mon_features=..., metadata=...)
> >     at /home/wjw/wip/src/mon/Monitor.cc:1994
> > #7  0x0000000000bb5e28 in Monitor::win_standalone_election (this=0x2e11000)
> >     at /home/wjw/wip/src/mon/Monitor.cc:1935
> > #8  0x0000000000bb3360 in Monitor::bootstrap (this=0x2e11000)
> >     at /home/wjw/wip/src/mon/Monitor.cc:1033
> > #9  0x0000000000bb28ed in Monitor::init (this=0x2e11000)
> >     at /home/wjw/wip/src/mon/Monitor.cc:838
> > #10 0x0000000000b29898 in main (argc=10, argv=0x7fffffffe818)
> >     at /home/wjw/wip/src/ceph_mon.cc:794
> >
> > This suggests that g_conf() returns a pointer that is not valid?
> >
> > And it consistently crashes at this location.
> > As usual, any suggestions are welcome.
>
> Did some more GDB tracing after a hunch, and I've now crudely augmented
> src/common/config_proxy.h:
>   ConfigValues* operator->() noexcept {
>     assert((__uint64_t)(&values) != 0x0);
>     assert((__uint64_t)(&values) > 0xfffff);
>     return &values;
>   }
>
> Turns out values = 0x8... Not much of an address.
> Normally it is something like: 0x28f1008
>
> Not sure whether the 8 is significant here.

I just set up a FreeBSD 11.2 VM and will try to reproduce this issue.

>
> --WjW
>

-- 
Regards
Kefu Chai
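
A note on the 0x8 value quoted above. One possible reading, not confirmed anywhere in this thread, is that the object holding `values` itself sits at address 0 and `values` is a member 8 bytes into it, so `&values` comes out as 0 + 8. The sketch below only illustrates that arithmetic. `ConfigProxySketch`, `ConfigValuesSketch`, `leading_member`, and both helper functions are hypothetical names, and the layout is an assumption rather than the real layout in src/common/config_proxy.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins; NOT the real Ceph types or their real layout.
struct ConfigValuesSketch {
  float osd_failsafe_full_ratio;
};

struct ConfigProxySketch {
  std::uint64_t leading_member;  // assumed: 8 bytes stored before `values`
  ConfigValuesSketch values;
};

// Address that `&proxy->values` would evaluate to if the proxy lived at
// `base`. Pure integer arithmetic: nothing is dereferenced.
std::uintptr_t values_address(std::uintptr_t base) {
  return base + offsetof(ConfigProxySketch, values);
}

// Same threshold as the crude assert in the quoted message: an address at or
// below 0xfffff looks like "null base plus a small member offset".
bool looks_like_null_plus_offset(std::uintptr_t addr) {
  return addr <= 0xfffff;
}

// Sanity check of the sketch itself: a null base yields the member offset.
void self_check() {
  assert(values_address(0) == offsetof(ConfigProxySketch, values));
}
```

Under this assumed layout, a base of 0 gives 0x8 and a base of 0x28f1000 gives 0x28f1008, which matches the two values reported above. If that reading holds, the pointer to look at in GDB would be whatever g_conf() dereferences to reach the proxy, rather than `values` itself.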