On Tue, Sep 4, 2018 at 11:12 PM kefu chai <tchaikov@xxxxxxxxx> wrote:
>
> On Tue, Sep 4, 2018 at 12:38 AM Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
> >
> > On 03/09/2018 16:04, Willem Jan Withagen wrote:
> > > Hi,
> > >
> > > For about 2 weeks now, master has been crashing on one of my new test
> > > servers, and I can't seem to figure out why.
> > > It even looks like it crashes before ceph-mon starts to fork off
> > > threads, when I'm using vstart.sh to get a test cluster.
> > >
> > > Log dump is below, but the last message stems from:
> > >
> > > ./src/mon/OSDMonitor.cc:859
> > > creating_pgs_t
> > > OSDMonitor::update_pending_pgs() {
> > >   ........
> > >   ........
> > >   dout(10) << __func__ << " queue remaining: " <<
> > >     pending_creatings.queue.size()
> > >            << " pools" << dendl;
> > >   dout(10) << __func__
> > >            << " " << (pending_creatings.pgs.size() - total)
> > >            << "/" << pending_creatings.pgs.size()
> > >            << " pgs added from queued pools" << dendl;
> > >   return pending_creatings;
> > > }
> > >
> > > Continuing from there gives:
> > > Thread 1 received signal SIGSEGV, Segmentation fault.
> > > 0x0000000803a2587b in OSDMap::check_health (this=0x7fffffff3c68,
> > >     checks=0x7fffffff3778)
> > >     at /home/wjw/wip/src/osd/OSDMap.cc:4999
> > > 4999      float fsr = g_conf()->osd_failsafe_full_ratio;
> > > Which is on the lines:
> > > 4995    // OSD_OUT_OF_ORDER_FULL
> > > 4996    {
> > > 4997      // An osd could configure failsafe ratio, to something different
> > > 4998      // but for now assume it is the same here.
> > > 4999      float fsr = g_conf()->osd_failsafe_full_ratio;
> > > 5000      if (fsr > 1.0) fsr /= 100;
> > > 5001      float fr = get_full_ratio();
> > > 5002      float br = get_backfillfull_ratio();
> > > 5003      float nr = get_nearfull_ratio();
> > >
> > > And in the end the debugger tells me it crashed like:
> > > #0  0x0000000803a2587b in OSDMap::check_health (this=0x7fffffff3c68,
> > >     checks=0x7fffffff3778)
> > >     at /home/wjw/wip/src/osd/OSDMap.cc:4999
> > > #1  0x0000000000de2c71 in OSDMonitor::encode_pending (this=0x2e24000,
> > >     t=...) at /home/wjw/wip/src/mon/OSDMonitor.cc:1459
> > > #2  0x00000000010399cf in PaxosService::propose_pending (this=0x2e24000)
> > >     at /home/wjw/wip/src/mon/PaxosService.cc:213
> > > #3  0x000000000103b8b2 in PaxosService::_active (this=0x2e24000)
> > >     at /home/wjw/wip/src/mon/PaxosService.cc:334
> > > #4  0x000000000103af96 in PaxosService::election_finished (this=0x2e24000)
> > >     at /home/wjw/wip/src/mon/PaxosService.cc:290
> > > #5  0x0000000000bc5969 in Monitor::_finish_svc_election (this=0x2e11000)
> > >     at /home/wjw/wip/src/mon/Monitor.cc:1960
> > > #6  0x0000000000bc4af7 in Monitor::win_election (this=0x2e11000,
> > >     epoch=7, active=..., features=4611087854031142911,
> > >     mon_features=..., metadata=...)
> > >     at /home/wjw/wip/src/mon/Monitor.cc:1994
> > > #7  0x0000000000bb5e28 in Monitor::win_standalone_election (this=0x2e11000)
> > >     at /home/wjw/wip/src/mon/Monitor.cc:1935
> > > #8  0x0000000000bb3360 in Monitor::bootstrap (this=0x2e11000)
> > >     at /home/wjw/wip/src/mon/Monitor.cc:1033
> > > #9  0x0000000000bb28ed in Monitor::init (this=0x2e11000)
> > >     at /home/wjw/wip/src/mon/Monitor.cc:838
> > > #10 0x0000000000b29898 in main (argc=10, argv=0x7fffffffe818)
> > >     at /home/wjw/wip/src/ceph_mon.cc:794
> > >
> > > This suggests that g_conf() returns a pointer that is not valid...?
> > >
> > > And it consistently crashes at this location...
> > > As usual, any suggestions are welcome.
> >
> > Did some more GDB tracing after a hunch, and I've now crudely augmented
> > src/common/config_proxy.h:
> >  41   ConfigValues* operator->() noexcept {
> >  42     assert((__uint64_t)(&values) != 0x0);
> >  43     assert((__uint64_t)(&values) > 0xfffff);
> >  44     return &values;
> >  45   }
> >
> > Turns out &values = 0x8... Not much of an address.
> > Normally it is something like: 0x28f1008
> >
> > Not sure if the 8 is the significant denominator here.
>
> I just set up a FreeBSD 11.2 VM and will try to reproduce this issue.

I see the following warning messages:

/home/kefu/ceph/build/bin/ceph -c /home/kefu/ceph/build/ceph.conf -k /home/kefu/ceph/build/keyring config set mgr mgr/restful/x/server_port 42658
__cxa_thread_call_dtors: dtr 0x8118a9a30 from unloaded dso, skipping
__cxa_thread_call_dtors: dtr 0x8118a9a30 from unloaded dso, skipping
__cxa_thread_call_dtors: dtr 0x8118a9a30 from unloaded dso, skipping

and I have:

$ bin/init-ceph status mon 2>&1 | grep -v dangerous
=== mon.a ===
mon.a: running {"version":"Development","release":"nautilus","release_type":"dev"}
=== mon.b ===
mon.b: running {"version":"Development","release":"nautilus","release_type":"dev"}
=== mon.c ===
mon.c: running {"version":"Development","release":"nautilus","release_type":"dev"}

> > >
> > --WjW
> >
> >
> --
> Regards
> Kefu Chai

--
Regards
Kefu Chai
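
A note on the 0x8 address reported by the config_proxy.h asserts quoted above: an address that small is what one would typically see when a member is reached through a null object pointer, because the result is then just the member's offset within the object. That may hint that whatever object sits behind g_conf() was null or not yet set up at that point, though that is only a guess. The sketch below illustrates the arithmetic with hypothetical stand-in types. Values, Proxy and both helper functions are made up for illustration and are not the real Ceph classes, and the 4096-byte cutoff is an assumption about the page size. It uses offsetof, so nothing is actually dereferenced.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for illustration only; NOT the real Ceph types.
struct Values {
  float osd_failsafe_full_ratio;
};

struct Proxy {
  void* first_member;  // any pointer-sized member laid out before `values`
  Values values;
};

// The address `&proxy->values` would evaluate to if `proxy` were null.
// Computed with offsetof, so no null pointer is ever dereferenced.
constexpr std::uintptr_t address_of_values_via_null() {
  return 0 + offsetof(Proxy, values);
}

// Heuristic (assumes a 4 KiB first page): an address below 4096 is far more
// likely to be "null plus a member offset" than a real object.
constexpr bool looks_like_null_plus_offset(std::uintptr_t addr) {
  return addr < 4096;
}
```

With these stand-ins, address_of_values_via_null() equals sizeof(void*), which is 8 on a 64-bit build, so 0x8 trips the heuristic while an address such as 0x28f1008 does not.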