Re: Running master MON on FreeBSD 11.2 crashes all the time

On 03/09/2018 16:04, Willem Jan Withagen wrote:
Hi,

For about two weeks now, master has been crashing on one of my new test servers,
and I can't seem to figure out why.
It even looks like it crashes before ceph-mon starts to fork off threads, when I'm using vstart.sh to set up a test cluster.

Log dump is below, but the last message stems from:

./src/mon/OSDMonitor.cc:859
creating_pgs_t
OSDMonitor::update_pending_pgs() {
  ........
  ........
  dout(10) << __func__ << " queue remaining: " << pending_creatings.queue.size()
           << " pools" << dendl;
  dout(10) << __func__
           << " " << (pending_creatings.pgs.size() - total)
           << "/" << pending_creatings.pgs.size()
           << " pgs added from queued pools" << dendl;
  return pending_creatings;
}

Continuing from there gives:
Thread 1 received signal SIGSEGV, Segmentation fault.
0x0000000803a2587b in OSDMap::check_health (this=0x7fffffff3c68, checks=0x7fffffff3778)
     at /home/wjw/wip/src/osd/OSDMap.cc:4999
4999        float fsr = g_conf()->osd_failsafe_full_ratio;
That is in these lines:
4995      // OSD_OUT_OF_ORDER_FULL
4996      {
4997        // An osd could configure failsafe ratio, to something different
4998        // but for now assume it is the same here.
4999        float fsr = g_conf()->osd_failsafe_full_ratio;
5000        if (fsr > 1.0) fsr /= 100;
5001        float fr = get_full_ratio();
5002        float br = get_backfillfull_ratio();
5003        float nr = get_nearfull_ratio();

In the end, the debugger gives this backtrace for the crash:
#0  0x0000000803a2587b in OSDMap::check_health (this=0x7fffffff3c68, checks=0x7fffffff3778)
     at /home/wjw/wip/src/osd/OSDMap.cc:4999
#1  0x0000000000de2c71 in OSDMonitor::encode_pending (this=0x2e24000, t=...) at /home/wjw/wip/src/mon/OSDMonitor.cc:1459
#2  0x00000000010399cf in PaxosService::propose_pending (this=0x2e24000) at /home/wjw/wip/src/mon/PaxosService.cc:213
#3  0x000000000103b8b2 in PaxosService::_active (this=0x2e24000) at /home/wjw/wip/src/mon/PaxosService.cc:334
#4  0x000000000103af96 in PaxosService::election_finished (this=0x2e24000) at /home/wjw/wip/src/mon/PaxosService.cc:290
#5  0x0000000000bc5969 in Monitor::_finish_svc_election (this=0x2e11000) at /home/wjw/wip/src/mon/Monitor.cc:1960
#6  0x0000000000bc4af7 in Monitor::win_election (this=0x2e11000, epoch=7, active=..., features=4611087854031142911,
     mon_features=..., metadata=...) at /home/wjw/wip/src/mon/Monitor.cc:1994
#7  0x0000000000bb5e28 in Monitor::win_standalone_election (this=0x2e11000) at /home/wjw/wip/src/mon/Monitor.cc:1935
#8  0x0000000000bb3360 in Monitor::bootstrap (this=0x2e11000) at /home/wjw/wip/src/mon/Monitor.cc:1033
#9  0x0000000000bb28ed in Monitor::init (this=0x2e11000) at /home/wjw/wip/src/mon/Monitor.cc:838
#10 0x0000000000b29898 in main (argc=10, argv=0x7fffffffe818) at /home/wjw/wip/src/ceph_mon.cc:794

This suggests that g_conf() returns a pointer that is not valid?

And it consistently crashes at this location.
As usual, any suggestions are welcome.

Did some more GDB tracing after a hunch, and I've now crudely augmented
src/common/config_proxy.h:
  ConfigValues* operator->() noexcept {
    assert((__uint64_t)(&values) != 0x0);
    assert((__uint64_t)(&values) > 0xfffff);
    return &values;
  }

It turns out that the address of values is 0x8, which is not much of an address.
Normally it is something like 0x28f1008.

Not sure whether the 8 is the significant part here.
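
One possible reading of the 0x8, offered as a sketch and not as a diagnosis: a small address like this is what you get when a member is reached through an object pointer that is itself null, because the result is just the member's offset inside the object. The types below are hypothetical stand-ins (FakeProxy, FakeValues and the 8-byte member ahead of values are assumptions, not the real ConfigProxy layout), and the helper names are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins: the real ConfigProxy and ConfigValues look different.
struct FakeValues {
  int dummy;
};

struct FakeProxy {
  void* something_before;  // assumed pointer-sized member placed ahead of `values`
  FakeValues values;
};

// Address that `&proxy->values` works out to when the proxy object sits at
// `base`. Plain integer arithmetic, so no invalid pointer is dereferenced.
std::uintptr_t values_address(std::uintptr_t base) {
  return base + offsetof(FakeProxy, values);
}

// The same plausibility test as the second assert in the augmented operator->.
bool plausible_address(std::uintptr_t addr) {
  return addr > 0xfffff;
}
```

On a 64-bit build, values_address(0) comes out as 8 and fails plausible_address(), while a base of 0x28f1000 gives 0x28f1008 and passes. If the real layout is similar, that would point at the proxy object itself being null or not yet constructed, but that remains an assumption until checked in GDB.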

--WjW



