Re: Running master MON on FreeBSD 11.2 crashes all the time

kefu chai <tchaikov@xxxxxxxxx> · Tue, 4 Sep 2018 23:36:05 +0800



On Tue, Sep 4, 2018 at 11:12 PM kefu chai <tchaikov@xxxxxxxxx> wrote:
>
> On Tue, Sep 4, 2018 at 12:38 AM Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
> >
> > On 03/09/2018 16:04, Willem Jan Withagen wrote:
> > > Hi,
> > >
> > > For about 2 weeks, master has been crashing on one of my new test servers.
> > > And I can't seem to figure out why.
> > > It even looks like it crashes before ceph-mon starts to fork off
> > > threads, when I'm using vstart.sh to bring up a test cluster.
> > >
> > > Log dump is below, but the last message stems from:
> > >
> > > ./src/mon/OSDMonitor.cc:859
> > > creating_pgs_t
> > > OSDMonitor::update_pending_pgs() {
> > >    ........
> > >    ........
> > >    dout(10) << __func__ << " queue remaining: "
> > >             << pending_creatings.queue.size()
> > >             << " pools" << dendl;
> > >    dout(10) << __func__
> > >             << " " << (pending_creatings.pgs.size() - total)
> > >             << "/" << pending_creatings.pgs.size()
> > >             << " pgs added from queued pools" << dendl;
> > >    return pending_creatings;
> > > }
> > >
> > > Continuing from there gives:
> > > Thread 1 received signal SIGSEGV, Segmentation fault.
> > > 0x0000000803a2587b in OSDMap::check_health (this=0x7fffffff3c68,
> > > checks=0x7fffffff3778)
> > >      at /home/wjw/wip/src/osd/OSDMap.cc:4999
> > > 4999        float fsr = g_conf()->osd_failsafe_full_ratio;
> > > Which is on the lines:
> > > 4995      // OSD_OUT_OF_ORDER_FULL
> > > 4996      {
> > > 4997        // An osd could configure failsafe ratio, to something different
> > > 4998        // but for now assume it is the same here.
> > > 4999        float fsr = g_conf()->osd_failsafe_full_ratio;
> > > 5000        if (fsr > 1.0) fsr /= 100;
> > > 5001        float fr = get_full_ratio();
> > > 5002        float br = get_backfillfull_ratio();
> > > 5003        float nr = get_nearfull_ratio();
> > >
> > > And in the end the debugger tells me it crashed like:
> > > #0  0x0000000803a2587b in OSDMap::check_health (this=0x7fffffff3c68,
> > > checks=0x7fffffff3778)
> > >      at /home/wjw/wip/src/osd/OSDMap.cc:4999
> > > #1  0x0000000000de2c71 in OSDMonitor::encode_pending (this=0x2e24000,
> > > t=...) at /home/wjw/wip/src/mon/OSDMonitor.cc:1459
> > > #2  0x00000000010399cf in PaxosService::propose_pending (this=0x2e24000)
> > > at /home/wjw/wip/src/mon/PaxosService.cc:213
> > > #3  0x000000000103b8b2 in PaxosService::_active (this=0x2e24000) at
> > > /home/wjw/wip/src/mon/PaxosService.cc:334
> > > #4  0x000000000103af96 in PaxosService::election_finished
> > > (this=0x2e24000) at /home/wjw/wip/src/mon/PaxosService.cc:290
> > > #5  0x0000000000bc5969 in Monitor::_finish_svc_election (this=0x2e11000)
> > > at /home/wjw/wip/src/mon/Monitor.cc:1960
> > > #6  0x0000000000bc4af7 in Monitor::win_election (this=0x2e11000,
> > > epoch=7, active=..., features=4611087854031142911,
> > >      mon_features=..., metadata=...) at
> > > /home/wjw/wip/src/mon/Monitor.cc:1994
> > > #7  0x0000000000bb5e28 in Monitor::win_standalone_election
> > > (this=0x2e11000) at /home/wjw/wip/src/mon/Monitor.cc:1935
> > > #8  0x0000000000bb3360 in Monitor::bootstrap (this=0x2e11000) at
> > > /home/wjw/wip/src/mon/Monitor.cc:1033
> > > #9  0x0000000000bb28ed in Monitor::init (this=0x2e11000) at
> > > /home/wjw/wip/src/mon/Monitor.cc:838
> > > #10 0x0000000000b29898 in main (argc=10, argv=0x7fffffffe818) at
> > > /home/wjw/wip/src/ceph_mon.cc:794
> > >
> > > Suggesting that g_conf() returns a pointer that is not valid?
> > >
> > > And it consistently crashes at that location.
> > > As usual, any suggestions are welcome.
> >
> > Did some more GDB tracing after a hunch, and I've now crudely augmented
> > src/common/config_proxy.h:
> >   ConfigValues* operator->() noexcept {
> >     assert((__uint64_t)(&values) != 0x0);
> >     assert((__uint64_t)(&values) > 0xfffff);
> >     return &values;
> >   }
> >
> > Turns out &values = 0x8, which is not much of an address.
> > Normally it is something like: 0x28f1008
> >
> > Not sure whether the 8 is significant here.
>
> I just set up a FreeBSD 11.2 VM and will try to reproduce this issue.
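
On the 0x8 quoted above: one possible reading (a guess, not confirmed
anywhere in this thread) is that the object the operator->() runs on is
itself located at address 0, so the address of its `values` member is
just that member's offset. Below is a minimal sketch of that arithmetic.
FakeProxy and FakeValues are hypothetical stand-ins, not the real
ConfigProxy layout, and the single 8-byte member placed before `values`
is an assumption chosen only to reproduce the observed number on a
typical 64-bit build.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for ConfigValues; the real type is much larger.
struct FakeValues {
  double osd_failsafe_full_ratio;
};

// Hypothetical stand-in for ConfigProxy. Assumption: exactly one
// pointer-sized member precedes `values` (8 bytes on a 64-bit build).
struct FakeProxy {
  void* first_member;
  FakeValues values;
};

// Address that `&proxy->values` would have for a proxy located at `base`.
// Computed with offsetof so that no invalid pointer is ever dereferenced.
constexpr std::uintptr_t values_address(std::uintptr_t base) {
  return base + offsetof(FakeProxy, values);
}

// Same threshold as the crude assert added to operator->() above.
constexpr bool looks_valid(std::uintptr_t addr) {
  return addr > 0xfffff;
}
```

With base = 0, values_address() gives 0x8 and fails the > 0xfffff check.
With an assumed base of 0x28f1000 it gives 0x28f1008, which has the same
shape as the "normal" value reported above. If this reading is right,
the thing to inspect in gdb is whatever g_conf() returns, rather than
`values` itself.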


I get the following warning messages:


/home/kefu/ceph/build/bin/ceph -c /home/kefu/ceph/build/ceph.conf \
    -k /home/kefu/ceph/build/keyring \
    config set mgr mgr/restful/x/server_port 42658
__cxa_thread_call_dtors: dtr 0x8118a9a30 from unloaded dso, skipping
__cxa_thread_call_dtors: dtr 0x8118a9a30 from unloaded dso, skipping
__cxa_thread_call_dtors: dtr 0x8118a9a30 from unloaded dso, skipping


and the monitors are all running:

$ bin/init-ceph status mon 2>&1 | grep -v dangerous
=== mon.a ===
mon.a: running {"version":"Development","release":"nautilus","release_type":"dev"}

=== mon.b ===
mon.b: running {"version":"Development","release":"nautilus","release_type":"dev"}

=== mon.c ===
mon.c: running {"version":"Development","release":"nautilus","release_type":"dev"}


>
> >
> > --WjW
> >


-- 
Regards
Kefu Chai