Re: all three mons segfault at same time

I am on Trusty also, but my /var/lib/ceph/mon lives on an XFS filesystem.

My mons seem to have stabilized now after upgrading the last of the
OSDs to 0.94.5. No crashes in the last 20 minutes whereas they were
crashing every 1-2 minutes in a rolling fashion the entire time I was
upgrading OSDs.
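
To put numbers on the crash rate, here is a minimal Python sketch that counts
ceph-mon SEGV kills per host and per minute. It is not part of Ceph. It
assumes unwrapped syslog lines in the upstart format quoted further down in
this thread, and the names SEGV_RE, count_segvs and SAMPLE are made up for
illustration.

```python
import re
from collections import Counter

# Matches upstart messages of the form quoted in this thread, for example:
# "Nov 10 10:07:30 lsn-mc1008 kernel: [...] init: ceph-mon (ceph/lsn-mc1008)
#  main process (2254664) killed by SEGV signal" (on a single line).
# Assumption: the syslog timestamp is "Mon DD HH:MM:SS" followed by the host.
SEGV_RE = re.compile(
    r"^(?P<minute>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+(?P<host>\S+)\s+.*"
    r"init: ceph-mon \(ceph/[^)]+\) main process \((?P<pid>\d+)\) "
    r"killed by SEGV signal"
)


def count_segvs(lines):
    """Return (per-host, per-minute) Counters of ceph-mon SEGV kills."""
    per_host = Counter()
    per_minute = Counter()
    for line in lines:
        match = SEGV_RE.search(line)
        if match is None:
            continue  # respawn messages and unrelated lines are skipped
        per_host[match.group("host")] += 1
        per_minute[match.group("minute")] += 1
    return per_host, per_minute


# A few lines taken from the excerpt in this thread, re-joined onto one line.
SAMPLE = [
    "Nov 10 10:07:30 lsn-mc1008 kernel: [6392349.844640] init: ceph-mon "
    "(ceph/lsn-mc1008) main process (2254664) killed by SEGV signal",
    "Nov 10 10:07:30 lsn-mc1008 kernel: [6392349.844648] init: ceph-mon "
    "(ceph/lsn-mc1008) main process ended, respawning",
    "Nov 10 10:07:46 lsn-mc1006 kernel: [6392890.294124] init: ceph-mon "
    "(ceph/lsn-mc1006) main process (2183307) killed by SEGV signal",
    "Nov 10 10:07:46 lsn-mc1007 kernel: [6392599.894914] init: ceph-mon "
    "(ceph/lsn-mc1007) main process (1998234) killed by SEGV signal",
    "Nov 10 10:08:17 lsn-mc1007 kernel: [6392631.192476] init: ceph-mon "
    "(ceph/lsn-mc1007) main process (2006489) killed by SEGV signal",
]

if __name__ == "__main__":
    hosts, minutes = count_segvs(SAMPLE)
    for host, n in sorted(hosts.items()):
        print(host, n)
    for minute, n in sorted(minutes.items()):
        print(minute, n)
```

To run it against a real log, pass an open file object to count_segvs()
instead of SAMPLE. Lines that were wrapped by a mail client would need to be
re-joined first.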

On Tue, Nov 10, 2015 at 12:03 PM, Arnulf Heimsbakk <aheimsbakk@xxxxxx> wrote:
> Hi Logan!
>
> It seems that I've solved the segfaults on my monitors. Maybe not in the
> best way, but they seem to be gone. Originally my monitor servers ran
> Ubuntu Trusty on ext4, but they've now been converted to CentOS 7 with
> XFS as root file system. They've run stable for 24H now.
>
> I'm still running Ubuntu on my OSDs and no issues so far running mixed
> OS. Everything is running 0.94.5.
>
> Not an ideal solution, but I'm preparing to convert OSDs to CentOS too if
> things stay stable over time.
>
> -Arnulf
>
> On 11/10/2015 05:13 PM, Logan V. wrote:
>> I am in the process of upgrading a cluster with mixed 0.94.2/0.94.3 to
>> 0.94.5 this morning and am seeing identical crashes. During the rolling
>> upgrade across the mons, after the 3rd of 3 mons was restarted on
>> 0.94.5, all 3 crashed simultaneously, identical to what you are
>> describing above. Now I am seeing rolling crashes across the 3 mons
>> continually. I am still in the process of upgrading about 200 OSDs to
>> 0.94.5, so most of them are still running 0.94.2 and 0.94.3. There are
>> 3 MDSs running 0.94.5 during these crashes.
>>
>> ==> /var/log/clusterboot/lsn-mc1008/syslog <==
>> Nov 10 10:07:30 lsn-mc1008 kernel: [6392349.844640] init: ceph-mon
>> (ceph/lsn-mc1008) main process (2254664) killed by SEGV signal
>> Nov 10 10:07:30 lsn-mc1008 kernel: [6392349.844648] init: ceph-mon
>> (ceph/lsn-mc1008) main process ended, respawning
>> ==> /var/log/clusterboot/lsn-mc1006/syslog <==
>> Nov 10 10:07:46 lsn-mc1006 kernel: [6392890.294124] init: ceph-mon
>> (ceph/lsn-mc1006) main process (2183307) killed by SEGV signal
>> Nov 10 10:07:46 lsn-mc1006 kernel: [6392890.294132] init: ceph-mon
>> (ceph/lsn-mc1006) main process ended, respawning
>> ==> /var/log/clusterboot/lsn-mc1007/syslog <==
>> Nov 10 10:07:46 lsn-mc1007 kernel: [6392599.894914] init: ceph-mon
>> (ceph/lsn-mc1007) main process (1998234) killed by SEGV signal
>> Nov 10 10:07:46 lsn-mc1007 kernel: [6392599.894923] init: ceph-mon
>> (ceph/lsn-mc1007) main process ended, respawning
>> ==> /var/log/clusterboot/lsn-mc1008/syslog <==
>> Nov 10 10:07:46 lsn-mc1008 kernel: [6392365.959984] init: ceph-mon
>> (ceph/lsn-mc1008) main process (2263082) killed by SEGV signal
>> Nov 10 10:07:46 lsn-mc1008 kernel: [6392365.959992] init: ceph-mon
>> (ceph/lsn-mc1008) main process ended, respawning
>> ==> /var/log/clusterboot/lsn-mc1006/syslog <==
>> Nov 10 10:07:52 lsn-mc1006 kernel: [6392896.674332] init: ceph-mon
>> (ceph/lsn-mc1006) main process (2191273) killed by SEGV signal
>> Nov 10 10:07:52 lsn-mc1006 kernel: [6392896.674340] init: ceph-mon
>> (ceph/lsn-mc1006) main process ended, respawning
>> ==> /var/log/clusterboot/lsn-mc1008/syslog <==
>> Nov 10 10:07:52 lsn-mc1008 kernel: [6392372.324282] init: ceph-mon
>> (ceph/lsn-mc1008) main process (2270979) killed by SEGV signal
>> Nov 10 10:07:52 lsn-mc1008 kernel: [6392372.324295] init: ceph-mon
>> (ceph/lsn-mc1008) main process ended, respawning
>> ==> /var/log/clusterboot/lsn-mc1007/syslog <==
>> Nov 10 10:07:52 lsn-mc1007 kernel: [6392606.272911] init: ceph-mon
>> (ceph/lsn-mc1007) main process (2006118) killed by SEGV signal
>> Nov 10 10:07:52 lsn-mc1007 kernel: [6392606.272995] init: ceph-mon
>> (ceph/lsn-mc1007) main process ended, respawning
>> ==> /var/log/clusterboot/lsn-mc1006/syslog <==
>> Nov 10 10:07:55 lsn-mc1006 kernel: [6392899.046307] init: ceph-mon
>> (ceph/lsn-mc1006) main process (2192187) killed by SEGV signal
>> Nov 10 10:07:55 lsn-mc1006 kernel: [6392899.046315] init: ceph-mon
>> (ceph/lsn-mc1006) main process ended, respawning
>> ==> /var/log/clusterboot/lsn-mc1007/syslog <==
>> Nov 10 10:08:17 lsn-mc1007 kernel: [6392631.192476] init: ceph-mon
>> (ceph/lsn-mc1007) main process (2006489) killed by SEGV signal
>> Nov 10 10:08:17 lsn-mc1007 kernel: [6392631.192484] init: ceph-mon
>> (ceph/lsn-mc1007) main process ended, respawning
>> ==> /var/log/clusterboot/lsn-mc1006/syslog <==
>> Nov 10 10:08:17 lsn-mc1006 kernel: [6392921.600089] init: ceph-mon
>> (ceph/lsn-mc1006) main process (2192298) killed by SEGV signal
>> Nov 10 10:08:17 lsn-mc1006 kernel: [6392921.600108] init: ceph-mon
>> (ceph/lsn-mc1006) main process ended, respawning
>> ==> /var/log/clusterboot/lsn-mc1008/syslog <==
>> Nov 10 10:08:17 lsn-mc1008 kernel: [6392397.277994] init: ceph-mon
>> (ceph/lsn-mc1008) main process (2271246) killed by SEGV signal
>> Nov 10 10:08:17 lsn-mc1008 kernel: [6392397.278002] init: ceph-mon
>> (ceph/lsn-mc1008) main process ended, respawning
>> ==> /var/log/clusterboot/lsn-mc1006/syslog <==
>> Nov 10 10:08:23 lsn-mc1006 kernel: [6392927.999229] init: ceph-mon
>> (ceph/lsn-mc1006) main process (2200399) killed by SEGV signal
>> Nov 10 10:08:23 lsn-mc1006 kernel: [6392927.999242] init: ceph-mon
>> (ceph/lsn-mc1006) main process ended, respawning
>> ==> /var/log/clusterboot/lsn-mc1008/syslog <==
>> Nov 10 10:08:23 lsn-mc1008 kernel: [6392403.641241] init: ceph-mon
>> (ceph/lsn-mc1008) main process (2279050) killed by SEGV signal
>> Nov 10 10:08:23 lsn-mc1008 kernel: [6392403.641254] init: ceph-mon
>> (ceph/lsn-mc1008) main process ended, respawning
>> ==> /var/log/clusterboot/lsn-mc1007/syslog <==
>> Nov 10 10:08:24 lsn-mc1007 kernel: [6392637.614495] init: ceph-mon
>> (ceph/lsn-mc1007) main process (2013418) killed by SEGV signal
>> Nov 10 10:08:24 lsn-mc1007 kernel: [6392637.614504] init: ceph-mon
>> (ceph/lsn-mc1007) main process ended, respawning
>>
>>
>> On Mon, Nov 2, 2015 at 8:35 AM, Arnulf Heimsbakk
>> <arnulf.heimsbakk@xxxxxx> wrote:
>>> When I did an unset noout on the cluster, all three mons got a
>>> segmentation fault, then continued as if nothing had happened. Regular
>>> segmentation faults started on the mons after upgrading to 0.94.5. This
>>> is on Ubuntu Trusty LTS. Has anyone seen anything similar?
>>>
>>> -Arnulf
>>>
>>> Backtraces:
>>>
>>> mon1:
>>>
>>> #0  0x00007f0b2969120b in raise (sig=11)
>>>     at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
>>> #1  0x00000000009adfbd in reraise_fatal (signum=11)
>>>     at global/signal_handler.cc:59
>>> #2  handle_fatal_signal (signum=11) at global/signal_handler.cc:109
>>> #3  <signal handler called>
>>> #4  0x00000000006518e5 in std::_Rb_tree<std::string,
>>> std::pair<std::string const, std::string>,
>>> std::_Select1st<std::pair<std::string const, std::string> >,
>>> std::less<std::string>, std::allocator<std::pair<std::string const,
>>> std::string> > >::find (this=this@entry=0x47dac90, __k=...)
>>>     at /usr/include/c++/4.8/bits/stl_tree.h:1805
>>> #5  0x00000000008a002e in find (__x=..., this=<optimized out>)
>>>     at /usr/include/c++/4.8/bits/stl_map.h:837
>>> #6  get_str_map_key (str_map=..., key=...,
>>>     fallback_key=fallback_key@entry=0xd1d210
>>> <_ZL23CLOG_CONFIG_DEFAULT_KEY>)
>>>     at common/str_map.cc:120
>>> #7  0x00000000006b0a5a in get_facility (channel=..., this=0x47dac30)
>>>     at mon/LogMonitor.h:79
>>> #8  LogMonitor::update_from_paxos (this=0x47dab40,
>>>     need_bootstrap=<optimized out>) at mon/LogMonitor.cc:141
>>> #9  0x000000000060432a in PaxosService::refresh (this=0x47dab40,
>>>     need_bootstrap=need_bootstrap@entry=0x7f0b208b9f3f)
>>>     at mon/PaxosService.cc:128
>>> #10 0x00000000005b03db in Monitor::refresh_from_paxos (this=0x4968000,
>>>     need_bootstrap=need_bootstrap@entry=0x7f0b208b9f3f) at
>>> mon/Monitor.cc:788
>>> #11 0x00000000005eea5e in Paxos::do_refresh (this=this@entry=0x4874dc0)
>>>     at mon/Paxos.cc:1008
>>> #12 0x00000000005f5c83 in Paxos::handle_commit
>>> (this=this@entry=0x4874dc0,
>>>     commit=commit@entry=0x73a7480) at mon/Paxos.cc:933
>>> #13 0x00000000005fd7bb in Paxos::dispatch (this=0x4874dc0,
>>>     m=m@entry=0x73a7480) at mon/Paxos.cc:1399
>>> #14 0x00000000005cf9e3 in Monitor::dispatch (this=this@entry=0x4968000,
>>>     s=s@entry=0x47d7f80, m=m@entry=0x73a7480,
>>>     src_is_mon=src_is_mon@entry=true) at mon/Monitor.cc:3567
>>> #15 0x00000000005cfe36 in Monitor::_ms_dispatch
>>> (this=this@entry=0x4968000,
>>>     m=m@entry=0x73a7480) at mon/Monitor.cc:3376
>>> #16 0x00000000005edb43 in Monitor::ms_dispatch (this=0x4968000,
>>> m=0x73a7480)
>>>     at mon/Monitor.h:833
>>> #17 0x0000000000929679 in ms_deliver_dispatch (m=0x73a7480,
>>> this=0x49be700)
>>>     at ./msg/Messenger.h:567
>>> #18 DispatchQueue::entry (this=0x49be8c8) at
>>> msg/simple/DispatchQueue.cc:185
>>> #19 0x00000000007c99cd in DispatchQueue::DispatchThread::entry (
>>>     this=<optimized out>) at msg/simple/DispatchQueue.h:103
>>> #20 0x00007f0b29689182 in start_thread (arg=0x7f0b208bb700)
>>>     at pthread_create.c:312
>>> #21 0x00007f0b27bf447d in clone ()
>>>     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>>>
>>>
>>> mon2:
>>>
>>> #0  0x00007fd27c06520b in raise () from
>>> /lib/x86_64-linux-gnu/libpthread.so.0
>>> #1  0x00000000009adfbd in reraise_fatal (signum=11)
>>>     at global/signal_handler.cc:59
>>> #2  handle_fatal_signal (signum=11) at global/signal_handler.cc:109
>>> #3  <signal handler called>
>>> #4  0x00000000006518e5 in std::_Rb_tree<std::string,
>>> std::pair<std::string const, std::string>,
>>> std::_Select1st<std::pair<std::string const, std::string> >,
>>> std::less<std::string>, std::allocator<std::pair<std::string const,
>>> std::string> > >::find (this=this@entry=0x36a6390, __k=...)
>>>     at /usr/include/c++/4.8/bits/stl_tree.h:1805
>>> #5  0x00000000008a002e in find (__x=..., this=<optimized out>)
>>>     at /usr/include/c++/4.8/bits/stl_map.h:837
>>> #6  get_str_map_key (str_map=..., key=...,
>>>     fallback_key=fallback_key@entry=0xd1d210
>>> <_ZL23CLOG_CONFIG_DEFAULT_KEY>)
>>>     at common/str_map.cc:120
>>> #7  0x00000000006b0a5a in get_facility (channel=..., this=0x36a6330)
>>>     at mon/LogMonitor.h:79
>>> #8  LogMonitor::update_from_paxos (this=0x36a6240,
>>>     need_bootstrap=<optimized out>) at mon/LogMonitor.cc:141
>>> #9  0x000000000060432a in PaxosService::refresh (this=0x36a6240,
>>>     need_bootstrap=need_bootstrap@entry=0x7fd276f5d6af)
>>>     at mon/PaxosService.cc:128
>>> #10 0x00000000005b03db in Monitor::refresh_from_paxos (this=0x37feb00,
>>>     need_bootstrap=need_bootstrap@entry=0x7fd276f5d6af) at
>>> mon/Monitor.cc:788
>>> #11 0x00000000005eea5e in Paxos::do_refresh (this=this@entry=0x3740dc0)
>>>     at mon/Paxos.cc:1008
>>> #12 0x00000000005fbf39 in Paxos::commit_finish (this=0x3740dc0)
>>>     at mon/Paxos.cc:903
>>> #13 0x000000000060038b in C_Committed::finish (this=0x4600ad0,
>>>     r=<optimized out>) at mon/Paxos.cc:807
>>> #14 0x00000000005d4d89 in Context::complete (this=0x4600ad0,
>>>     r=<optimized out>) at ./include/Context.h:65
>>> #15 0x00000000005ff4bc in MonitorDBStore::C_DoTransaction::finish (
>>>     this=0x38258c0, r=<optimized out>) at mon/MonitorDBStore.h:326
>>> #16 0x00000000005d4d89 in Context::complete (this=0x38258c0,
>>>     r=<optimized out>) at ./include/Context.h:65
>>> #17 0x0000000000717e88 in Finisher::finisher_thread_entry (this=0x3683350)
>>>     at common/Finisher.cc:59
>>> #18 0x00007fd27c05d182 in start_thread ()
>>>    from /lib/x86_64-linux-gnu/libpthread.so.0
>>> #19 0x00007fd27a5c847d in clone () from /lib/x86_64-linux-gnu/libc.so.6
>>>
>>> mon3:
>>>
>>> #0  0x00007f4f0cfce20b in raise () from
>>> /lib/x86_64-linux-gnu/libpthread.so.0
>>> #1  0x00000000009adfbd in reraise_fatal (signum=11)
>>>     at global/signal_handler.cc:59
>>> #2  handle_fatal_signal (signum=11) at global/signal_handler.cc:109
>>> #3  <signal handler called>
>>> #4  0x00000000006518e5 in std::_Rb_tree<std::string,
>>> std::pair<std::string const, std::string>,
>>> std::_Select1st<std::pair<std::string const, std::string> >,
>>> std::less<std::string>, std::allocator<std::pair<std::string const,
>>> std::string> > >::find (this=this@entry=0x35a4c90, __k=...)
>>>     at /usr/include/c++/4.8/bits/stl_tree.h:1805
>>> #5  0x00000000008a002e in find (__x=..., this=<optimized out>)
>>>     at /usr/include/c++/4.8/bits/stl_map.h:837
>>> #6  get_str_map_key (str_map=..., key=...,
>>>     fallback_key=fallback_key@entry=0xd1d210
>>> <_ZL23CLOG_CONFIG_DEFAULT_KEY>)
>>>     at common/str_map.cc:120
>>> #7  0x00000000006b0a5a in get_facility (channel=..., this=0x35a4c30)
>>>     at mon/LogMonitor.h:79
>>> #8  LogMonitor::update_from_paxos (this=0x35a4b40,
>>>     need_bootstrap=<optimized out>) at mon/LogMonitor.cc:141
>>> #9  0x000000000060432a in PaxosService::refresh (this=0x35a4b40,
>>>     need_bootstrap=need_bootstrap@entry=0x7f4f038aef3f)
>>>     at mon/PaxosService.cc:128
>>> #10 0x00000000005b03db in Monitor::refresh_from_paxos (this=0x3d34b00,
>>>     need_bootstrap=need_bootstrap@entry=0x7f4f038aef3f) at
>>> mon/Monitor.cc:788
>>> #11 0x00000000005eea5e in Paxos::do_refresh (this=this@entry=0x363f080)
>>>     at mon/Paxos.cc:1008
>>> #12 0x00000000005f5c83 in Paxos::handle_commit
>>> (this=this@entry=0x363f080,
>>>     commit=commit@entry=0x6c1d900) at mon/Paxos.cc:933
>>> #13 0x00000000005fd7bb in Paxos::dispatch (this=0x363f080,
>>>     m=m@entry=0x6c1d900) at mon/Paxos.cc:1399
>>> #14 0x00000000005cf9e3 in Monitor::dispatch (this=this@entry=0x3d34b00,
>>>     s=s@entry=0x35a2f40, m=m@entry=0x6c1d900,
>>>     src_is_mon=src_is_mon@entry=true) at mon/Monitor.cc:3567
>>> #15 0x00000000005cfe36 in Monitor::_ms_dispatch
>>> (this=this@entry=0x3d34b00,
>>>     m=m@entry=0x6c1d900) at mon/Monitor.cc:3376
>>> #16 0x00000000005edb43 in Monitor::ms_dispatch (this=0x3d34b00,
>>> m=0x6c1d900)
>>>     at mon/Monitor.h:833
>>> #17 0x0000000000929679 in ms_deliver_dispatch (m=0x6c1d900,
>>> this=0x3ccae00)
>>>     at ./msg/Messenger.h:567
>>> #18 DispatchQueue::entry (this=0x3ccafc8) at
>>> msg/simple/DispatchQueue.cc:185
>>> #19 0x00000000007c99cd in DispatchQueue::DispatchThread::entry (
>>>     this=<optimized out>) at msg/simple/DispatchQueue.h:103
>>> #20 0x00007f4f0cfc6182 in start_thread ()
>>>    from /lib/x86_64-linux-gnu/libpthread.so.0
>>> #21 0x00007f4f0b53147d in clone () from /lib/x86_64-linux-gnu/libc.so.6
>>>
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


