Hi Sage,
I just had a 0.61.6 monitor crash, along with one OSD. The mon and all OSDs
restarted just fine after the update, but the crash came after 15
minutes or so. See the snippets of the logfiles below. I have sent you a
link to the logfiles and the monitor store. It seems the bug hasn't been
fully fixed, or something else is going on. I should note, though, that
I had one monitor with a clock skew warning for a few minutes (this
happened because of a reboot and was fixed by ntp). So beware when
upgrading.
Cheers,
mon:
--- begin dump of recent events ---
0> 2013-07-24 09:42:57.655257 7f262392e780 -1 *** Caught signal
(Aborted) **
in thread 7f262392e780
ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)
1: /usr/bin/ceph-mon() [0x597cfa]
2: (()+0xfcb0) [0x7f2622fc8cb0]
3: (gsignal()+0x35) [0x7f2621b9e425]
4: (abort()+0x17b) [0x7f2621ba1b8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f26224f069d]
6: (()+0xb5846) [0x7f26224ee846]
7: (()+0xb5873) [0x7f26224ee873]
8: (()+0xb596e) [0x7f26224ee96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1df) [0x64ffaf]
10: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77]
11: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b]
12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617]
13: (Monitor::init_paxos()+0xf5) [0x48e7d5]
14: (Monitor::preinit()+0x6ac) [0x4a4e6c]
15: (main()+0x1c19) [0x4835c9]
16: (__libc_start_main()+0xed) [0x7f2621b8976d]
17: /usr/bin/ceph-mon() [0x485eed]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
0/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mon.ceph3.log
--- end dump of recent events ---
2013-07-24 09:42:57.935730 7fb08d67a780 0 ceph version 0.61.6
(59ddece17e36fef69ecf40e239aeffad33c9db35), process ceph-mon, pid
19878
2013-07-24 09:42:57.943330 7fb08d67a780 1 mon.ceph3@-1(probing) e1
preinit fsid 97e515bb-d334-4fa7-8b53-7d85615809fd
2013-07-24 09:42:57.966551 7fb08d67a780 -1 mon/OSDMonitor.cc: In
function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread
7fb08d67a780 time 2013-07-24 09:42:57.964379
mon/OSDMonitor.cc: 167: FAILED assert(latest_bl.length() != 0)
ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)
1: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77]
2: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b]
3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617]
4: (Monitor::init_paxos()+0xf5) [0x48e7d5]
5: (Monitor::preinit()+0x6ac) [0x4a4e6c]
6: (main()+0x1c19) [0x4835c9]
7: (__libc_start_main()+0xed) [0x7fb08b8d576d]
8: /usr/bin/ceph-mon() [0x485eed]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- begin dump of recent events ---
-25> 2013-07-24 09:42:57.933545 7fb08d67a780 5 asok(0x1a1e000)
register_command perfcounters_dump hook 0x1a13010
-24> 2013-07-24 09:42:57.933581 7fb08d67a780 5 asok(0x1a1e000)
register_command 1 hook 0x1a13010
-23> 2013-07-24 09:42:57.933584 7fb08d67a780 5 asok(0x1a1e000)
register_command perf dump hook 0x1a13010
-22> 2013-07-24 09:42:57.933592 7fb08d67a780 5 asok(0x1a1e000)
register_command perfcounters_schema hook 0x1a13010
-21> 2013-07-24 09:42:57.933595 7fb08d67a780 5 asok(0x1a1e000)
register_command 2 hook 0x1a13010
-20> 2013-07-24 09:42:57.933597 7fb08d67a780 5 asok(0x1a1e000)
register_command perf schema hook 0x1a13010
-19> 2013-07-24 09:42:57.933601 7fb08d67a780 5 asok(0x1a1e000)
register_command config show hook 0x1a13010
-18> 2013-07-24 09:42:57.933604 7fb08d67a780 5 asok(0x1a1e000)
register_command config set hook 0x1a13010
-17> 2013-07-24 09:42:57.933606 7fb08d67a780 5 asok(0x1a1e000)
register_command log flush hook 0x1a13010
-16> 2013-07-24 09:42:57.933609 7fb08d67a780 5 asok(0x1a1e000)
register_command log dump hook 0x1a13010
-15> 2013-07-24 09:42:57.933612 7fb08d67a780 5 asok(0x1a1e000)
register_command log reopen hook 0x1a13010
-14> 2013-07-24 09:42:57.935730 7fb08d67a780 0 ceph version
0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35), process ceph-mon,
pid 19878
-13> 2013-07-24 09:42:57.938697 7fb08d67a780 5 asok(0x1a1e000)
init /var/run/ceph/ceph-mon.ceph3.asok
-12> 2013-07-24 09:42:57.938732 7fb08d67a780 5 asok(0x1a1e000)
bind_and_listen /var/run/ceph/ceph-mon.ceph3.asok
-11> 2013-07-24 09:42:57.938775 7fb08d67a780 5 asok(0x1a1e000)
register_command 0 hook 0x1a120b0
-10> 2013-07-24 09:42:57.938793 7fb08d67a780 5 asok(0x1a1e000)
register_command version hook 0x1a120b0
-9> 2013-07-24 09:42:57.938804 7fb08d67a780 5 asok(0x1a1e000)
register_command git_version hook 0x1a120b0
-8> 2013-07-24 09:42:57.938815 7fb08d67a780 5 asok(0x1a1e000)
register_command help hook 0x1a130d0
-7> 2013-07-24 09:42:57.940462 7fb08971b700 5 asok(0x1a1e000)
entry start
-6> 2013-07-24 09:42:57.943200 7fb08d67a780 1 --
10.255.0.30:6789/0 learned my addr 10.255.0.30:6789/0
-5> 2013-07-24 09:42:57.943227 7fb08d67a780 1
accepter.accepter.bind my_inst.addr is 10.255.0.30:6789/0 need_addr=0
-4> 2013-07-24 09:42:57.943265 7fb08d67a780 5 adding auth
protocol: cephx
-3> 2013-07-24 09:42:57.943279 7fb08d67a780 5 adding auth
protocol: cephx
-2> 2013-07-24 09:42:57.943330 7fb08d67a780 1
mon.ceph3@-1(probing) e1 preinit fsid
97e515bb-d334-4fa7-8b53-7d85615809fd
-1> 2013-07-24 09:42:57.963915 7fb08d67a780 4
mon.ceph3@-1(probing).mds e228053 new map
0> 2013-07-24 09:42:57.966551 7fb08d67a780 -1 mon/OSDMonitor.cc:
In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread
7fb08d67a780 time 2013-07-24 09:42:57.964379
mon/OSDMonitor.cc: 167: FAILED assert(latest_bl.length() != 0)
ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)
1: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77]
2: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b]
3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617]
4: (Monitor::init_paxos()+0xf5) [0x48e7d5]
5: (Monitor::preinit()+0x6ac) [0x4a4e6c]
6: (main()+0x1c19) [0x4835c9]
7: (__libc_start_main()+0xed) [0x7fb08b8d576d]
8: /usr/bin/ceph-mon() [0x485eed]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
0/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mon.ceph3.log
--- end dump of recent events ---
2013-07-24 09:42:57.973296 7fb08d67a780 -1 *** Caught signal (Aborted) **
in thread 7fb08d67a780
ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)
1: /usr/bin/ceph-mon() [0x597cfa]
2: (()+0xfcb0) [0x7fb08cd14cb0]
3: (gsignal()+0x35) [0x7fb08b8ea425]
4: (abort()+0x17b) [0x7fb08b8edb8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fb08c23c69d]
6: (()+0xb5846) [0x7fb08c23a846]
7: (()+0xb5873) [0x7fb08c23a873]
8: (()+0xb596e) [0x7fb08c23a96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1df) [0x64ffaf]
10: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77]
11: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b]
12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617]
13: (Monitor::init_paxos()+0xf5) [0x48e7d5]
14: (Monitor::preinit()+0x6ac) [0x4a4e6c]
15: (main()+0x1c19) [0x4835c9]
16: (__libc_start_main()+0xed) [0x7fb08b8d576d]
17: /usr/bin/ceph-mon() [0x485eed]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- begin dump of recent events ---
0> 2013-07-24 09:42:57.973296 7fb08d67a780 -1 *** Caught signal
(Aborted) **
in thread 7fb08d67a780
ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)
1: /usr/bin/ceph-mon() [0x597cfa]
2: (()+0xfcb0) [0x7fb08cd14cb0]
3: (gsignal()+0x35) [0x7fb08b8ea425]
4: (abort()+0x17b) [0x7fb08b8edb8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fb08c23c69d]
6: (()+0xb5846) [0x7fb08c23a846]
7: (()+0xb5873) [0x7fb08c23a873]
8: (()+0xb596e) [0x7fb08c23a96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1df) [0x64ffaf]
10: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x507c77]
11: (PaxosService::refresh(bool*)+0x19b) [0x4ede7b]
12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48e617]
13: (Monitor::init_paxos()+0xf5) [0x48e7d5]
14: (Monitor::preinit()+0x6ac) [0x4a4e6c]
15: (main()+0x1c19) [0x4835c9]
16: (__libc_start_main()+0xed) [0x7fb08b8d576d]
17: /usr/bin/ceph-mon() [0x485eed]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
0/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mon.ceph3.log
--- end dump of recent events ---
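For anyone triaging a log like the mon dump above, the failed assertion is the line that matters most, and it is repeated several times. Below is a minimal Python sketch that pulls the unique "FAILED assert" lines out of a log. The function name and the regular expression are my own assumptions based on the line format shown above; they are not part of Ceph.

```python
import re

# Matches lines of the form seen in the log above, e.g.
#   mon/OSDMonitor.cc: 167: FAILED assert(latest_bl.length() != 0)
# The pattern is an assumption derived from this one log format.
ASSERT_RE = re.compile(
    r"^(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<cond>.*)\)\s*$"
)


def find_failed_asserts(log_text):
    """Return unique (file, line, condition) tuples in order of first appearance."""
    seen = []
    for raw in log_text.splitlines():
        match = ASSERT_RE.match(raw.strip())
        if match is None:
            continue
        entry = (match.group("file"), int(match.group("line")), match.group("cond"))
        if entry not in seen:  # the same assert is logged more than once
            seen.append(entry)
    return seen


if __name__ == "__main__":
    import sys

    # Usage: python find_asserts.py /var/log/ceph/ceph-mon.ceph3.log
    with open(sys.argv[1]) as handle:
        for source_file, line_no, cond in find_failed_asserts(handle.read()):
            print(f"{source_file}:{line_no}: assert({cond})")
```

This only recognises asserts that sit on a single line, as they do in the dump above.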
osd:
--- begin dump of recent events ---
0> 2013-07-24 09:21:34.087645 7f3c2fa7d700 -1 *** Caught signal
(Aborted) **
in thread 7f3c2fa7d700
ceph version 0.61.6 (59ddece17e36fef69ecf40e239aeffad33c9db35)
1: /usr/bin/ceph-osd() [0x79219a]
2: (()+0xfcb0) [0x7f3c40aaacb0]
3: (gsignal()+0x35) [0x7f3c3f263425]
4: (abort()+0x17b) [0x7f3c3f266b8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f3c3fbb569d]
6: (()+0xb5846) [0x7f3c3fbb3846]
7: (()+0xb5873) [0x7f3c3fbb3873]
8: (()+0xb596e) [0x7f3c3fbb396e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1df) [0x84303f]
10: (OSDService::get_map(unsigned int)+0x428) [0x630c88]
11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>,
std::less<boost::intrusive_ptr<PG> >,
std::allocator<boost::intrusive_ptr<PG> > >*)+0x11d) [0x6327bd]
12: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> >
const&, ThreadPool::TPHandle&)+0x244) [0x632f14]
13: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> >
const&, ThreadPool::TPHandle&)+0x12) [0x66e2d2]
14: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x838476]
15: (ThreadPool::WorkThread::entry()+0x10) [0x83a2a0]
16: (()+0x7e9a) [0x7f3c40aa2e9a]
17: (clone()+0x6d) [0x7f3c3f320ccd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
0/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.6.log
--- end dump of recent events ---
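Since the mail client wrapped some of the backtrace frames over several lines, here is a rough Python sketch that joins them back together. It assumes every frame starts with "N: " and ends with a bracketed hex address, as in the dumps above. The function name and the patterns are hypothetical helpers of mine, not anything shipped with Ceph.

```python
import re

# A frame starts with "<number>: " and ends with "[0x<hex>]".
# Both patterns are assumptions based on the traces in this mail.
FRAME_START = re.compile(r"^(\d+): (.*)$")
FRAME_END = re.compile(r"\[0x[0-9a-f]+\]$")


def parse_frames(lines):
    """Return (index, text) tuples, re-joining frames wrapped over several lines."""
    frames = []
    pending = None  # [index, text] of a frame still waiting for its address
    for raw in lines:
        line = raw.strip()
        start = FRAME_START.match(line)
        if start is not None:
            # A new frame begins; an unfinished one (if any) is dropped.
            pending = [int(start.group(1)), start.group(2)]
        elif pending is not None:
            pending[1] += " " + line  # continuation of a wrapped frame
        else:
            continue  # not part of a backtrace
        if FRAME_END.search(pending[1]):
            frames.append((pending[0], pending[1]))
            pending = None
    return frames
```

Feeding it the lines of one trace gives back one entry per frame, which makes it easier to compare the mon and osd backtraces side by side.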
On 2013-07-24 06:47, Sage Weil wrote:
There was a problem with the monitor daemons in v0.61.5 that would
prevent them from restarting after some period of time. This release
fixes the bug and works around the issue to allow affected monitors to
restart.
All v0.61.5 users are strongly recommended to upgrade.
Thanks everyone who helped track the problem down!
Notable changes:
* mon: record latest full osdmap
* mon: work around previous bug in which latest full osdmap was not
recorded
* mon: avoid scrub while paxos is updating
For more information please see the complete release notes:
* http://ceph.com/docs/master/release-notes/#v0-61-6-cuttlefish
You can get v0.61.6 from the usual locations:
* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.com/download/ceph-0.61.6.tar.gz
* For Debian/Ubuntu packages, see
http://ceph.com/docs/master/install/debian
* For RPMs, see http://ceph.com/docs/master/install/rpm
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com