On 2013-07-25 15:21, Joao Eduardo Luis wrote:
On 07/25/2013 11:20 AM, peter@xxxxxxxxx wrote:
On 2013-07-25 12:08, Wido den Hollander wrote:
On 07/25/2013 12:01 PM, peter@xxxxxxxxx wrote:
On 2013-07-25 11:52, Wido den Hollander wrote:
On 07/25/2013 11:46 AM, peter@xxxxxxxxx wrote:
Any news on this? I'm not sure if you guys received the link to the log
and monitor files. One monitor and one OSD are still crashing with the
error below.
I think you are seeing this issue:
http://tracker.ceph.com/issues/5737
You can try with new packages from here:
http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-5737-cuttlefish/
That should resolve it.
Wido
Hi Wido,
This is the same issue I reported earlier with 0.61.5. I applied the
above package and the problem was solved. Then 0.61.6 was released with
a fix for this issue. I installed 0.61.6 and the issue is back on one of
my monitors and I have one osd crashing. So, it seems the bug is still
there in 0.61.6 or it is a new bug. It seems the guys from Inktank
haven't picked this up yet.
It has been picked up, Sage mentioned this yesterday on the dev list:
"This is fixed in the cuttlefish branch as of earlier this afternoon.
I've spent most of the day expanding the automated test suite to
include upgrade combinations to trigger this and *finally* figured out
that this particular problem seems to surface on clusters that
upgraded from bobtail -> cuttlefish but not clusters created on
cuttlefish.
If you've run into this issue, please use the cuttlefish branch build
for now. We will have a release out in the next day or so that
includes this and a few other pending fixes.
I'm sorry we missed this one! The upgrade test matrix I've been
working on today should catch this type of issue in the future."
Wido
Regards,
We created this cluster on cuttlefish and not on bobtail, so it doesn't
apply. I'm not sure whether it is clear what I am trying to say or
whether I'm missing something here, but I still see this issue either
way :-)
I will check out the dev list also, but perhaps someone from Inktank can
at least look at the files I provided.
Peter,
We did take a look at your files (thanks a lot btw!), and as of last
night's patches (which are now on the cuttlefish branch), your store
worked just fine.
As Sage mentioned on ceph-devel, one of the issues would only happen
on a bobtail -> cuttlefish cluster. That is not your issue though. I
believe Sage meant the FAILED assert(latest_full > 0) -- i.e., the one
reported on #5737.
Your issue, however, was caused by a bug in a patch meant to fix #5704.
It caused an on-disk key to be updated erroneously with a value for a
version that did not yet exist at the time update_from_paxos() was
called. In a nutshell, one of the latest patches (see
115468c73f121653eec2efc030d5ba998d834e43) fixed that issue and another
patch (see 27f31895664fa7f10c1617d486f2a6ece0f97091) worked around it.
A point release should come out soon, but in the meantime the
cuttlefish branch should be safe to use.
If you run into any other issues, please let us know.
-Joao
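
As a rough illustration of the failure mode described above, the toy model below (plain Python, not Ceph code) shows how a pointer key that is advanced to a version whose data has not been written yet produces an empty read on the next refresh. The key names `latest_full` and `full_<n>` and both helper functions are hypothetical and exist only for this sketch; the check is only loosely analogous to a non-empty assertion such as the one in the monitor log.

```python
# Toy model of a key/value store in which a "latest" pointer must only
# reference versions whose data already exists. All names are hypothetical.

def read_latest(store):
    """Return the blob that the 'latest_full' pointer refers to.

    An empty read means the pointer has run ahead of the data, which is
    treated as a fatal inconsistency here.
    """
    version = store["latest_full"]
    blob = store.get(f"full_{version}", b"")
    if len(blob) == 0:
        raise AssertionError(f"latest_full={version} points at a missing version")
    return blob


def safe_commit(store, version, blob):
    """Write the data first, then advance the pointer."""
    store[f"full_{version}"] = blob
    store["latest_full"] = version


store = {}
safe_commit(store, 1, b"map-v1")
assert read_latest(store) == b"map-v1"

# Buggy update: the pointer is advanced to version 2 before its data exists.
store["latest_full"] = 2
try:
    read_latest(store)
    pointer_ok = True
except AssertionError:
    pointer_ok = False
```

Writing the missing version afterwards (as `safe_commit` does) makes the read succeed again, which is the general shape of a fix that repairs the store rather than the reader.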
Hi Joao,
I installed the packages from that branch but I still see the same
crashes:
root@ceph3:~/ceph# ceph-mon -v
ceph version 0.61.6-1-g28720b0
(28720b0b4d55ef98f3b7d0855b18339e75f759e3)
root@ceph3:~/ceph# ceph-osd -v
ceph version 0.61.6-1-g28720b0
(28720b0b4d55ef98f3b7d0855b18339e75f759e3)
Both the monitor and one of the three OSDs (on that host) still crash on
startup. I must be doing something wrong if it works for you...
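
The version string printed above appears to follow the `git describe` layout (tag, number of commits since the tag, then `g` plus an abbreviated commit hash). Assuming that holds, a small parser like the one below can make it easier to compare what is actually installed on each host. The function name and the returned field names are made up for this sketch.

```python
import re

# Assumes a `git describe`-style string:
#   <tag>-<commits since tag>-g<short hash>, followed by the full hash in parentheses.
VERSION_RE = re.compile(
    r"ceph version (?P<tag>\d+(?:\.\d+)*)"
    r"(?:-(?P<commits>\d+)-g(?P<short>[0-9a-f]+))?"
    r"\s*\((?P<sha>[0-9a-f]{40})\)"
)


def parse_ceph_version(text):
    """Return tag, commits since tag, and full hash, or None if no match."""
    m = VERSION_RE.search(text)
    if m is None:
        return None
    return {
        "tag": m.group("tag"),
        "commits_since_tag": int(m.group("commits") or 0),
        "sha": m.group("sha"),
    }


# Output of `ceph-mon -v` as pasted above (wrapped over two lines).
output = "ceph version 0.61.6-1-g28720b0\n(28720b0b4d55ef98f3b7d0855b18339e75f759e3)"
info = parse_ceph_version(output)
```

If the layout assumption is right, `commits_since_tag` tells how far the installed build is past the tagged release, which can then be compared against the commits named earlier in the thread using a checkout of the source tree.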
OSD:
--- begin dump of recent events ---
0> 2013-07-25 15:35:32.563404 7f8172241700 -1 *** Caught signal
(Aborted) **
in thread 7f8172241700
ceph version 0.61.6-1-g28720b0
(28720b0b4d55ef98f3b7d0855b18339e75f759e3)
1: /usr/bin/ceph-osd() [0x79430a]
2: (()+0xfcb0) [0x7f81833e1cb0]
3: (gsignal()+0x35) [0x7f81814af425]
4: (abort()+0x17b) [0x7f81814b2b8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f8181e0169d]
6: (()+0xb5846) [0x7f8181dff846]
7: (()+0xb5873) [0x7f8181dff873]
8: (()+0xb596e) [0x7f8181dff96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1df) [0x84618f]
10: (OSDService::get_map(unsigned int)+0x428) [0x63bc48]
11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>,
std::less<boost::intrusive_ptr<PG> >,
std::allocator<boost::intrusive_ptr<PG> > >*)+0x11d) [0x63d77d]
12: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> >
const&, ThreadPool::TPHandle&)+0x244) [0x63ded4]
13: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> >
const&, ThreadPool::TPHandle&)+0x12) [0x678c52]
14: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x83b5c6]
15: (ThreadPool::WorkThread::entry()+0x10) [0x83d3f0]
16: (()+0x7e9a) [0x7f81833d9e9a]
17: (clone()+0x6d) [0x7f818156cccd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
0/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.6.log
--- end dump of recent events ---
MON:
--- end dump of recent events ---
2013-07-25 15:34:14.171423 7fca27f0e780 0 ceph version
0.61.6-1-g28720b0 (28720b0b4d55ef98f3b7d0855b18339e75f759e3), process
ceph-mon, pid 18182
2013-07-25 15:34:14.181109 7fca27f0e780 1 mon.ceph3@-1(probing) e1
preinit fsid 97e515bb-d334-4fa7-8b53-7d85615809fd
2013-07-25 15:34:14.203909 7fca27f0e780 -1 mon/OSDMonitor.cc: In
function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread
7fca27f0e780 time 2013-07-25 15:34:14.201568
mon/OSDMonitor.cc: 167: FAILED assert(latest_bl.length() != 0)
ceph version 0.61.6-1-g28720b0
(28720b0b4d55ef98f3b7d0855b18339e75f759e3)
1: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x512b07]
2: (PaxosService::refresh(bool*)+0x19b) [0x4f925b]
3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x49a5b7]
4: (Monitor::init_paxos()+0xe5) [0x49a755]
5: (Monitor::preinit()+0x6ac) [0x4b0a6c]
6: (main()+0x1c19) [0x48f799]
7: (__libc_start_main()+0xed) [0x7fca25fc276d]
8: /usr/bin/ceph-mon() [0x4920bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- begin dump of recent events ---
-25> 2013-07-25 15:34:14.169227 7fca27f0e780 5 asok(0x18fe000)
register_command perfcounters_dump hook 0x18f3010
-24> 2013-07-25 15:34:14.169262 7fca27f0e780 5 asok(0x18fe000)
register_command 1 hook 0x18f3010
-23> 2013-07-25 15:34:14.169267 7fca27f0e780 5 asok(0x18fe000)
register_command perf dump hook 0x18f3010
-22> 2013-07-25 15:34:14.169275 7fca27f0e780 5 asok(0x18fe000)
register_command perfcounters_schema hook 0x18f3010
-21> 2013-07-25 15:34:14.169278 7fca27f0e780 5 asok(0x18fe000)
register_command 2 hook 0x18f3010
-20> 2013-07-25 15:34:14.169280 7fca27f0e780 5 asok(0x18fe000)
register_command perf schema hook 0x18f3010
-19> 2013-07-25 15:34:14.169284 7fca27f0e780 5 asok(0x18fe000)
register_command config show hook 0x18f3010
-18> 2013-07-25 15:34:14.169287 7fca27f0e780 5 asok(0x18fe000)
register_command config set hook 0x18f3010
-17> 2013-07-25 15:34:14.169290 7fca27f0e780 5 asok(0x18fe000)
register_command log flush hook 0x18f3010
-16> 2013-07-25 15:34:14.169292 7fca27f0e780 5 asok(0x18fe000)
register_command log dump hook 0x18f3010
-15> 2013-07-25 15:34:14.169295 7fca27f0e780 5 asok(0x18fe000)
register_command log reopen hook 0x18f3010
-14> 2013-07-25 15:34:14.171423 7fca27f0e780 0 ceph version
0.61.6-1-g28720b0 (28720b0b4d55ef98f3b7d0855b18339e75f759e3), process
ceph-mon, pid 18182
-13> 2013-07-25 15:34:14.171523 7fca27f0e780 5 asok(0x18fe000) init
/var/run/ceph/ceph-mon.ceph3.asok
-12> 2013-07-25 15:34:14.171540 7fca27f0e780 5 asok(0x18fe000)
bind_and_listen /var/run/ceph/ceph-mon.ceph3.asok
-11> 2013-07-25 15:34:14.171571 7fca27f0e780 5 asok(0x18fe000)
register_command 0 hook 0x18f2030
-10> 2013-07-25 15:34:14.171576 7fca27f0e780 5 asok(0x18fe000)
register_command version hook 0x18f2030
-9> 2013-07-25 15:34:14.171580 7fca27f0e780 5 asok(0x18fe000)
register_command git_version hook 0x18f2030
-8> 2013-07-25 15:34:14.171583 7fca27f0e780 5 asok(0x18fe000)
register_command help hook 0x18f3050
-7> 2013-07-25 15:34:14.173270 7fca24d85700 5 asok(0x18fe000)
entry start
-6> 2013-07-25 15:34:14.180955 7fca27f0e780 1 --
10.255.0.30:6789/0 learned my addr 10.255.0.30:6789/0
-5> 2013-07-25 15:34:14.180994 7fca27f0e780 1
accepter.accepter.bind my_inst.addr is 10.255.0.30:6789/0 need_addr=0
-4> 2013-07-25 15:34:14.181041 7fca27f0e780 5 adding auth
protocol: cephx
-3> 2013-07-25 15:34:14.181054 7fca27f0e780 5 adding auth
protocol: cephx
-2> 2013-07-25 15:34:14.181109 7fca27f0e780 1
mon.ceph3@-1(probing) e1 preinit fsid
97e515bb-d334-4fa7-8b53-7d85615809fd
-1> 2013-07-25 15:34:14.201153 7fca27f0e780 4
mon.ceph3@-1(probing).mds e228053 new map
0> 2013-07-25 15:34:14.203909 7fca27f0e780 -1 mon/OSDMonitor.cc:
In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread
7fca27f0e780 time 2013-07-25 15:34:14.201568
mon/OSDMonitor.cc: 167: FAILED assert(latest_bl.length() != 0)
ceph version 0.61.6-1-g28720b0
(28720b0b4d55ef98f3b7d0855b18339e75f759e3)
1: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x512b07]
2: (PaxosService::refresh(bool*)+0x19b) [0x4f925b]
3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x49a5b7]
4: (Monitor::init_paxos()+0xe5) [0x49a755]
5: (Monitor::preinit()+0x6ac) [0x4b0a6c]
6: (main()+0x1c19) [0x48f799]
7: (__libc_start_main()+0xed) [0x7fca25fc276d]
8: /usr/bin/ceph-mon() [0x4920bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
0/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mon.ceph3.log
--- end dump of recent events ---
2013-07-25 15:34:14.210161 7fca27f0e780 -1 *** Caught signal (Aborted)
**
in thread 7fca27f0e780
ceph version 0.61.6-1-g28720b0
(28720b0b4d55ef98f3b7d0855b18339e75f759e3)
1: /usr/bin/ceph-mon() [0x59f1ca]
2: (()+0xfcb0) [0x7fca27aeccb0]
3: (gsignal()+0x35) [0x7fca25fd7425]
4: (abort()+0x17b) [0x7fca25fdab8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fca2692969d]
6: (()+0xb5846) [0x7fca26927846]
7: (()+0xb5873) [0x7fca26927873]
8: (()+0xb596e) [0x7fca2692796e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1df) [0x65873f]
10: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x512b07]
11: (PaxosService::refresh(bool*)+0x19b) [0x4f925b]
12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x49a5b7]
13: (Monitor::init_paxos()+0xe5) [0x49a755]
14: (Monitor::preinit()+0x6ac) [0x4b0a6c]
15: (main()+0x1c19) [0x48f799]
16: (__libc_start_main()+0xed) [0x7fca25fc276d]
17: /usr/bin/ceph-mon() [0x4920bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- begin dump of recent events ---
0> 2013-07-25 15:34:14.210161 7fca27f0e780 -1 *** Caught signal
(Aborted) **
in thread 7fca27f0e780
ceph version 0.61.6-1-g28720b0
(28720b0b4d55ef98f3b7d0855b18339e75f759e3)
1: /usr/bin/ceph-mon() [0x59f1ca]
2: (()+0xfcb0) [0x7fca27aeccb0]
3: (gsignal()+0x35) [0x7fca25fd7425]
4: (abort()+0x17b) [0x7fca25fdab8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fca2692969d]
6: (()+0xb5846) [0x7fca26927846]
7: (()+0xb5873) [0x7fca26927873]
8: (()+0xb596e) [0x7fca2692796e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1df) [0x65873f]
10: (OSDMonitor::update_from_paxos(bool*)+0x29e7) [0x512b07]
11: (PaxosService::refresh(bool*)+0x19b) [0x4f925b]
12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x49a5b7]
13: (Monitor::init_paxos()+0xe5) [0x49a755]
14: (Monitor::preinit()+0x6ac) [0x4b0a6c]
15: (main()+0x1c19) [0x48f799]
16: (__libc_start_main()+0xed) [0x7fca25fc276d]
17: /usr/bin/ceph-mon() [0x4920bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
0/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mon.ceph3.log
--- end dump of recent events ---
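
Since the dumps above repeat the same backtrace several times, it can help to pull out just the assertion lines when comparing logs across daemons. This is a minimal sketch, assuming the assertion is logged on a single line in the `file: line: FAILED assert(expr)` form seen in the monitor log; the function name is hypothetical.

```python
import re

# Matches lines such as:
#   mon/OSDMonitor.cc: 167: FAILED assert(latest_bl.length() != 0)
ASSERT_RE = re.compile(
    r"^(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<expr>.*)\)\s*$"
)


def failed_asserts(lines):
    """Return unique (file, line, expression) tuples in first-seen order."""
    seen = []
    for raw in lines:
        m = ASSERT_RE.match(raw.strip())
        if m:
            hit = (m.group("file"), int(m.group("line")), m.group("expr"))
            if hit not in seen:
                seen.append(hit)
    return seen


sample = [
    "mon/OSDMonitor.cc: 167: FAILED assert(latest_bl.length() != 0)",
    "ceph version 0.61.6-1-g28720b0",
    "mon/OSDMonitor.cc: 167: FAILED assert(latest_bl.length() != 0)",
]
hits = failed_asserts(sample)
```

To run it against a real log, pass an open file object (for example the path given in the `log_file` line) instead of `sample`.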