On 04/18/2013 10:49 PM, Gregory Farnum wrote:
On Thu, Apr 18, 2013 at 2:46 PM, Joao Eduardo Luis
<joao.luis@xxxxxxxxxxx> wrote:
On 04/18/2013 10:36 PM, Gregory Farnum wrote:
(I believe your monitor crash is something else, Matthew, if that
hasn't been dealt with yet. Unfortunately all that log has is
messages, so it probably needs a bit more. Can you check it out, Joao?
The stack trace below is #3495, and Matthew is already testing the fix (as
per the tracker, so far so good, but we should know more in the next day
or so).
It appears to be a follower which ends up in propose_pending, which is
distinctly odd...)
I might be missing something, but what gave you that impression? That would
certainly be odd (to say the least!)
I could have just missed some message traffic (or misread what's
there), but there is a point where I think it's forwarding a command to
the leader, and the crash is in propose_pending. I like your answers
better. ;)
-Greg
There's definitely some command messages being forwarded, but AFAICT
they're being forwarded to the monitor, not by the monitor, which by
itself is a good omen towards the monitor being the leader :-)
In any case, nothing in the trace's code path indicates we could be a
peon, unless the monitor believed itself to be the leader. If you take
a closer look, you'll see that we come from 'handle_last()', which is
bound to happen only on the leader (we'll assert otherwise). For the
monitor to be receiving these messages it must mean the peons believe
him to be the leader -- or we have so many bugs going around that it's
just madness!
In all seriousness, when I was chasing after this bug, Matthew sent me
his logs with higher debug levels -- no craziness going around :-)
-Joao
Thanks for the bug report!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
On Mon, Apr 8, 2013 at 7:39 AM, Mike Dawson <mdawson@xxxxxxxxxxxxx> wrote:
Matthew,
I have seen the same behavior on 0.59. Ran through some troubleshooting with
Dan and Joao on March 21st and 22nd, but I haven't looked at it since then.
If you look at running processes, I believe you'll see an instance of
ceph-create-keys start each time you start a Monitor. So, if you restart the
monitor several times, you'll have several ceph-create-keys processes
piling up, essentially leaking processes. IIRC, the tmp files you see in
/etc/ceph correspond with the ceph-create-keys PID. Can you confirm that's
what you are seeing?
I haven't looked in a couple weeks, but I hope to start 0.60 later today.
- Mike
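The check Mike asks for can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names are made up, the process list is assumed to come from `ps -eo pid,args`, and the temp-file name pattern is inferred from the single example in Matthew's message (`ceph.client.admin.keyring.1008.tmp`).

```python
import re

# Assumed pattern, inferred from one file name quoted in the thread.
TMP_RE = re.compile(r"^ceph\.client\.admin\.keyring\.(\d+)\.tmp$")


def create_keys_pids(ps_output):
    """Return the PIDs of ceph-create-keys processes found in ps text."""
    pids = set()
    for line in ps_output.splitlines():
        parts = line.split(None, 1)  # "PID", "rest of the command line"
        if len(parts) == 2 and parts[0].isdigit() and "ceph-create-keys" in parts[1]:
            pids.add(int(parts[0]))
    return pids


def tmp_keyring_pids(filenames):
    """Return the numeric suffixes of keyring temp files in a directory listing."""
    pids = set()
    for name in filenames:
        m = TMP_RE.match(name)
        if m:
            pids.add(int(m.group(1)))
    return pids


def matching_pids(ps_output, filenames):
    """PIDs that appear both as a live process and in a temp-file name."""
    return sorted(create_keys_pids(ps_output) & tmp_keyring_pids(filenames))
```

On a monitor host the inputs could be gathered with `subprocess.check_output(["ps", "-eo", "pid,args"], text=True)` and `os.listdir("/etc/ceph")`. Temp files left behind by processes that have already exited would show up only in `tmp_keyring_pids`, not in the intersection.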
On 4/8/2013 12:43 AM, Matthew Roy wrote:
I'm seeing weird messages in my monitor logs that don't correlate to
admin activity:
2013-04-07 22:54:11.528871 7f2e9e6c8700 1 --
[2001:<something>::20]:6789/0 --> [2001:<something>::20]:0/1920 --
mon_command_ack([auth,get-or-create,client.admin,mon,allow *,osd,allow
*,mds,allow]=-13 access denied v134192) v1 -- ?+0 0x37bfc00 con
0x3716840
It's also writing out a bunch of empty files along the lines of
"ceph.client.admin.keyring.1008.tmp" in /etc/ceph/. Could this be related
to the mon trying to "Starting ceph-create-keys" when starting?
This could be the cause of, or just associated with, some general
instability of the monitor cluster. After increasing the logging level I
did catch one crash:
ceph version 0.60 (f26f7a39021dbf440c28d6375222e21c94fe8e5c)
1: /usr/bin/ceph-mon() [0x5834fa]
2: (()+0xfcb0) [0x7f4b03328cb0]
3: (gsignal()+0x35) [0x7f4b01efe425]
4: (abort()+0x17b) [0x7f4b01f01b8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f4b0285069d]
6: (()+0xb5846) [0x7f4b0284e846]
7: (()+0xb5873) [0x7f4b0284e873]
8: (()+0xb596e) [0x7f4b0284e96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1df) [0x636c8f]
10: (PaxosService::propose_pending()+0x46d) [0x4dee3d]
11: (MDSMonitor::tick()+0x1c62) [0x51cdd2]
12: (MDSMonitor::on_active()+0x1a) [0x512ada]
13: (PaxosService::_active()+0x31d) [0x4e067d]
14: (Context::complete(int)+0xa) [0x4b7b4a]
15: (finish_contexts(CephContext*, std::list<Context*,
std::allocator<Context*> >&, int)+0x95) [0x4ba5a5]
16: (Paxos::handle_last(MMonPaxos*)+0xbef) [0x4da92f]
17: (Paxos::dispatch(PaxosServiceMessage*)+0x26b) [0x4dad8b]
18: (Monitor::_ms_dispatch(Message*)+0x149f) [0x4b310f]
19: (Monitor::ms_dispatch(Message*)+0x32) [0x4c9d12]
20: (DispatchQueue::entry()+0x341) [0x698da1]
21: (DispatchQueue::DispatchThread::entry()+0xd) [0x626c5d]
22: (()+0x7e9a) [0x7f4b03320e9a]
23: (clone()+0x6d) [0x7f4b01fbbcbd]
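For reading the call path out of a trace like the one above, here is a small sketch. The function names are hypothetical, and the frame format is assumed to be exactly what appears in this message (`N: (symbol+0xoffset) [0xaddress]`, innermost frame first, with long frames wrapped onto a second line).

```python
import re

# Matches frames of the form "10: (Some::symbol(args)+0x46d) [0x4dee3d]".
FRAME_RE = re.compile(r"^(\d+):\s+\((.*)\+0x[0-9a-f]+\)\s+\[0x[0-9a-f]+\]$")


def join_wrapped(trace):
    """Re-join frames that the mail client wrapped onto a second line."""
    joined = []
    for line in trace.splitlines():
        line = line.strip()
        if re.match(r"^\d+:", line) or not joined:
            joined.append(line)
        else:
            joined[-1] += " " + line
    return joined


def parse_frames(trace):
    """Return [(index, symbol)] for frames that carry a symbol name."""
    frames = []
    for line in join_wrapped(trace):
        m = FRAME_RE.match(line)
        if m and m.group(2) not in ("", "()"):
            frames.append((int(m.group(1)), m.group(2)))
    return frames


def is_reached_via(frames, inner, outer):
    """True if a frame naming `inner` sits deeper in the stack than one naming `outer`."""
    inner_idx = [i for i, sym in frames if inner in sym]
    outer_idx = [i for i, sym in frames if outer in sym]
    return bool(inner_idx) and bool(outer_idx) and min(inner_idx) < max(outer_idx)
```

With the frames above, `is_reached_via(parse_frames(trace), "propose_pending", "handle_last")` checks whether the failing `propose_pending` call was reached from `Paxos::handle_last`.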
The complete log is at: http://goo.gl/UmNs3
Does anyone recognize what's going on?
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com