Mon won't start, possibly due to corrupt disk?

greg@xxxxxxxxxxx (Gregory Farnum) · Fri, 18 Jul 2014 14:15:33 -0700

Hmm, this log is just leaving me with more questions. Could you tar up
the mon store at "/var/lib/ceph/mon/store.db" (substitute the actual mon
store path as necessary) and upload it for me? (You can use
ceph-post-file to put it on our servers if you prefer.) Just from the
log I don't have a great idea of what's gone wrong, but you might find
that running

    ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0

helps. (To be perfectly honest, I'm just copying that from a similar
report in the tracker at http://tracker.ceph.com/issues/8851, but
that's the approach I was planning on.)
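
Since the command above rewrites a key in the store, it is probably worth
archiving the store directory first; the same archive can then be uploaded.
The following is only a minimal sketch: the function name backup_mon_store,
the archive name, and the output directory are hypothetical, and it assumes
the mon is stopped while the copy is taken.

```shell
# Hypothetical helper: archive a mon store directory before modifying it.
# Arguments: $1 = store path (e.g. /var/lib/ceph/mon/store.db, substitute
#                 the actual mon store path), $2 = directory for the archive.
backup_mon_store() {
    store_path="$1"
    out_dir="$2"

    # Refuse to continue if the store path is not a directory.
    if [ ! -d "$store_path" ]; then
        echo "no such directory: $store_path" >&2
        return 1
    fi

    # Archive name is an assumption made for this sketch.
    archive="$out_dir/mon-store-backup.tar.gz"

    # Archive relative to the parent so the tarball holds "store.db/...".
    tar -czf "$archive" \
        -C "$(dirname "$store_path")" "$(basename "$store_path")" || return 1

    # Print the archive path so the caller can upload or keep it.
    echo "$archive"
}
```

It could be called as `backup_mon_store /var/lib/ceph/mon/store.db /root`
before running the ceph-kvstore-tool command, and the resulting tarball is
what would be handed to ceph-post-file.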

Nothing has changed in the monitor that should have caused issues, but
with two reports I'd like to at least see if we can do something to be
a little more robust in the face of corruption!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Thu, Jul 17, 2014 at 1:39 PM, Lincoln Bryant <lincolnb at uchicago.edu> wrote:
> Hi all,
>
> I tried restarting my mon today, but I find that it no longer starts. Whenever I try to fire up the mon, I get errors of this nature:
>
>    -3> 2014-07-17 15:12:32.738510 7f25b0921780 10 mon.a at -1(probing).auth v1537 update_from_paxos
>    -2> 2014-07-17 15:12:32.738526 7f25b0921780 10 mon.a at -1(probing).auth v1537 update_from_paxos version 1537 keys ver 0 latest 0
>    -1> 2014-07-17 15:12:32.738532 7f25b0921780 10 mon.a at -1(probing).auth v1537 update_from_paxos key server version 0
>     0> 2014-07-17 15:12:32.739836 7f25b0921780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f25b0921780 time 2014-07-17 15:12:32.738549
> mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
>
> After having a conversation with Greg in IRC, it seems that the disk state is corrupted. This seems to be CephX related, although we do not have CephX enabled on this cluster.
>
> At Greg's request, I've attached the logs to this mail to hopefully ferret out what exactly is corrupted. I've set debug {mon,paxos,auth,keyvaluestore} to 20 in ceph.conf.
>
> I'm hoping to be able to recover -- unfortunately we've made the mistake of only deploying a single mon for this cluster, and there is some data I'd like to preserve.
>
> Thanks for any help,
> Lincoln Bryant
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>