Thanks Greg.

Just for posterity, "ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0" did the trick and we're back to HEALTH_OK.

Cheers,
Lincoln Bryant

On Jul 18, 2014, at 4:15 PM, Gregory Farnum wrote:

> Hmm, this log is just leaving me with more questions. Could you tar up
> the "/var/lib/ceph/mon/store.db" (substitute actual mon store path as
> necessary) and upload it for me? (you can use ceph-post-file to put it
> on our servers if you prefer.) Just from the log I don't have a great
> idea of what's gone wrong, but you might find that
> ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0
> helps. (To be perfectly honest I'm just copying that from a similar
> report in the tracker at http://tracker.ceph.com/issues/8851, but
> that's the approach I was planning on.)
>
> Nothing has changed in the monitor that should have caused issues, but
> with two reports I'd like to at least see if we can do something to be
> a little more robust in the face of corruption!
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
> On Thu, Jul 17, 2014 at 1:39 PM, Lincoln Bryant <lincolnb at uchicago.edu> wrote:
>> Hi all,
>>
>> I tried restarting my mon today, but I find that it no longer starts.
>> Whenever I try to fire up the mon, I get errors of this nature:
>>
>>     -3> 2014-07-17 15:12:32.738510 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos
>>     -2> 2014-07-17 15:12:32.738526 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos version 1537 keys ver 0 latest 0
>>     -1> 2014-07-17 15:12:32.738532 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos key server version 0
>>      0> 2014-07-17 15:12:32.739836 7f25b0921780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f25b0921780 time 2014-07-17 15:12:32.738549
>> mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
>>
>> After having a conversation with Greg in IRC, it seems that the disk state is corrupted. This seems to be CephX related, although we do not have CephX enabled on this cluster.
>>
>> At Greg's request, I've attached the logs in this mail to hopefully squirrel out what exactly is corrupted. I've set debug {mon,paxos,auth,keyvaluestore} to 20 in ceph.conf.
>>
>> I'm hoping to be able to recover -- unfortunately we've made the mistake of only deploying a single mon for this cluster, and there is some data I'd like to preserve.
>>
>> Thanks for any help,
>> Lincoln Bryant
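
For anyone finding this thread later: the fix reported at the top can be wrapped in a small shell helper that first tars up the mon store (as Greg asked for) and only then applies the ceph-kvstore-tool command. This is a rough sketch, not an official procedure. The function names, the DRY_RUN variable, and the backup path are hypothetical; the store path is the placeholder used in this thread; and it assumes the mon daemon is not running (in this case it would not start anyway).

```shell
#!/bin/sh
# Sketch only: back up a monitor store, then print or run the reset command
# quoted in this thread. run(), reset_auth_last_committed() and DRY_RUN are
# hypothetical names introduced for illustration.

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1: mon store path (substitute your actual path), $2: backup tarball path.
# Assumes the mon daemon is stopped before this is called.
reset_auth_last_committed() {
    store=$1
    backup=$2
    # Keep a copy of the store before touching it; stop if the backup fails.
    run tar czf "$backup" -C "$(dirname "$store")" "$(basename "$store")" || return 1
    # The command that worked in this thread.
    run ceph-kvstore-tool "$store" set auth last_committed ver 0
}
```

Calling `reset_auth_last_committed /var/lib/ceph/mon/store.db /tmp/mon-store.tar.gz` with the default DRY_RUN=1 only prints the two commands, so they can be reviewed first; setting DRY_RUN=0 actually runs them.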