Keep in mind that this has thrown out all the auth info in your
cluster, so if you ever do enable cephx you'll need to re-assign all
the keys. You might also run into some other strangeness down the
line that I haven't foreseen.
In the meantime, I've forwarded things on to our full-time monitor guy
(on vacation at the moment) to try and identify any possible issues
that might have led to this. I'm thinking maybe there's a conflict for
users who are running without cephx (I haven't seen anybody without it
for a long while) but do a bunch of entity creates or something
similar.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Fri, Jul 18, 2014 at 3:37 PM, Lincoln Bryant <lincolnb at uchicago.edu> wrote:
> Thanks Greg. Just for posterity, "ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0" did the trick and we're back to HEALTH_OK.
>
> Cheers,
> Lincoln Bryant
>
> On Jul 18, 2014, at 4:15 PM, Gregory Farnum wrote:
>
>> Hmm, this log is just leaving me with more questions. Could you tar up
>> the "/var/lib/ceph/mon/store.db" (substitute actual mon store path as
>> necessary) and upload it for me? (You can use ceph-post-file to put it
>> on our servers if you prefer.) Just from the log I don't have a great
>> idea of what's gone wrong, but you might find that
>> ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0
>> helps. (To be perfectly honest I'm just copying that from a similar
>> report in the tracker at http://tracker.ceph.com/issues/8851, but
>> that's the approach I was planning on.)
>>
>> Nothing has changed in the monitor that should have caused issues, but
>> with two reports I'd like to at least see if we can do something to be
>> a little more robust in the face of corruption!
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>> On Thu, Jul 17, 2014 at 1:39 PM, Lincoln Bryant <lincolnb at uchicago.edu> wrote:
>>> Hi all,
>>>
>>> I tried restarting my mon today, but I find that it no longer starts. Whenever I try to fire up the mon, I get errors of this nature:
>>>
>>> -3> 2014-07-17 15:12:32.738510 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos
>>> -2> 2014-07-17 15:12:32.738526 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos version 1537 keys ver 0 latest 0
>>> -1> 2014-07-17 15:12:32.738532 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos key server version 0
>>> 0> 2014-07-17 15:12:32.739836 7f25b0921780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f25b0921780 time 2014-07-17 15:12:32.738549
>>> mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
>>>
>>> After having a conversation with Greg in IRC, it seems that the disk state is corrupted. This seems to be CephX related, although we do not have CephX enabled on this cluster.
>>>
>>> At Greg's request, I've attached the logs in this mail to hopefully squirrel out what exactly is corrupted. I've set debug {mon,paxos,auth,keyvaluestore} to 20 in ceph.conf.
>>>
>>> I'm hoping to be able to recover -- unfortunately we've made the mistake of only deploying a single mon for this cluster, and there is some data I'd like to preserve.
>>>
>>> Thanks for any help,
>>> Lincoln Bryant
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users at lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
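
For reference, the two steps discussed in this thread (archive the mon
store, then set the auth last_committed key to version 0) could be
scripted roughly as below. This is only a sketch under a few
assumptions: the names MON_STORE, BACKUP and RUN, the backup location,
and the dry-run wrapper are illustrative and not part of Ceph; the
store path is the placeholder used above and needs to be replaced with
the real one; and the monitor is assumed to be stopped. By default it
only prints the commands.

```shell
#!/bin/sh
# Sketch only: prints the commands by default; set RUN=1 to execute them.
# MON_STORE, BACKUP and RUN are illustrative names, not anything Ceph defines.
# Paths containing spaces are not handled.
MON_STORE="${MON_STORE:-/var/lib/ceph/mon/store.db}"  # substitute the real mon store path
BACKUP="${BACKUP:-/root/mon-store-backup.tar.gz}"     # assumed backup location

# Build the command that archives the store directory so the edit can be undone.
backup_cmd() {
    printf 'tar -czf %s -C %s %s\n' "$2" "$(dirname "$1")" "$(basename "$1")"
}

# Build the command quoted in the thread: set auth last_committed to version 0.
reset_cmd() {
    printf 'ceph-kvstore-tool %s set auth last_committed ver 0\n' "$1"
}

# Print a step, and execute it only when RUN=1.
run_step() {
    echo "+ $1"
    if [ "${RUN:-0}" = "1" ]; then
        sh -c "$1"
    fi
}

run_step "$(backup_cmd "$MON_STORE" "$BACKUP")"
run_step "$(reset_cmd "$MON_STORE")"
```

Running it with RUN=1 executes the commands instead of just printing
them. Taking the archive first means the original store can be put back
if the reset makes things worse, and the same tarball is what would be
uploaded with ceph-post-file.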