Mon won't start, possibly due to corrupt disk?

greg@xxxxxxxxxxx (Gregory Farnum) · Fri, 18 Jul 2014 15:43:45 -0700

Keep in mind that this reset has thrown out all the auth info in your
cluster, so if you ever do enable cephx you'll need to re-assign all
the keys. You might also run into some other strangeness down the line
that I haven't foreseen.

In the meantime, I've forwarded things on to our full-time monitor
guy (on vacation at the moment) to try to identify any possible
issues that might have led to this. I'm thinking maybe there's a
conflict for users who are running without cephx (I haven't seen
anybody without it for a long while) but do a bunch of entity creates
or something similar.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Fri, Jul 18, 2014 at 3:37 PM, Lincoln Bryant <lincolnb at uchicago.edu> wrote:
> Thanks Greg. Just for posterity, "ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0" did the trick and we're back to HEALTH_OK.
>
> Cheers,
> Lincoln Bryant
>
> On Jul 18, 2014, at 4:15 PM, Gregory Farnum wrote:
>
>> Hmm, this log is just leaving me with more questions. Could you tar up
>> the "/var/lib/ceph/mon/store.db" (substitute actual mon store path as
>> necessary) and upload it for me? (you can use ceph-post-file to put it
>> on our servers if you prefer.) Just from the log I don't have a great
>> idea of what's gone wrong, but you might find that
>> ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0
>> helps. (To be perfectly honest I'm just copying that from a similar
>> report in the tracker at http://tracker.ceph.com/issues/8851, but
>> that's the approach I was planning on.)
>>
>> Nothing has changed in the monitor that should have caused issues, but
>> with two reports I'd like to at least see if we can do something to be
>> a little more robust in the face of corruption!
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>> On Thu, Jul 17, 2014 at 1:39 PM, Lincoln Bryant <lincolnb at uchicago.edu> wrote:
>>> Hi all,
>>>
>>> I tried restarting my mon today, but I find that it no longer starts. Whenever I try to fire up the mon, I get errors of this nature:
>>>
>>>   -3> 2014-07-17 15:12:32.738510 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos
>>>   -2> 2014-07-17 15:12:32.738526 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos version 1537 keys ver 0 latest 0
>>>   -1> 2014-07-17 15:12:32.738532 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos key server version 0
>>>    0> 2014-07-17 15:12:32.739836 7f25b0921780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f25b0921780 time 2014-07-17 15:12:32.738549
>>> mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
>>>
>>> After having a conversation with Greg in IRC, it seems that the disk state is corrupted. This seems to be CephX related, although we do not have CephX enabled on this cluster.
>>>
>>> At Greg's request, I've attached the logs in this mail to hopefully squirrel out what exactly is corrupted. I've set debug {mon,paxos,auth,keyvaluestore} to 20 in ceph.conf.
>>>
>>> I'm hoping to be able to recover -- unfortunately we've made the mistake of only deploying a single mon for this cluster, and there is some data I'd like to preserve.
>>>
>>> Thanks for any help,
>>> Lincoln Bryant
>
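
For anyone finding this thread later, here is a minimal sketch of the reset quoted above, with a tarball of the store taken first so the original state can be restored or uploaded for debugging. It assumes the monitor daemon is stopped and that the store path has been substituted for your own. The function name, the backup file name, and the KVSTORE_TOOL override are hypothetical and exist only for this example; the ceph-kvstore-tool invocation itself is the one from the thread. As Greg notes, the reset discards the cluster's auth info, so treat this as a last resort rather than a recommended procedure.

```shell
#!/bin/sh
# Sketch only: back up a monitor store, then apply the reset from the thread.
# backup_and_reset_auth, mon-store-backup.tar.gz and KVSTORE_TOOL are
# hypothetical names chosen for this example, not part of Ceph.

backup_and_reset_auth() {
    mon_store="$1"    # e.g. /var/lib/ceph/mon/store.db (substitute the actual path)
    backup_dir="$2"   # directory that will receive the tarball
    tool="${KVSTORE_TOOL:-ceph-kvstore-tool}"

    # Refuse to continue if the store directory is missing.
    [ -d "$mon_store" ] || { echo "no such store: $mon_store" >&2; return 1; }
    mkdir -p "$backup_dir" || return 1

    # Archive the whole store directory before touching it.
    backup="$backup_dir/mon-store-backup.tar.gz"
    tar -czf "$backup" -C "$(dirname "$mon_store")" "$(basename "$mon_store")" || return 1
    echo "backup written to $backup"

    # The command from the thread, unchanged apart from the path variable.
    "$tool" "$mon_store" set auth last_committed ver 0
}
```

A call would look like `backup_and_reset_auth /var/lib/ceph/mon/store.db /root/mon-backup`, followed by starting the monitor again and checking `ceph health`. Setting KVSTORE_TOOL=echo beforehand prints the command instead of running it, which is a cheap way to check the path first.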