Mon won't start, possibly due to corrupt disk?

Thanks Greg. Just for posterity, "ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0" did the trick and we're back to HEALTH_OK.
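For anyone repeating this, a rough sketch of the sequence as shell functions follows. The ceph-kvstore-tool invocation is the one quoted above; taking a tarball of the store first is an added precaution, not something described in this thread. The function names (backup_store, reset_auth_version) and the backup location are hypothetical, and the sketch assumes the mon is stopped while the store is copied and modified.

```shell
#!/bin/sh
# Sketch only. Function names and paths are illustrative placeholders.

backup_store() {
    # $1 = mon store directory, $2 = directory that will hold the tarball.
    # Prints the tarball path on success.
    store="$1"
    dest="$2/mon-store-backup-$(date +%Y%m%d-%H%M%S).tar.gz"
    tar -czf "$dest" -C "$(dirname "$store")" "$(basename "$store")" || return 1
    echo "$dest"
}

reset_auth_version() {
    # $1 = mon store directory.
    # Same command as quoted in this thread; it modifies the store in place.
    ceph-kvstore-tool "$1" set auth last_committed ver 0
}
```

With the mon stopped, something like `backup_store /var/lib/ceph/mon/store.db /root && reset_auth_version /var/lib/ceph/mon/store.db` would run both steps (substitute the actual mon store path as necessary).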

Cheers,
Lincoln Bryant

On Jul 18, 2014, at 4:15 PM, Gregory Farnum wrote:

> Hmm, this log is just leaving me with more questions. Could you tar up
> the "/var/lib/ceph/mon/store.db" (substitute actual mon store path as
> necessary) and upload it for me? (you can use ceph-post-file to put it
> on our servers if you prefer.) Just from the log I don't have a great
> idea of what's gone wrong, but you might find that
> ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0
> helps. (To be perfectly honest I'm just copying that from a similar
> report in the tracker at http://tracker.ceph.com/issues/8851, but
> that's the approach I was planning on.)
> 
> Nothing has changed in the monitor that should have caused issues, but
> with two reports I'd like to at least see if we can do something to be
> a little more robust in the face of corruption!
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> On Thu, Jul 17, 2014 at 1:39 PM, Lincoln Bryant <lincolnb at uchicago.edu> wrote:
>> Hi all,
>> 
>> I tried restarting my mon today, but I find that it no longer starts. Whenever I try to fire up the mon, I get errors of this nature:
>> 
>>   -3> 2014-07-17 15:12:32.738510 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos
>>   -2> 2014-07-17 15:12:32.738526 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos version 1537 keys ver 0 latest 0
>>   -1> 2014-07-17 15:12:32.738532 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos key server version 0
>>    0> 2014-07-17 15:12:32.739836 7f25b0921780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f25b0921780 time 2014-07-17 15:12:32.738549
>> mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
>> 
>> After having a conversation with Greg in IRC, it seems that the disk state is corrupted. This seems to be CephX related, although we do not have CephX enabled on this cluster.
>> 
>> At Greg's request, I've attached the logs to this mail to hopefully ferret out what exactly is corrupted. I've set debug {mon,paxos,auth,keyvaluestore} to 20 in ceph.conf.
>> 
>> I'm hoping to be able to recover -- unfortunately we've made the mistake of only deploying a single mon for this cluster, and there is some data I'd like to preserve.
>> 
>> Thanks for any help,
>> Lincoln Bryant
>> 
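For reference, the debug levels mentioned in the original report (debug {mon,paxos,auth,keyvaluestore} set to 20) could be written in ceph.conf roughly as below. This is only a sketch: putting them under [mon] is an assumption, since the thread does not say which section was used.

```ini
# Assumption: settings placed in the [mon] section; the thread does not specify.
[mon]
    debug mon = 20
    debug paxos = 20
    debug auth = 20
    debug keyvaluestore = 20
```

The mon would need to be started again after the change for the more verbose output to appear in its log.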
