On Thu, Jul 17, 2014 at 1:39 PM, Lincoln Bryant <lincolnb at uchicago.edu> wrote:
> Hi all,
>
> I tried restarting my mon today, but I find that it no longer starts. Whenever I try to fire up the mon, I get errors of this nature:
>
>     -3> 2014-07-17 15:12:32.738510 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos
>     -2> 2014-07-17 15:12:32.738526 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos version 1537 keys ver 0 latest 0
>     -1> 2014-07-17 15:12:32.738532 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos key server version 0
>      0> 2014-07-17 15:12:32.739836 7f25b0921780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f25b0921780 time 2014-07-17 15:12:32.738549
> mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
>
> After having a conversation with Greg in IRC, it seems that the disk state is corrupted. This seems to be CephX related, although we do not have CephX enabled on this cluster.
>
> At Greg's request, I've attached the logs in this mail to hopefully squirrel out what exactly is corrupted. I've set debug {mon,paxos,auth,keyvaluestore} to 20 in ceph.conf.
>
> I'm hoping to be able to recover -- unfortunately we've made the mistake of only deploying a single mon for this cluster, and there is some data I'd like to preserve.

This turned out to be a real bug, impacting anybody who was running without generating cephx keys (simply turning off cephx would not have done it). It was tracked at http://tracker.ceph.com/issues/8851 and is now resolved in the tree; the fix should be in the next Firefly point release.

Thanks very much, everybody!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com