Mon won't start, possibly due to corrupt disk?

greg@xxxxxxxxxxx (Gregory Farnum) · Tue, 22 Jul 2014 17:06:10 -0700

On Thu, Jul 17, 2014 at 1:39 PM, Lincoln Bryant <lincolnb at uchicago.edu> wrote:
> Hi all,
>
> I tried restarting my mon today, but I find that it no longer starts. Whenever I try to fire up the mon, I get errors of this nature:
>
>    -3> 2014-07-17 15:12:32.738510 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos
>    -2> 2014-07-17 15:12:32.738526 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos version 1537 keys ver 0 latest 0
>    -1> 2014-07-17 15:12:32.738532 7f25b0921780 10 mon.a@-1(probing).auth v1537 update_from_paxos key server version 0
>     0> 2014-07-17 15:12:32.739836 7f25b0921780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f25b0921780 time 2014-07-17 15:12:32.738549
> mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
>
> After a conversation with Greg on IRC, it seems that the on-disk state is corrupted. This appears to be CephX-related, although we do not have CephX enabled on this cluster.
>
> At Greg's request, I've attached the logs to this mail to hopefully ferret out what exactly is corrupted. I've set debug {mon,paxos,auth,keyvaluestore} to 20 in ceph.conf.
>
> I'm hoping to be able to recover -- unfortunately we've made the mistake of only deploying a single mon for this cluster, and there is some data I'd like to preserve.

This turned out to be a real bug, affecting anybody who was running
without ever having generated cephx keys (simply turning off cephx
would not have triggered it). It was tracked at
http://tracker.ceph.com/issues/8851 and is now resolved in the tree;
the fix should be in the next Firefly point release.

Thanks very much, everybody!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com