Re: Trying to rescue a lost quorum

On Thu, Feb 27, 2014 at 4:25 PM, Marc <mail@xxxxxxxxxx> wrote:
> Hi,
>
> I was handed a Ceph cluster that had just lost quorum due to 2/3 mons
> (b,c) running out of disk space (using up 15GB each). We were trying to
> rescue this cluster without service downtime. As such we freed up some
> space to keep mon b running a while longer, which succeeded, quorum
> restored (a,b), mon c remained offline. Even though we have freed up
> some space on mon c's disk also, that mon just won't start. Its log
> file says only
>
> ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60), process
> ceph-mon, pid 27846
>
> and that's all she wrote. Even when starting ceph-mon with -d, mind you.
>
> So we had a cluster with 2/3 mons up and wanted to add another mon since
> it was only a matter of time until mon b failed again due to disk space.
>
> As such I added mon.g to the cluster, which took a long while to sync,
> but now reports running.
>
> Then mon.h got added for the same reason. mon.h fails to start much the
> same as mon.c does.
>
> Still that should leave us with 3/5 mons up. However running "ceph
> daemon mon.{g,h} mon_status" on the respective node also blocks. The
> only output we get from those are fault messages.
>
> Ok, so now mon.g apparently crashed:
>
> 2014-02-28 00:11:48.861263 7f4728042700 -1 mon/Monitor.cc: In function
> 'void Monitor::sync_timeout(entity_inst_t&)' thread 7f4728042700 time
> 2014-02-28 00:11:48.782305 mon/Monitor.cc: 1099: FAILED
> assert(sync_state == SYNC_STATE_CHUNKS)
>
> ... and now blocks trying to start much like c and h.
>
> Long story short: is it possible to add .61.9 mons to a cluster running
> .61.2 on the 2 alive mons and all the osds? I'm guessing this is the
> last shot at trying to rescue the cluster without downtime.

That should be fine, and is likely (though not guaranteed) to resolve
your sync issues. It's pretty unfortunate that you're that far behind
on the point releases, though: they fixed a whole lot of sync issues
and related things, and you might need to upgrade the existing
monitors too in order to get the fixes you need... :/
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
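
A note on the "3/5 mons up" count in the quoted message: monitors form
a quorum only when a strict majority of the monitors listed in the
monmap are up, so every monitor added to the map raises the bar even
if it never starts. The sketch below is only an illustration of that
arithmetic. The has_quorum helper is hypothetical (not a Ceph API), and
it assumes that g and h were both actually added to the monmap, which
the message implies but does not show.

```python
def has_quorum(monmap, up):
    """Return True if a strict majority of the monitors in monmap are up.

    monmap: names of all monitors in the map, running or not.
    up:     names of the monitors currently running.
    Hypothetical helper for illustration only; not part of Ceph.
    """
    needed = len(monmap) // 2 + 1          # strict majority of the map
    alive = set(up) & set(monmap)          # ignore names not in the map
    return len(alive) >= needed


three = ["a", "b", "c"]
five = ["a", "b", "c", "g", "h"]           # assumes g and h joined the map

print(has_quorum(three, ["a"]))            # b and c out of disk space
print(has_quorum(three, ["a", "b"]))       # after freeing space on b
print(has_quorum(five, ["a", "b", "g"]))   # g synced and running
print(has_quorum(five, ["a", "b"]))        # g crashed, c and h never started
```

With five monitors in the map, a and b alone are no longer a majority,
so losing g costs the quorum that a and b held when the map had three.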
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



