Hi,

I was handed a Ceph cluster that had just lost quorum because 2 of its 3 mons (b, c) ran out of disk space (each using up 15 GB). We are trying to rescue this cluster without service downtime. We freed up enough space to keep mon b running a while longer, which succeeded: quorum was restored (a, b), while mon c remained offline.

Even though we have freed up some space on mon c's disk as well, that mon just won't start. Its log file says only

    ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60), process ceph-mon, pid 27846

and that's all she wrote, even when starting ceph-mon with -d, mind you.

So we had a cluster with 2/3 mons up and wanted to add another mon, since it was only a matter of time until mon b failed again due to disk space. I added mon.g to the cluster, which took a long while to sync but now reports running. Then mon.h was added for the same reason; mon.h fails to start in much the same way as mon.c does.

Still, that should leave us with 3/5 mons up. However, running "ceph daemon mon.{g,h} mon_status" on the respective nodes also blocks; the only output we get from those are fault messages.

And now mon.g has apparently crashed:

    2014-02-28 00:11:48.861263 7f4728042700 -1 mon/Monitor.cc: In function 'void Monitor::sync_timeout(entity_inst_t&)' thread 7f4728042700 time 2014-02-28 00:11:48.782305
    mon/Monitor.cc: 1099: FAILED assert(sync_state == SYNC_STATE_CHUNKS)
    ...

and it now blocks on startup much like c and h do.

Long story short: is it possible to add 0.61.9 mons to a cluster running 0.61.2 on the two alive mons and all the OSDs? I'm guessing this is our last shot at rescuing the cluster without downtime.

KR,
Marc
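
P.S. For reference, adding mon.g followed roughly the standard procedure sketched below; the monitor IP and the temp paths here are illustrative, not our actual values:

    # run on the new mon host; 10.0.0.7 and the /tmp paths are examples
    ceph auth get mon. -o /tmp/mon.keyring       # fetch the mon keyring
    ceph mon getmap -o /tmp/monmap               # fetch the current monmap
    ceph-mon -i g --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    ceph mon add g 10.0.0.7:6789                 # register mon.g in the monmap
    service ceph start mon.g                     # start the daemon; it then syncs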
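
P.P.S. In case it matters for debugging: as I understand it, the "ceph daemon" form above is just shorthand for talking to the monitor's local admin socket, so the direct invocation (assuming the default socket path, which may differ on your install) would be:

    # query mon.g via its admin socket; the .asok path assumes the default location
    ceph --admin-daemon /var/run/ceph/ceph-mon.g.asok mon_status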