So our datacenter lost power and two of our three monitors died with filesystem corruption. I tried to repair them, but it looks like their store.db didn't make it.
--
I backed up the broken mon directories and rebuilt them from the working monitor's map:
sudo mv /var/lib/ceph/mon/ceph-$(hostname){,.BAK}
sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename}
ceph-mon -i `hostname` --extract-monmap /tmp/monmap
ceph-mon -i {mon-id} --inject-monmap {map-path}
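For reference, this is how I understand the extract/inject step is supposed to go -- with the mon daemon stopped around it and monmaptool used to eyeball the result. The hostname and /tmp/monmap path are just the values from above, so please correct me if I have this wrong:

sudo systemctl stop ceph-mon@$(hostname)          # or the equivalent for your init system
sudo ceph-mon -i $(hostname) --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap                    # should show the fsid, epoch and mon addresses
sudo ceph-mon -i $(hostname) --inject-monmap /tmp/monmap
sudo systemctl start ceph-mon@$(hostname)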
For a brief moment I had a quorum, but any ceph CLI command resulted in cephx errors. Now the two monitors that had failed have formed a quorum of their own, and the monitor that was still working keeps getting kicked out of the cluster:
'''
{
    "election_epoch": 402,
    "quorum": [
        0,
        1
    ],
    "quorum_names": [
        "kh11-8",
        "kh12-8"
    ],
    "quorum_leader_name": "kh11-8",
    "monmap": {
        "epoch": 1,
        "fsid": "a6ae50db-5c71-4ef8-885e-8137c7793da8",
        "modified": "0.000000",
        "created": "0.000000",
        "mons": [
            {
                "rank": 0,
                "name": "kh11-8",
                "addr": "10.64.64.134:6789\/0"
            },
            {
                "rank": 1,
                "name": "kh12-8",
                "addr": "10.64.64.143:6789\/0"
            },
            {
                "rank": 2,
                "name": "kh13-8",
                "addr": "10.64.64.151:6789\/0"
            }
        ]
    }
}
'''
At this point I am not sure what to do: any ceph command returns a cephx error, and I can't verify whether the new "quorum" is actually valid.
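Would querying each mon over its local admin socket (which, as far as I know, does not go through cephx) be a valid way to check? Something like this, assuming the mon id is the short hostname and the default socket path:

sudo ceph daemon mon.$(hostname) mon_status
sudo ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname).asok quorum_status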
Is there any way to regenerate the cephx authentication keys, or recover them with hardware access to the nodes? Any other advice on recovering from what looks like a complete monitor failure would be appreciated.
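For what it's worth, assuming default paths, I think the only keys I still have on disk are the mon. keyring in each monitor's data dir and, possibly, an admin keyring under /etc/ceph:

sudo cat /var/lib/ceph/mon/ceph-$(hostname)/keyring           # mon. key used between the monitors
sudo cat /etc/ceph/ceph.client.admin.keyring                  # client.admin key, if it was ever deployed here
sudo ceph-authtool -l /etc/ceph/ceph.client.admin.keyring     # same contents, listed via ceph-authtool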
- Sean