Re: Disaster recovery of monitor


 



On 2013-06-13 18:06, Gregory Farnum wrote:
On Thursday, June 13, 2013, wrote:

Hello,
We ran into a problem with our test cluster after adding monitors. It now seems that our main monitor doesn't want to start anymore. The logs are flooded with:

2013-06-13 11:41:05.316982 7f7689ca4780  7 mon.a@0(leader).osd e2809 update_from_paxos  applying incremental 2810
2013-06-13 11:41:05.317043 7f7689ca4780  1 mon.a@0(leader).osd e2809 e2809: 9 osds: 9 up, 9 in
2013-06-13 11:41:05.317064 7f7689ca4780  7 mon.a@0(leader).osd e2809 update_from_paxos  applying incremental 2810
Is this accurate? It's applying the *same* incremental over and over again?

Yes, this is the current state:

2013-06-13 11:36:45.553793 7f5894e25700 7 mon.a@0(leader).osd e2809 update_from_paxos applying incremental 2810
2013-06-13 11:36:45.553846 7f5894e25700 1 mon.a@0(leader).osd e2809 e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.553869 7f5894e25700 7 mon.a@0(leader).osd e2809 update_from_paxos applying incremental 2810
2013-06-13 11:36:45.553922 7f5894e25700 1 mon.a@0(leader).osd e2809 e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.553950 7f5894e25700 7 mon.a@0(leader).osd e2809 update_from_paxos applying incremental 2810
2013-06-13 11:36:45.554002 7f5894e25700 1 mon.a@0(leader).osd e2809 e2809: 9 osds: 9 up, 9 in
[... the same pair of lines repeats, several times per millisecond ...]

and many, many more. It seems the monitor is stuck in a loop.
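To quantify it, the repetition can be counted straight from a saved copy of the log. A minimal sketch against an illustrative two-line excerpt (the file path is hypothetical):

```shell
# Illustrative excerpt of the repeating mon log (path is a placeholder)
cat > /tmp/mon_a_excerpt.log <<'EOF'
2013-06-13 11:36:45.553793 7f5894e25700 7 mon.a@0(leader).osd e2809 update_from_paxos applying incremental 2810
2013-06-13 11:36:45.553869 7f5894e25700 7 mon.a@0(leader).osd e2809 update_from_paxos applying incremental 2810
EOF

# A healthy monitor applies each incremental epoch once; a large count
# against a single epoch number confirms the update loop.
grep -o 'applying incremental [0-9]*' /tmp/mon_a_excerpt.log | sort | uniq -c
```

Run against the real log, the count for epoch 2810 climbs continuously while the map epoch never advances past e2809.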


 

etc.

When starting the monitor, after a while we get the following error:
service ceph start mon.a
=== mon.a ===
Starting Ceph mon.a on xxxxx...
[22037]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on xxxx...
Is there a disaster recovery method for monitors? This is just a test environment, so I don't really care about the data, but if something like this happens in a production environment I would like to know how to get it back (if at all possible). We just upgraded to 0.61.3, so perhaps we ran into a bug. When adding the monitors we just followed this guide:
http://ceph.com/docs/next/rados/operations/add-or-rm-mons/ [1]
After adding the monitors we ran into problems. We tried to fix them with information we could find online and started playing with the monmap; I think this is where it went bad.
Started playing with the monmap? Please describe in more detail the
steps you took, and the monitors you had at each point.

I should have been more specific, sorry!

We ran commands like:

monmaptool --create --add a xxx.xxx.0.25:6789 --clobber --fsid f52ed31a-ca64-48b6-bd61-e2192998cd2f monmap

ceph-mon -i a --inject-monmap monmap

I think we may also have changed the fsid by accident, because we also created monmaps without the --fsid parameter and injected those.
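In hindsight, a safer sequence would have been to extract and inspect the map the monitor already had, rather than building a fresh one with --clobber. A sketch, assuming the default paths and that the monitor is stopped first (the mon.b address is a placeholder):

```shell
# Stop the monitor, then pull out the map it currently has
service ceph stop mon.a
ceph-mon -i a --extract-monmap /tmp/monmap

# Inspect the epoch, fsid and members before changing anything
monmaptool --print /tmp/monmap

# Edit the extracted map (example: add mon.b; address is a placeholder),
# which keeps the original fsid intact
monmaptool --add b xxx.xxx.0.26:6789 /tmp/monmap

# Inject the edited map back and restart
ceph-mon -i a --inject-monmap /tmp/monmap
service ceph start mon.a
```

Because the edit starts from the extracted map, the fsid and existing members can never be silently replaced the way --create --clobber can replace them.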

Are your other monitors working? If so it's easy enough to remove
this one, wipe it out, and add it back in. I'm curious about that
weird update loop, though, if you can help us look at that.
-Greg

We don't have any working monitors; that is the problem. Of course, I can show you exactly what we did when adding the monitors, and I can send you log files and the history of the commands we ran.

I first thought that my colleague had done something wrong, but I followed the same procedure as stated on http://ceph.com/docs/next/rados/operations/add-or-rm-mons/. We were able to recover from that once, but the second time we were not. So I think the monitor is stuck in some weird state, or some file in /var/lib/ceph/mon/ceph-a/store.db/ got corrupted. We also tried to use the ceph-monstore-tool command, but this didn't work either:

ceph-monstore-tool --mon-store-path . --key mon_sync:latest_monmap get-val --out /tmp/mon_sync.monmap
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::program_options::unknown_option> >'
  what():  unknown option key
Aborted (core dumped)
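Before we experiment further we'll snapshot the store, so that each recovery attempt can be rolled back. A sketch, assuming the default data directory for mon.a:

```shell
# Stop the daemon first: never copy a live leveldb store
service ceph stop mon.a

# Timestamped tarball of the whole mon data dir (default path assumed)
tar czf /root/mon-a-$(date +%Y%m%d-%H%M%S).tar.gz -C /var/lib/ceph/mon ceph-a
```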

Yep, we really messed it up :) I have no clue about recovering the data, as there also isn't any documentation on it. Of course we could just wipe and start over, but I'd really like to know if we can fix this, as a good exercise.
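If we do end up wiping, re-creating mon.a from a fresh store would look roughly like this. A sketch only: the monmap and keyring paths are placeholders for copies saved while the cluster was still healthy, and without such copies this cannot be applied as written:

```shell
# Move the broken store aside rather than deleting it outright,
# so it stays available for post-mortem analysis
mv /var/lib/ceph/mon/ceph-a /var/lib/ceph/mon/ceph-a.broken

# Rebuild the mon from a known-good monmap and the mon. keyring
# (both paths are placeholders for files saved beforehand)
ceph-mon --mkfs -i a --monmap /tmp/monmap --keyring /tmp/mon.keyring
service ceph start mon.a
```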

Cheers,

Peter

 
Links:
------
[1] http://ceph.com/docs/next/rados/operations/add-or-rm-mons/
[2] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[3] http://ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




