Re: Disaster recovery of monitor


 



On 2013-06-13 18:06, Gregory Farnum wrote:
On Thursday, June 13, 2013, wrote:

Hello,
We ran into a problem with our test cluster after adding monitors. It now seems that our main monitor doesn't want to start anymore. The logs are flooded with:

2013-06-13 11:41:05.316982 7f7689ca4780  7 mon.a@0(leader).osd e2809 update_from_paxos  applying incremental 2810
2013-06-13 11:41:05.317043 7f7689ca4780  1 mon.a@0(leader).osd e2809 e2809: 9 osds: 9 up, 9 in
2013-06-13 11:41:05.317064 7f7689ca4780  7 mon.a@0(leader).osd e2809 update_from_paxos  applying incremental 2810
Is this accurate? It's applying the *same* incremental over and over again?

Yes, this is the current state:

2013-06-13 11:36:45.553793 7f5894e25700 7 mon.a@0(leader).osd e2809 update_from_paxos applying incremental 2810
2013-06-13 11:36:45.553846 7f5894e25700 1 mon.a@0(leader).osd e2809 e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.553869 7f5894e25700 7 mon.a@0(leader).osd e2809 update_from_paxos applying incremental 2810
2013-06-13 11:36:45.553922 7f5894e25700 1 mon.a@0(leader).osd e2809 e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.553950 7f5894e25700 7 mon.a@0(leader).osd e2809 update_from_paxos applying incremental 2810
2013-06-13 11:36:45.554002 7f5894e25700 1 mon.a@0(leader).osd e2809 e2809: 9 osds: 9 up, 9 in
[... the same pair of lines repeats, several times per millisecond ...]

and many, many more. It seems the monitor is stuck in a loop.
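To quantify it, the repetition can be counted straight from a saved copy of the log. A minimal sketch against an illustrative two-line excerpt (the file path is hypothetical):

```shell
# Illustrative excerpt of the repeating mon log (path is a placeholder)
cat > /tmp/mon_a_excerpt.log <<'EOF'
2013-06-13 11:36:45.553793 7f5894e25700 7 mon.a@0(leader).osd e2809 update_from_paxos applying incremental 2810
2013-06-13 11:36:45.553869 7f5894e25700 7 mon.a@0(leader).osd e2809 update_from_paxos applying incremental 2810
EOF

# A healthy monitor applies each incremental epoch once; a large count
# against a single epoch number confirms the update loop.
grep -o 'applying incremental [0-9]*' /tmp/mon_a_excerpt.log | sort | uniq -c
```

Run against the real log, the count for epoch 2810 climbs continuously while the map epoch never advances past e2809.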


 

etc.

When starting the monitor, after a while we get the following error:
service ceph start mon.a
=== mon.a ===
Starting Ceph mon.a on xxxxx...
[22037]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on xxxx...
Is there a disaster recovery method for monitors? This is just a test environment, so I don't really care about the data, but if something like this happens in a production environment I would like to know how to get it back (if at all possible). We just upgraded to 0.61.3, so perhaps we ran into a bug. When adding the monitors we just followed this guide:
http://ceph.com/docs/next/rados/operations/add-or-rm-mons/ [1]
After adding the monitors we ran into problems. We tried to fix them with information we could find online and started playing with the monmap; I think this is where it went bad.
Started playing with the monmap? Please describe in more detail the
steps you took, and the monitors you had at each point.

I should have been more specific, sorry!

We ran commands like:

monmaptool --create --add a xxx.xxx.0.25:6789 --clobber --fsid f52ed31a-ca64-48b6-bd61-e2192998cd2f monmap

ceph-mon -i a --inject-monmap monmap

I think we may also have changed the fsid by accident, because we also created monmaps without the --fsid parameter and injected those.
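In hindsight, a safer sequence would have been to extract and inspect the map the monitor already had, rather than building a fresh one with --clobber. A sketch, assuming the default paths and that the monitor is stopped first (the mon.b address is a placeholder):

```shell
# Stop the monitor, then pull out the map it currently has
service ceph stop mon.a
ceph-mon -i a --extract-monmap /tmp/monmap

# Inspect the epoch, fsid and members before changing anything
monmaptool --print /tmp/monmap

# Edit the extracted map (example: add mon.b; address is a placeholder),
# which keeps the original fsid intact
monmaptool --add b xxx.xxx.0.26:6789 /tmp/monmap

# Inject the edited map back and restart
ceph-mon -i a --inject-monmap /tmp/monmap
service ceph start mon.a
```

Because the edit starts from the extracted map, the fsid and existing members can never be silently replaced the way --create --clobber can replace them.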

Are your other monitors working? If so it's easy enough to remove
this one, wipe it out, and add it back in. I'm curious about that
weird update loop, though, if you can help us look at that.
-Greg

We don't have any working monitors; that is the problem. Of course, I can show you exactly what we did when adding the monitors, and I can send you log files and the history of the commands we ran.

I first thought that my colleague had done something wrong, but I followed the same procedure as stated on http://ceph.com/docs/next/rados/operations/add-or-rm-mons/. We were able to recover from that once, but the second time we were not. So I think the monitor is stuck in some weird state, or some file in /var/lib/ceph/mon/ceph-a/store.db/ got corrupted. We also tried to use the ceph-monstore-tool command, but this didn't work either:

ceph-monstore-tool --mon-store-path . --key mon_sync:latest_monmap get-val --out /tmp/mon_sync.monmap
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::program_options::unknown_option> >'
  what():  unknown option key
Aborted (core dumped)
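Before we experiment further we'll snapshot the store, so that each recovery attempt can be rolled back. A sketch, assuming the default data directory for mon.a:

```shell
# Stop the daemon first: never copy a live leveldb store
service ceph stop mon.a

# Timestamped tarball of the whole mon data dir (default path assumed)
tar czf /root/mon-a-$(date +%Y%m%d-%H%M%S).tar.gz -C /var/lib/ceph/mon ceph-a
```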

Yep, we really messed it up :) I have no clue about recovering the data, as there also isn't any documentation on it. Of course we could just wipe and start over, but I'd really like to know if we can fix this, as a good exercise.
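If we do end up wiping, re-creating mon.a from a fresh store would look roughly like this. A sketch only: the monmap and keyring paths are placeholders for copies saved while the cluster was still healthy, and without such copies this cannot be applied as written:

```shell
# Move the broken store aside rather than deleting it outright,
# so it stays available for post-mortem analysis
mv /var/lib/ceph/mon/ceph-a /var/lib/ceph/mon/ceph-a.broken

# Rebuild the mon from a known-good monmap and the mon. keyring
# (both paths are placeholders for files saved beforehand)
ceph-mon --mkfs -i a --monmap /tmp/monmap --keyring /tmp/mon.keyring
service ceph start mon.a
```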

Cheers,

Peter

 
Links:
------
[1] http://ceph.com/docs/next/rados/operations/add-or-rm-mons/
[2] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[3] http://ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




