On 2013-06-13 18:06, Gregory Farnum wrote:
On Thursday, June 13, 2013, wrote:
Hello,
We ran into a problem with our test cluster after adding monitors. It
now seems that our main monitor doesn't want to start anymore. The
logs are flooded with:
2013-06-13 11:41:05.316982 7f7689ca4780 7 mon.a@0(leader).osd e2809
update_from_paxos applying incremental 2810
2013-06-13 11:41:05.317043 7f7689ca4780 1 mon.a@0(leader).osd e2809
e2809: 9 osds: 9 up, 9 in
2013-06-13 11:41:05.317064 7f7689ca4780 7 mon.a@0(leader).osd e2809
update_from_paxos applying incremental 2810
Is this accurate? It's applying the *same* incremental over and over
again?
Yes, this is the current state:
2013-06-13 11:36:45.553793 7f5894e25700 7 mon.a@0(leader).osd e2809
update_from_paxos applying incremental 2810
2013-06-13 11:36:45.553846 7f5894e25700 1 mon.a@0(leader).osd e2809
e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.553869 7f5894e25700 7 mon.a@0(leader).osd e2809
update_from_paxos applying incremental 2810
2013-06-13 11:36:45.553922 7f5894e25700 1 mon.a@0(leader).osd e2809
e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.553950 7f5894e25700 7 mon.a@0(leader).osd e2809
update_from_paxos applying incremental 2810
2013-06-13 11:36:45.554002 7f5894e25700 1 mon.a@0(leader).osd e2809
e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.554025 7f5894e25700 7 mon.a@0(leader).osd e2809
update_from_paxos applying incremental 2810
2013-06-13 11:36:45.554076 7f5894e25700 1 mon.a@0(leader).osd e2809
e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.554098 7f5894e25700 7 mon.a@0(leader).osd e2809
update_from_paxos applying incremental 2810
2013-06-13 11:36:45.554154 7f5894e25700 1 mon.a@0(leader).osd e2809
e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.554177 7f5894e25700 7 mon.a@0(leader).osd e2809
update_from_paxos applying incremental 2810
2013-06-13 11:36:45.554228 7f5894e25700 1 mon.a@0(leader).osd e2809
e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.554251 7f5894e25700 7 mon.a@0(leader).osd e2809
update_from_paxos applying incremental 2810
2013-06-13 11:36:45.554302 7f5894e25700 1 mon.a@0(leader).osd e2809
e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.554325 7f5894e25700 7 mon.a@0(leader).osd e2809
update_from_paxos applying incremental 2810
2013-06-13 11:36:45.554376 7f5894e25700 1 mon.a@0(leader).osd e2809
e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.554406 7f5894e25700 7 mon.a@0(leader).osd e2809
update_from_paxos applying incremental 2810
2013-06-13 11:36:45.554459 7f5894e25700 1 mon.a@0(leader).osd e2809
e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.554482 7f5894e25700 7 mon.a@0(leader).osd e2809
update_from_paxos applying incremental 2810
2013-06-13 11:36:45.554532 7f5894e25700 1 mon.a@0(leader).osd e2809
e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.554555 7f5894e25700 7 mon.a@0(leader).osd e2809
update_from_paxos applying incremental 2810
2013-06-13 11:36:45.554606 7f5894e25700 1 mon.a@0(leader).osd e2809
e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.554629 7f5894e25700 7 mon.a@0(leader).osd e2809
update_from_paxos applying incremental 2810
2013-06-13 11:36:45.554682 7f5894e25700 1 mon.a@0(leader).osd e2809
e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.554705 7f5894e25700 7 mon.a@0(leader).osd e2809
update_from_paxos applying incremental 2810
2013-06-13 11:36:45.554755 7f5894e25700 1 mon.a@0(leader).osd e2809
e2809: 9 osds: 9 up, 9 in
2013-06-13 11:36:45.554778 7f5894e25700 7 mon.a@0(leader).osd e2809
update_from_paxos applying incremental 2810
and many many more. It seems as if the monitor is in a loop.
etc
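One rough way to confirm that the same incremental is being re-applied is to count how often each incremental number appears in the monitor log. This is only a sketch: the helper name is made up for illustration, and the log path in the usage line assumes a default location that may differ on your system.

```shell
# Hypothetical helper (not part of Ceph): count how many times each
# "applying incremental N" message appears in a log file, most frequent first.
count_incremental_repeats() {
    grep -o 'applying incremental [0-9][0-9]*' "$1" \
        | awk '{ print $3 }' \
        | sort \
        | uniq -c \
        | sort -rn
}

# Usage (the log path is an assumption):
# count_incremental_repeats /var/log/ceph/ceph-mon.a.log | head
```

A single incremental number with a very large count, and no higher numbers after it, would be consistent with the loop described above.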
When starting the monitor, after a while we get the following error:
service ceph start mon.a
=== mon.a ===
Starting Ceph mon.a on xxxxx...
[22037]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i a --pid-file
/var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on xxxx...
Is there a disaster recovery method for monitors? This is just a
test environment so I don't really care about the data but if
something like this happens on a production environment I would like
to know how to get it back (if at all possible).
We just upgraded to 0.61.3. Perhaps we ran into a bug. When adding
the monitors we just followed this guide:
http://ceph.com/docs/next/rados/operations/add-or-rm-mons/ [1]
After adding the monitors we ran into problems. We tried to fix them
with information we could find online and started playing with the
monmap, and I think this is where it went bad.
Started playing with the monmap? Please describe in more detail the
steps you took, and the monitors you had at each point.
I should have been more specific, sorry!
We ran commands like:
monmaptool --create --add a xxx.xxx.0.25:6789 --clobber --fsid
f52ed31a-ca64-48b6-bd61-e2192998cd2f monmap
ceph-mon -i a --inject-monmap monmap
I think we also might have changed the fsid by accident because we also
created monmaps without the --fsid parameter and injected these.
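Since the fsid may have been changed by accident, it could be worth comparing the fsid inside a monmap with the one in ceph.conf before injecting it. The following is a rough sketch, not a tested recovery procedure. It assumes that `monmaptool --print` emits a line of the form `fsid <uuid>` and that ceph.conf contains an `fsid = <uuid>` line; the helper names and file paths are illustrative only.

```shell
# Hypothetical helpers; the names are illustrative and not part of Ceph.

# Read `monmaptool --print <file>` output on stdin and emit the fsid value.
# Assumes the output has a line whose first field is exactly "fsid".
monmap_fsid() {
    awk '$1 == "fsid" { print $2; exit }'
}

# Read a ceph.conf on stdin and emit the value of the first "fsid = ..." line.
conf_fsid() {
    sed -n 's/^[[:space:]]*fsid[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*$/\1/p' \
        | head -n 1
}

# Usage (file names and paths are assumptions):
# a=$(monmaptool --print monmap | monmap_fsid)
# b=$(conf_fsid < /etc/ceph/ceph.conf)
# if [ "$a" = "$b" ]; then echo "fsid matches"; else echo "fsid mismatch: $a vs $b"; fi
```

If the two values differ, injecting that monmap would give the monitor a different cluster identity from the one the rest of the configuration expects.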
Are your other monitors working? If so it's easy enough to remove
this one, wipe it out, and add it back in. I'm curious about that
weird update loop, though, if you can help us look at that.
-Greg
We don't have any working monitors, that is the problem. Of course, I
can show you exactly what we did when adding the monitors. I can send
you logfiles and the history of commands we ran.
I first thought that my colleague did something wrong but I followed
the same procedure as stated on
http://ceph.com/docs/next/rados/operations/add-or-rm-mons/. We were able
to recover from that once but the second time we were not. So I think
the monitor is stuck in some weird state or some file in
/var/lib/ceph/mon/ceph-a/store.db/ got corrupted. We also tried to use
the ceph-monstore-tool command, but this didn't work either:
ceph-monstore-tool --mon-store-path . --key mon_sync:latest_monmap
get-val --out /tmp/mon_sync.monmap
terminate called after throwing an instance of
'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::program_options::unknown_option>
>'
what(): unknown option key
Aborted (core dumped)
Yep, we really messed it up :) I have no clue about recovering the data,
and there isn't any documentation on it either. Of course we could just
wipe and start over, but I'd really like to know if we can fix this, as a
good exercise.
Cheers,
Peter
Links:
------
[1] http://ceph.com/docs/next/rados/operations/add-or-rm-mons/
[2] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[3] http://ceph.com