Hello,
We ran into a problem with our test cluster after adding monitors. It now seems that our main monitor doesn't want to start anymore. The logs are flooded with:
2013-06-13 11:41:05.316982 7f7689ca4780 7 mon.a@0(leader).osd e2809 update_from_paxos applying incremental 2810
2013-06-13 11:41:05.317043 7f7689ca4780 1 mon.a@0(leader).osd e2809 e2809: 9 osds: 9 up, 9 in
2013-06-13 11:41:05.317064 7f7689ca4780 7 mon.a@0(leader).osd e2809 update_from_paxos applying incremental 2810
Is this accurate? It's applying the *same* incremental over and over again?
etc
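(If you can reproduce this, turning the monitor and paxos debug levels up before the next start attempt would show what that loop is actually doing. Something like the following in ceph.conf should do it; the levels here are only a suggestion:

[mon]
debug mon = 20
debug paxos = 20
debug ms = 1

The resulting log will be large, but it should show where the monitor keeps re-reading that same incremental.)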
When starting it, after a while we get the following error:
service ceph start mon.a
=== mon.a ===
Starting Ceph mon.a on xxxxx...
[22037]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192; /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on xxxx...
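(It may also be easier to see what is failing if you run the monitor in the foreground rather than through the init script, for example:

ceph-mon -i a -d -c /etc/ceph/ceph.conf

The -d flag keeps the daemon in the foreground and sends its log output to stderr, so the error shows up directly on your terminal.)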
Is there a disaster recovery method for monitors? This is just a test environment, so I don't really care about the data, but if something like this happens in a production environment I would like to know how to get it back (if at all possible).
We just upgraded to 0.61.3. Perhaps we ran into a bug. When adding the monitors we just followed this guide:
http://ceph.com/docs/next/rados/operations/add-or-rm-mons/
After adding the monitors we ran into problems, and we tried to fix them with information we could find online. We started playing with the monmap, and I think this is where it went bad.
Started playing with the monmap? Please describe in more detail the steps you took, and the monitors you had at each point.
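(If the remaining monitors can still form a quorum, dumping the monmap they are currently using would help reconstruct that; roughly:

ceph mon getmap -o /tmp/monmap
monmaptool --print /tmp/monmap

That prints the map's epoch and the monitors it contains, which you can compare against what you expected to have at each step.)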
Are your other monitors working? If so, it's easy enough to remove this one, wipe it out, and add it back in. I'm curious about that weird update loop, though, if you can help us look at that.
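Roughly, and assuming the other monitors still have quorum and the default /var/lib/ceph/mon/ceph-a data path (your mon data line didn't come through, so adjust as needed), that remove/wipe/re-add cycle would look something like:

# from a node that can reach the surviving quorum
ceph mon remove a

# on the broken monitor's host
service ceph stop mon.a
mv /var/lib/ceph/mon/ceph-a /var/lib/ceph/mon/ceph-a.bak

# rebuild the monitor from the quorum's current map and keys
# (with auth set to none the keyring is mostly a formality, but the
# documented procedure includes it)
ceph auth get mon. -o /tmp/keyring
ceph mon getmap -o /tmp/monmap
ceph-mon -i a --mkfs --monmap /tmp/monmap --keyring /tmp/keyring

# register it with the cluster again and start it
ceph mon add a xxx.xxx.0.25:6789
service ceph start mon.a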
-Greg
We are running ceph version 0.61.3 (92b1e398576d55df8e5888dd1a9545ed3fd99532)
/etc/ceph/ceph.conf is pretty simple for the monitor:
[global]
auth supported = none
auth cluster required = none
auth service required = none
auth client required = none
public network = xxx.xxx.0.0/24
cluster network = xxx.xxx.0.0/24
mon initial members = xxxxx
[osd]
osd journal size = 1000
[mds.a]
host = xxxxx
devs = /dev/sdb
mds data = "">
[mon.a]
host = xxxxx
mon addr = xxx.xxx.0.25:6789
mon data = "">
etc
Thanks for looking and if you need more info let me know.
Cheers,
Peter
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Software Engineer #42 @ http://inktank.com | http://ceph.com