On 06/14/2013 02:39 PM, peter@xxxxxxxxx wrote:
On 2013-06-13 20:10, peter@xxxxxxxxx wrote:
On 2013-06-13 18:57, Joao Eduardo Luis wrote:
On 06/13/2013 05:25 PM, peter@xxxxxxxxx wrote:
On 2013-06-13 18:06, Gregory Farnum wrote:
On Thursday, June 13, 2013, wrote:
Hello,
We ran into a problem with our test cluster after adding monitors. It
now seems that our main monitor doesn't want to start anymore. The
logs are flooded with:
2013-06-13 11:41:05.316982 7f7689ca4780 7 mon.a@0(leader).osd e2809 update_from_paxos applying incremental 2810
2013-06-13 11:41:05.317043 7f7689ca4780 1 mon.a@0(leader).osd e2809 e2809: 9 osds: 9 up, 9 in
2013-06-13 11:41:05.317064 7f7689ca4780 7 mon.a@0(leader).osd e2809 update_from_paxos applying incremental 2810
Is this accurate? It's applying the *same* incremental over and over
again?
Yes, this is the current state:
Peter,
Can you point me to the full log of the monitor caught in this
apparent loop?
-Joao
Hi Joao,
Here it is:
http://www.2force.nl/ceph/ceph-mon.a.log.gz
Thanks,
Peter
Hi Joao,
Did you happen to figure out what is going on? If you need more log
files let me know.
Peter,
You can find all the updates on #5343 [1].
It is my understanding that you are running a test cluster; is this
correct? If so, our suggestion is to start your monitor fresh. We've
been able to figure out all the causes for this issue (thanks for your
help!):
- Injecting a monmap with a wrong fsid was the main culprit. Because you
are on a version with a bug that fails to kill the monitor when some
startup sanity checks fail, the monitor started even though the fsid
mismatch was present. A fix for that will be hitting 0.61.4 soon, and
has already hit master a few days back.
- There was a bug in OSDMonitor::update_from_paxos() that would ignore
the return value from OSDMap::apply_incremental(), thus leading to the
infinite loop whenever the incremental failed to be applied. The fix for
that should go into master soon.
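To illustrate how the two causes combine, here is a minimal sketch. It
uses toy types, not the actual Ceph code: ToyOSDMap, ToyIncremental and
toy_update_from_paxos() are hypothetical stand-ins, and the assumption
that a failed apply leaves the epoch untouched is mine. The point is
only that the loop makes progress solely through a successful apply, so
a rejected incremental (for example one carrying the wrong fsid) must
end the loop rather than be retried forever.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-ins for illustration only; these are NOT the real Ceph classes.
struct ToyIncremental {
    std::string fsid;   // cluster id the incremental was generated for
    unsigned epoch;     // epoch the map reaches once this is applied
};

struct ToyOSDMap {
    std::string fsid;
    unsigned epoch;

    // Returns 0 on success or a negative errno on failure.
    // On failure the map (and its epoch) is left unchanged.
    int apply_incremental(const ToyIncremental& inc) {
        if (inc.fsid != fsid) return -EINVAL;         // wrong cluster
        if (inc.epoch != epoch + 1) return -EINVAL;   // not the next epoch
        epoch = inc.epoch;
        return 0;
    }
};

// Brings `map` up to epoch `latest`, assuming log[i] holds the
// incremental for epoch i + 1.
int toy_update_from_paxos(ToyOSDMap& map,
                          const std::vector<ToyIncremental>& log,
                          unsigned latest) {
    while (map.epoch < latest) {
        std::size_t idx = map.epoch;
        if (idx >= log.size()) return -ENOENT;
        int r = map.apply_incremental(log[idx]);
        // If this return value were ignored, map.epoch would never
        // advance and the loop would retry the same incremental forever.
        if (r < 0) return r;
        assert(map.epoch == log[idx].epoch);
    }
    return 0;
}
```

With a log whose second entry carries a different fsid, the function
applies epoch 1, then returns -EINVAL with the map still at epoch 1
instead of spinning on epoch 2.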
However, with regard to getting the monitor running again, there's
little we can do at the moment. We don't believe a fix to correct the
incremental's fsid is necessary, as this should never happen again once
the patches are in, and it wouldn't have happened in the first place
had the fsid in the injected monmap been correct. So, if this is
indeed a test cluster, it would be better to just start off fresh;
otherwise, let me know and we may be able to put together a quick and
dirty fix to get your cluster back up.
Thanks!
-Joao
[1] - http://tracker.ceph.com/issues/5343
--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com