Re: Disaster recovery of monitor

On 06/14/2013 02:39 PM, peter@xxxxxxxxx wrote:
On 2013-06-13 20:10, peter@xxxxxxxxx wrote:
On 2013-06-13 18:57, Joao Eduardo Luis wrote:
On 06/13/2013 05:25 PM, peter@xxxxxxxxx wrote:
On 2013-06-13 18:06, Gregory Farnum wrote:
On Thursday, June 13, 2013, wrote:

Hello,
We ran into a problem with our test cluster after adding monitors. It
now seems that our main monitor doesn't want to start anymore. The
logs are flooded with:
2013-06-13 11:41:05.316982 7f7689ca4780  7 mon.a@0(leader).osd e2809
update_from_paxos  applying incremental 2810
2013-06-13 11:41:05.317043 7f7689ca4780  1 mon.a@0(leader).osd e2809
e2809: 9 osds: 9 up, 9 in
2013-06-13 11:41:05.317064 7f7689ca4780  7 mon.a@0(leader).osd e2809
update_from_paxos  applying incremental 2810
Is this accurate? It's applying the *same* incremental over and over
again?
Yes, this is the current state:
Peter,
Can you point me to the full log of the monitor caught in this
apparent loop?
-Joao


Hi Joao,

Here it is:

http://www.2force.nl/ceph/ceph-mon.a.log.gz

Thanks,

Peter


Hi Joao,

Did you happen to figure out what is going on? If you need more log
files let me know.

Peter,

You can find all the updates on #5343 [1].

It is my understanding that you are running a test cluster; is this correct? If so, our suggestion is to start your monitor fresh. We've been able to figure out all the causes of this issue (thanks for your help!):

- Injecting a monmap with a wrong fsid was the main culprit. The version you are running has a bug: the monitor is not killed when some sanity checks fail at startup, so it started even though the fsid mismatch was present. A fix for that will be hitting 0.61.4 soon, and hit master a few days back.

- There was a bug in OSDMonitor::update_from_paxos() that ignored the return value of OSDMap::apply_incremental(), leading to an infinite loop whenever the incremental failed to apply. The fix should go into master soon.
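
To make the interaction between the two causes above concrete, here is a minimal Python sketch. It is not the Ceph code (which is C++); the class and function names, the return-code convention, and the max_steps cap are all assumptions made for illustration. It only models the idea: an incremental carrying a mismatched fsid fails to apply, and a caller that ignores the failure keeps retrying the same epoch.

```python
from dataclasses import dataclass


@dataclass
class Incremental:
    """Hypothetical stand-in for an OSD map incremental."""
    epoch: int
    fsid: str


class OSDMap:
    """Hypothetical stand-in for the OSD map held by the monitor."""

    def __init__(self, fsid: str, epoch: int) -> None:
        self.fsid = fsid
        self.epoch = epoch

    def apply_incremental(self, inc: Incremental) -> int:
        # Assumed convention: 0 on success, negative on failure.
        if inc.fsid != self.fsid or inc.epoch != self.epoch + 1:
            return -1
        self.epoch = inc.epoch
        return 0


def update_ignoring_errors(osdmap, incrementals, latest, max_steps=5):
    """Buggy pattern: the return value is dropped, so a failed apply
    leaves the epoch unchanged and the loop retries the same incremental.
    max_steps exists only so this demo terminates."""
    steps = 0
    while osdmap.epoch < latest and steps < max_steps:
        osdmap.apply_incremental(incrementals[osdmap.epoch + 1])
        steps += 1
    return steps


def update_checking_errors(osdmap, incrementals, latest):
    """Fixed pattern: stop as soon as an incremental fails to apply."""
    while osdmap.epoch < latest:
        inc = incrementals[osdmap.epoch + 1]
        if osdmap.apply_incremental(inc) < 0:
            raise RuntimeError(f"failed to apply incremental {inc.epoch}")


if __name__ == "__main__":
    # Map at e2809; incremental 2810 carries a different (made-up) fsid.
    stuck = OSDMap(fsid="aaaa", epoch=2809)
    bad = {2810: Incremental(epoch=2810, fsid="bbbb")}
    print(update_ignoring_errors(stuck, bad, latest=2810), stuck.epoch)
```

With the mismatched fsid, the unchecked loop uses up every allowed step while the map stays at e2809, which mirrors the repeated "applying incremental 2810" lines in the log; the checked variant fails fast instead.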


However, with regard to getting the monitor running again, there's little we can do at the moment. We don't believe a fix to correct the incremental's fsid is necessary: this should never happen again once the patches are in, and it would not have happened in the first place had the fsid in the injected monmap been correct. So, if this is indeed a test cluster, it would be better to just start off fresh; otherwise, let me know and we may be able to put together a quick and dirty fix to get your cluster back again.

Thanks!

  -Joao


[1] - http://tracker.ceph.com/issues/5343


--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



