Re: Disaster recovery of monitor

On 2013-06-14 19:59, Joao Eduardo Luis wrote:
On 06/14/2013 02:39 PM, peter@xxxxxxxxx wrote:
On 2013-06-13 20:10, peter@xxxxxxxxx wrote:
On 2013-06-13 18:57, Joao Eduardo Luis wrote:
On 06/13/2013 05:25 PM, peter@xxxxxxxxx wrote:
On 2013-06-13 18:06, Gregory Farnum wrote:
On Thursday, June 13, 2013, wrote:

Hello,
We ran into a problem with our test cluster after adding monitors. It now seems that our main monitor doesn't want to start anymore. The
logs are flooded with:
2013-06-13 11:41:05.316982 7f7689ca4780 7 mon.a@0(leader).osd e2809
update_from_paxos  applying incremental 2810
2013-06-13 11:41:05.317043 7f7689ca4780 1 mon.a@0(leader).osd e2809
e2809: 9 osds: 9 up, 9 in
2013-06-13 11:41:05.317064 7f7689ca4780 7 mon.a@0(leader).osd e2809
update_from_paxos  applying incremental 2810
Is this accurate? It's applying the *same* incremental over and over
again?
Yes, this is the current state:
Peter,
Can you point me to the full log of the monitor caught in this
apparent loop?
-Joao


Hi Joao,

Here it is:

http://www.2force.nl/ceph/ceph-mon.a.log.gz

Thanks,

Peter


Hi Joao,

Did you happen to figure out what is going on? If you need more log
files let me know.

Peter,

You can find all the updates on #5343 [1].

It is my understanding that you are running a test cluster; is this
correct?  If so, our suggestion is to start your monitor fresh.  We've
been able to figure out all the causes of this issue (thanks for your
help!):



- Injecting a monmap with a wrong fsid was the main culprit.  The
version you are on has a bug: the monitor is not killed when some of
the sanity checks run at startup fail, so the monitor was started even
though said fsid mismatch was present.  A fix for that will be hitting
0.61.4 soon, and has already hit master a few days back.

- There was a bug in OSDMonitor::update_from_paxos() that would
ignore the return from OSDMap::apply_incremental(), thus leading to
the infinite loop in case the incremental failed to be applied.  That
should go into master soon.
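
To illustrate the second point, here is a minimal Python sketch of that failure mode. It is a toy model, not the actual Ceph C++ code: the classes, the `-1` error convention, the `pending` dictionary, and the `max_attempts` guard (present only so the demo terminates) are all assumptions made for illustration. The loop that drops the return value never advances the epoch when the incremental is rejected, so it keeps retrying the same incremental; the variant that checks the return value fails loudly instead.

```python
from dataclasses import dataclass


@dataclass
class Incremental:
    fsid: str
    epoch: int


@dataclass
class OSDMap:
    fsid: str
    epoch: int

    def apply_incremental(self, inc):
        """Toy convention: return 0 on success, a negative value on failure."""
        if inc.fsid != self.fsid or inc.epoch != self.epoch + 1:
            return -1  # e.g. the incremental carries a mismatched fsid
        self.epoch = inc.epoch
        return 0


def update_ignoring_errors(osdmap, pending, latest, max_attempts=5):
    """Buggy shape: the return value is dropped, so a rejected incremental
    is retried forever. max_attempts is a hypothetical guard for the demo."""
    attempts = 0
    while osdmap.epoch < latest and attempts < max_attempts:
        inc = pending[osdmap.epoch + 1]
        osdmap.apply_incremental(inc)  # result ignored: epoch may not advance
        attempts += 1
    return attempts


def update_checking_errors(osdmap, pending, latest):
    """Checked shape: stop as soon as an incremental cannot be applied."""
    while osdmap.epoch < latest:
        inc = pending[osdmap.epoch + 1]
        if osdmap.apply_incremental(inc) < 0:
            raise RuntimeError(f"failed to apply incremental {inc.epoch}")


if __name__ == "__main__":
    stuck = OSDMap(fsid="good", epoch=2809)
    bad = {2810: Incremental(fsid="bad", epoch=2810)}
    print("attempts without progress:", update_ignoring_errors(stuck, bad, 2810))
    try:
        update_checking_errors(stuck, bad, 2810)
    except RuntimeError as err:
        print("checked version:", err)
```

In this sketch the map stays at epoch 2809 no matter how often the first loop runs, which matches the repeated "applying incremental 2810" lines in the log.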


However, with regard to getting the monitor running again, there's
little we can do at the moment.  We don't believe a fix to correct the
incremental's fsid is necessary, as this should never happen again once
the patches are in, and it shouldn't have happened in the first place
had the fsid in the injected monmap been correct.  So, if this is
indeed a test cluster, it would be better to just start off fresh;
otherwise, let me know and we may be able to put together a quick and
dirty fix to get your cluster back again.

Thanks!

  -Joao


[1] - http://tracker.ceph.com/issues/5343

Hi Joao,

You're welcome! Happy that we could help. I was at first hesitant to post to the mailing list because I thought it was just user error. In this case it seems that our user error uncovered a bug, or at least something that should never have happened :) So if anyone out there has the same feeling, just post. You never know what will come out of it.

Are there any other tips you might have for us and other users? Is it possible to keep a backup of the monitor directory? Or is it enough to make sure you have enough monitors? Can errors like this propagate to other monitors?

It would be really nice if there were tools that can help with disaster recovery, and some more documentation on this. I'm sure nobody would play around with their live cluster like we did, but strange things (and bugs) do tend to happen, and it is always nice to know if there is a way out. You don't want to end up with those petabytes sitting there :)

Thanks!

Peter

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



