Re: Disaster recovery of monitor

Hi guys ...

Thanks a lot for your support.
I discovered what happened.

I had 2 monitors, osnode01 and osnode02.
I tried to add a 3rd using ceph-deploy.

ceph-deploy was using a key different from the one used by my monitor cluster.

So, I added osnode08 to the monitor cluster and it did not become part of the quorum.
I removed it, and also removed osnode02, since the monitor count should be an odd number.

When I did that, my ceph stopped.
I re-added osnode02 to the monitor cluster.
The thing is that I added it using the wrong key; I don't know why ceph-deploy started using a different key.
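In case it helps others: the mon. secret in the keyring ceph-deploy uses has to be byte-identical to the one the running monitors use, or the new mon can never authenticate and never joins the quorum. A quick sketch of the check, using two made-up keyring files with fake placeholder keys (in real life you would compare /var/lib/ceph/mon/ceph-&lt;id&gt;/keyring on an existing monitor against ceph.mon.keyring in the ceph-deploy working directory):

```shell
# Two sample monitor keyrings standing in for the real files; the
# keys below are fake placeholders, not real cephx secrets.
cat > /tmp/mon-cluster.keyring <<'EOF'
[mon.]
    key = AAAAexamplesecretfromtheclusterAAAA==
    caps mon = "allow *"
EOF
cat > /tmp/ceph-deploy.keyring <<'EOF'
[mon.]
    key = BBBBexamplesecretfromcephdeployBBBB==
    caps mon = "allow *"
EOF

# Extract and compare the 'key =' lines; a mismatch is exactly the
# symptom in this thread (the new mon never joins the quorum).
# Prints "keys DIFFER" here, since the sample keys differ.
k1=$(grep 'key = ' /tmp/mon-cluster.keyring)
k2=$(grep 'key = ' /tmp/ceph-deploy.keyring)
if [ "$k1" = "$k2" ]; then echo "keys match"; else echo "keys DIFFER"; fi
```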

As suggested by Joao Eduardo, disabling auth let me bring part of ceph back up.
After that I troubleshot the key problem, solved it, and now my whole cluster is recovered and running fine ...
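For the record, the temporary workaround Joao suggested below is just a ceph.conf change followed by a monitor restart (it disables cephx entirely, so revert it once the keys are sorted out):

```
[global]
        auth supported = none
```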

Thanks a lot.
Jose Tavares


On Tue, Nov 17, 2015 at 11:13 AM, Jose Tavares <jat@xxxxxxxxxxxx> wrote:
Now I tried to inject the latest map I had.
Also, I created a second monitor on osnode02, like I had before, using the same map.
I started both monitors ...
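For completeness, the extract/inject cycle looks roughly like this (a sketch assuming the default mon data layout for ID osnode01 on a sysvinit system like 0.80; /tmp/monmap.latest is a placeholder for the saved map, and the monitor must be stopped first):

```shell
# Stop the monitor before touching its store:
sudo /etc/init.d/ceph stop mon.osnode01

# Optionally save the monmap currently in the store for safekeeping:
sudo ceph-mon -i osnode01 --extract-monmap /tmp/monmap.backup

# Inject the known-good monmap, then start the monitor again:
sudo ceph-mon -i osnode01 --inject-monmap /tmp/monmap.latest
sudo /etc/init.d/ceph start mon.osnode01
```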

Logs from osnode01 show my content ... and then they started to show lines like

2015-11-17 10:56:26.515069 7fc73af67700  0 mon.osnode01@0(probing).data_health(1) update_stats avail 19% total 220 GB, used 178 GB, avail 43178 MB

What does that mean?
Attached are the logs.

Thanks a lot.
Jose Tavares

On Tue, Nov 17, 2015 at 10:33 AM, Jose Tavares <jat@xxxxxxxxxxxx> wrote:


On Tue, Nov 17, 2015 at 7:27 AM, Joao Eduardo Luis <joao@xxxxxxx> wrote:
On 11/17/2015 03:56 AM, Jose Tavares wrote:
> The problem is that I think I don't have any good monitor anymore.
> How do I know if the map I am trying is ok?
>
> I also saw in the logs that the primary mon was trying to contact a
> removed mon at IP .112 .. So, I added .112 again ... and it didn't help.
>
> Attached are the logs of what is going on and some monmaps that I
> captured minutes before the cluster became inaccessible ..
>
> Should I try injecting these monmaps into my primary mon to see if it can
> recover the cluster?
> Is it possible to see if these monmaps match my content?

Without access to the actual store.db there's no way to ascertain if the
store has any problems, and even then figuring out a potential
corruption from just one monitor store.db would either be impossible or
impractical.

I posted my store.db in my previous answer ..

 

That said, from the log you attached it seems you only have issues with
authentication: you have pgmaps from epoch 91923 through to 92589, you
have an mds map (epoch 38), osdmaps at least through epoch 307, and 40
versions for the auth keys.

Somehow, though, your monitors are unable to authenticate each other. No
way to tell if that was corruption or user error.

You should be able to get your monitors back to speaking terms again
simply by disabling cephx temporarily. Then you can figure out whatever
you need to figure out in terms of monitor keys.

Just update your ceph.conf with 'auth supported = none' and restart the
monitors. See how it goes from there.

I tried your suggestion and it didn't make any change to the results .. :(

Thanks a lot.
Jose Tavares 

 
HTH

  -Joao



>
> Thanks a lot.
> Jose Tavares
>
>
>
>
>
> On Mon, Nov 16, 2015 at 10:48 PM, Nathan Harper
> <nathan.harper@xxxxxxxxxxx <mailto:nathan.harper@xxxxxxxxxxx>> wrote:
>
>     I had to go through a similar process when we had a disaster which
>     destroyed one of our monitors.   I followed the process here:
>     REMOVING MONITORS FROM AN UNHEALTHY CLUSTER
>     <http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-mons/> to
>     remove all but one monitor, which let me bring the cluster back up.
>
>     As you are running an older version of Ceph than hammer, some of the
>     commands might differ (perhaps this might
>     help http://docs.ceph.com/docs/v0.80/rados/operations/add-or-rm-mons/)
>
>
>     --
>     *Nathan Harper*// IT Systems Architect
>
>     *e: * nathan.harper@xxxxxxxxxxx <mailto:nathan.harper@xxxxxxxxxxx>
>     // *t: * 0117 906 1104 // *m: * 07875 510891 // *w: *
>     www.cfms.org.uk <http://www.cfms.org.uk> // LinkedIn
>     <http://uk.linkedin.com/pub/nathan-harper/21/696/b81>
>     CFMS Services Ltd// Bristol & Bath Science Park // Dirac Crescent //
>     Emersons Green // Bristol // BS16 7FR
>
>     CFMS Services Ltd is registered in England and Wales No 05742022 - a
>     subsidiary of CFMS Ltd
>     CFMS Services Ltd registered office // Victoria House // 51 Victoria
>     Street // Bristol // BS1 6AD
>
>     On 16 November 2015 at 16:50, Jose Tavares <jat@xxxxxxxxxxxx
>     <mailto:jat@xxxxxxxxxxxx>> wrote:
>
>         Hi guys ...
>         I need some help as my cluster seems to be corrupted.
>
>         I saw here ..
>         https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg01919.html
>         .. a msg from 2013 where Peter had a problem with his monitors.
>
>         I had the same problem today when trying to add a new monitor,
>         and then playing with the monmap, as the monitors were not entering
>         the quorum. I'm using version 0.80.8.
>
>         Right now my cluster won't start because of a corrupted monitor.
>         Is it possible to remove all monitors and create just a new one
>         without losing data? I have ~260GB of data with work from 2 weeks.
>
>         What should I do? Do you recommend any specific procedure?
>
>         Thanks a lot.
>         Jose Tavares
>
>         _______________________________________________
>         ceph-users mailing list
>         ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
>         http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
>
>



