Re: Ceph MON can no longer join quorum

Greg Poirier <greg.poirier@xxxxxxxxxx> · Wed, 5 Feb 2014 09:02:34 -0800

Hi Karan,
I resolved it the same way you did. We had a network partition that caused the MON to die, it appears.

I'm running 0.72.1.

It would be nice if redeploying wasn't the solution, but if it's simply cleaner to do so, then I will continue along that route.

I think what's more troubling is that when this occurred we lost all connectivity to the Ceph cluster.

On Wed, Feb 5, 2014 at 1:11 AM, Karan Singh <ksingh@xxxxxx> wrote:

Hi Greg

I have seen this problem before in my cluster.

What Ceph version are you running?
Did you make any recent change in the cluster that could have resulted in this problem?

You identified it correctly: the only problem is that ceph-mon-2003 is listening on the wrong port. It should listen on port 6789, like the other two monitors. I resolved it by cleanly removing the affected monitor node and adding it back to the cluster.
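
For reference, a minimal sketch of what removing and re-adding the monitor could look like. It assumes the manual add/remove monitor procedure from the Ceph documentation, not necessarily the exact steps used here. The function name redeploy_plan, the temporary file names, and the address 10.30.66.15:6789 (the monitor's IP from the monmap with the default port) are assumptions for illustration. The script only prints the commands; it does not run them.

```python
import shlex


def redeploy_plan(mon_id, addr, tmp_dir="/tmp"):
    """Build (but do not run) the commands to remove and re-add a monitor.

    Hypothetical helper: the file names under tmp_dir are illustrative.
    Stop the monitor daemon and move its old data directory aside
    before running the --mkfs step, then start the daemon afterwards.
    """
    keyring = f"{tmp_dir}/mon.keyring"
    monmap = f"{tmp_dir}/monmap"
    return [
        # Drop the broken monitor from the cluster's monmap.
        ["ceph", "mon", "remove", mon_id],
        # Fetch the monitor keyring and the current monmap from the quorum.
        ["ceph", "auth", "get", "mon.", "-o", keyring],
        ["ceph", "mon", "getmap", "-o", monmap],
        # Recreate the monitor's data store from that map and keyring.
        ["ceph-mon", "-i", mon_id, "--mkfs",
         "--monmap", monmap, "--keyring", keyring],
        # Register the monitor again, this time on the expected address.
        ["ceph", "mon", "add", mon_id, addr],
    ]


for argv in redeploy_plan("ceph-mon-2003", "10.30.66.15:6789"):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping the steps as a printed plan makes it easy to review each command against the documentation for the running release before executing anything on a degraded cluster.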

Regards
Karan

From: "Greg Poirier" <greg.poirier@xxxxxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx

Sent: Tuesday, 4 February, 2014 10:50:21 PM
Subject: [ceph-users] Ceph MON can no longer join quorum

I have a MON that at some point lost connectivity to the rest of the cluster and now cannot rejoin.

Each time I restart it, it looks like it's attempting to create a new MON and join the cluster, but the rest of the cluster rejects it, because the new one isn't in the monmap.

I don't know why it suddenly decided it needed to be a new MON.

I am not really sure where to start. 

root@ceph-mon-2003:/var/log/ceph# ceph -s

    cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
     health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2 pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1 ceph-mon-2001,ceph-mon-2002

     monmap e2: 3 mons at {ceph-mon-2001=10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0}, election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002

Notice ceph-mon-2003:6800
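
The same check can be done mechanically. The following is a rough sketch that parses the monmap line from the ceph -s output above and reports any monitor not registered on port 6789. The function names parse_monmap and mons_on_unexpected_port are made up for this example, and the parsing assumes the name=ip:port/nonce layout shown in that output.

```python
import re

DEFAULT_MON_PORT = 6789  # port the other two monitors use in the output above


def parse_monmap(line):
    """Return {mon_name: (ip, port)} from a 'monmap eN: ... {name=ip:port/nonce,...}' line."""
    match = re.search(r"\{([^}]*)\}", line)
    if match is None:
        return {}
    mons = {}
    for entry in match.group(1).split(","):
        if "=" not in entry:
            continue  # skip empty or malformed entries
        name, addr = entry.split("=", 1)
        hostport = addr.split("/", 1)[0]  # drop the trailing "/0" nonce
        ip, port = hostport.rsplit(":", 1)
        mons[name.strip()] = (ip, int(port))
    return mons


def mons_on_unexpected_port(line, expected=DEFAULT_MON_PORT):
    """Return only the monitors whose registered port differs from `expected`."""
    return {name: addr for name, addr in parse_monmap(line).items()
            if addr[1] != expected}


line = ("monmap e2: 3 mons at {ceph-mon-2001=10.30.66.13:6789/0,"
        "ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0}, "
        "election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002")
print(mons_on_unexpected_port(line))
# prints {'ceph-mon-2003': ('10.30.66.15', 6800)}
```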

If I try to start ceph-mon-all, it will be listening on some other port...

root@ceph-mon-2003:/var/log/ceph# start ceph-mon-all

ceph-mon-all start/running
root@ceph-mon-2003:/var/log/ceph# ps -ef | grep ceph
root      6930     1 31 15:49 ?        00:00:00 /usr/bin/ceph-mon --cluster=ceph -i ceph-mon-2003 -f
root      6931     1  3 15:49 ?        00:00:00 python /usr/sbin/ceph-create-keys --cluster=ceph -i ceph-mon-2003

root@ceph-mon-2003:/var/log/ceph# ceph -s
2014-02-04 15:49:56.854866 7f9cf422d700  0 -- :/1007028 >> 10.30.66.15:6789/0 pipe(0x7f9cf0021370 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f9cf00215d0).fault

    cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
     health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2 pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1 ceph-mon-2001,ceph-mon-2002

     monmap e2: 3 mons at {ceph-mon-2001=10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0}, election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002

Suggestions?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
