Re: Ceph MON can no longer join quorum

Hi Karan,

I resolved it the same way you did. It appears that a network partition caused the MON to die.

I'm running 0.72.1.

It would be nice if redeploying weren't the solution, but if it's simply cleaner to do so, I will continue along that route.

I think what's more troubling is that when this occurred we lost all connectivity to the Ceph cluster.


On Wed, Feb 5, 2014 at 1:11 AM, Karan Singh <ksingh@xxxxxx> wrote:

Hi Greg


I have seen this problem before in my cluster.


  • Which Ceph version are you running?
  • Did you make any recent change to the cluster that could have caused this problem?


You identified it correctly: the only problem is that ceph-mon-2003 is listening on the wrong port. It should listen on port 6789, like the other two monitors. I resolved it by cleanly removing the affected monitor node and adding it back to the cluster.


Regards

Karan
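
The remove-and-re-add approach described above can be sketched as an ordered command list. This is only a dry-run planner: it prints commands and runs nothing. The sequence follows the manual add/remove monitor procedure from the Ceph documentation as I recall it, not necessarily the exact steps Karan ran. The Upstart `stop`/`start` form, the temporary file names, and the `.bak` backup step are assumptions.

```python
def readd_monitor_plan(mon_id, addr, tmp_dir="/tmp", cluster="ceph"):
    """Return the shell commands, in order, for removing and re-adding a monitor.

    Nothing is executed. Paths and file names are illustrative assumptions.
    """
    data_dir = f"/var/lib/ceph/mon/{cluster}-{mon_id}"
    keyring = f"{tmp_dir}/mon.keyring"   # hypothetical temp file name
    monmap = f"{tmp_dir}/monmap"         # hypothetical temp file name
    return [
        f"stop ceph-mon id={mon_id}",            # Upstart job, assumed from the thread
        f"ceph mon remove {mon_id}",             # drop the stale entry from the monmap
        f"mv {data_dir} {data_dir}.bak",         # keep the old store instead of deleting it
        f"mkdir {data_dir}",
        f"ceph auth get mon. -o {keyring}",      # fetch the monitor keyring
        f"ceph mon getmap -o {monmap}",          # fetch the current monmap
        f"ceph-mon -i {mon_id} --mkfs --monmap {monmap} --keyring {keyring}",
        f"ceph mon add {mon_id} {addr}",         # register the monitor on the intended port
        f"start ceph-mon id={mon_id}",
    ]


for cmd in readd_monitor_plan("ceph-mon-2003", "10.30.66.15:6789"):
    print(cmd)
```

Review each command against the documentation for your Ceph release before running anything, since the procedure has changed between versions.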



From: "Greg Poirier" <greg.poirier@xxxxxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx
Sent: Tuesday, 4 February, 2014 10:50:21 PM
Subject: [ceph-users] Ceph MON can no longer join quorum


I have a MON that at some point lost connectivity to the rest of the cluster and now cannot rejoin.

Each time I restart it, it looks like it's attempting to create a new MON and join the cluster, but the rest of the cluster rejects it, because the new one isn't in the monmap.

I don't know why it suddenly decided it needed to be a new MON.

I am not really sure where to start. 

root@ceph-mon-2003:/var/log/ceph# ceph -s
    cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
     health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2 pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1 ceph-mon-2001,ceph-mon-2002
     monmap e2: 3 mons at {ceph-mon-2001=10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0}, election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002

Notice that ceph-mon-2003 is registered on port 6800 rather than 6789.
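
To spot this kind of mismatch without reading the monmap by eye, a small parser can flag any monitor that is not registered on the expected port. This is a minimal sketch that assumes the `monmap eN: ... {name=ip:port/nonce,...}` line format shown in the `ceph -s` output above. The function names are made up for illustration.

```python
import re

DEFAULT_MON_PORT = 6789  # port used by the two healthy monitors in this thread


def parse_monmap(line):
    """Return {mon_name: (ip, port)} parsed from a 'monmap eN: ...' status line."""
    body = re.search(r"\{(.*?)\}", line)
    if body is None:
        return {}
    mons = {}
    for entry in body.group(1).split(","):
        name, addr = entry.split("=", 1)
        hostport = addr.split("/", 1)[0]   # drop the trailing "/0" nonce
        ip, port = hostport.rsplit(":", 1)
        mons[name] = (ip, int(port))
    return mons


def nonstandard_mons(line, expected_port=DEFAULT_MON_PORT):
    """Return the sorted names of monitors registered on an unexpected port."""
    return sorted(
        name for name, (_, port) in parse_monmap(line).items()
        if port != expected_port
    )


monmap_line = (
    "monmap e2: 3 mons at {ceph-mon-2001=10.30.66.13:6789/0,"
    "ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0}, "
    "election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002"
)
print(nonstandard_mons(monmap_line))  # ['ceph-mon-2003']
```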

If I try to start ceph-mon-all, it will be listening on some other port...

root@ceph-mon-2003:/var/log/ceph# start ceph-mon-all
ceph-mon-all start/running
root@ceph-mon-2003:/var/log/ceph# ps -ef | grep ceph
root      6930     1 31 15:49 ?        00:00:00 /usr/bin/ceph-mon --cluster=ceph -i ceph-mon-2003 -f
root      6931     1  3 15:49 ?        00:00:00 python /usr/sbin/ceph-create-keys --cluster=ceph -i ceph-mon-2003

root@ceph-mon-2003:/var/log/ceph# ceph -s
2014-02-04 15:49:56.854866 7f9cf422d700  0 -- :/1007028 >> 10.30.66.15:6789/0 pipe(0x7f9cf0021370 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f9cf00215d0).fault
    cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
     health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2 pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1 ceph-mon-2001,ceph-mon-2002
     monmap e2: 3 mons at {ceph-mon-2001=10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0}, election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002
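
The `.fault` line above shows the client trying 10.30.66.15:6789 and failing, which is consistent with the monitor listening on a different port. A quick way to confirm which port actually accepts connections is a plain TCP probe. This is a generic sketch using the Python standard library, not a Ceph tool, and `is_listening` is a hypothetical helper name.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


# Example use on the monitor host from this thread (not run here):
#   is_listening("10.30.66.15", 6789)
#   is_listening("10.30.66.15", 6800)
```

A successful connection only shows that something is accepting TCP on that port. It does not prove the listener is a healthy ceph-mon.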

Suggestions?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

