Re: After reboot nothing worked


 



On 17/12/13 13:36, Umar Draz wrote:
Hi Joao,

Thanks for this valuable information. OK, another problem: I want to
remove the mon host from the cluster. Here is my mon dump output:

root@vms2:~# ceph mon dump
dumped monmap epoch 1
epoch 1
fsid 6ce085b5-1747-46f6-9fda-a3f1e8c75beb
last_changed 0.000000
created 0.000000
0: 192.168.1.128:6789/0 mon.vms1
1: 192.168.1.129:6789/0 mon.vms2

I tried to remove mon.vms2 from the cluster following this document
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/
but again it did not work.

root@vms2:~# service ceph -a stop mon.vms2
/etc/init.d/ceph: mon.vms2 not found (/etc/ceph/ceph.conf defines ,
/var/lib/ceph defines )

root@vms2:/etc/ceph# ceph mon remove mon.vms2
mon mon.vms2 does not exist or has already been removed


Once you stop that monitor, the last remaining monitor can no longer reach it, so the cluster loses quorum and you are therefore unable to talk to the cluster.

I can see how you bumped into that, though. The docs should make it clear that the presented order only works on a cluster with 3 or more monitors.

Try removing the monitor first and then stopping it. Since you have it managed by upstart, I would suggest first removing ceph-mon from upstart to make sure it is not restarted straight away, which I believe would otherwise lead to http://tracker.ceph.com/issues/6789

So, this would be the order I'd try (a rough command sketch follows the list):

1. remove ceph-mon from upstart or whatever to avoid it being restarted once it kills itself on the next step
2. 'ceph mon remove mon.vms2'
3. make sure ceph-mon with id mon.vms2 is not running
4. run 'ceph -s' to make sure it works and 'mon.vms2' is not in the quorum.
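
A rough sketch of those steps in commands (the mon data directory path,
the 'upstart' marker trick and the exact id forms are assumptions based
on a default ceph-deploy/upstart layout, so adjust them to whatever your
setup actually uses):

# 1. keep upstart from starting the monitor again on its own; on a
#    ceph-deploy style layout one way is to move the 'upstart' marker
#    out of the mon data directory
root@vms2:~# mv /var/lib/ceph/mon/ceph-vms2/upstart /var/lib/ceph/mon/ceph-vms2/upstart.disabled

# 2. remove the monitor from the monmap while both monitors still form a quorum
root@vms2:~# ceph mon remove mon.vms2
#    (if it keeps reporting the monitor does not exist, try the bare id:
#     ceph mon remove vms2)

# 3. make sure no ceph-mon with that id is still running; stop it if
#    upstart brought it back
root@vms2:~# status ceph-mon id=vms2
root@vms2:~# stop ceph-mon id=vms2

# 4. confirm the cluster still answers and vms2 is no longer in the quorum
root@vms2:~# ceph -s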

  -Joao


Br.

Umar



On Tue, Dec 17, 2013 at 6:18 PM, Karan Singh <ksingh@xxxxxx> wrote:

    Thanks Joao for information.

    Many Thanks
    Karan Singh


    ----- Original Message -----
    From: "Joao Eduardo Luis" <joao.luis@xxxxxxxxxxx
    <mailto:joao.luis@xxxxxxxxxxx>>
    To: ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
    Sent: Tuesday, 17 December, 2013 2:56:23 PM
    Subject: Re:  After reboot nothing worked

    On 12/17/2013 09:54 AM, Karan Singh wrote:
     > Umar
     >
     > *Ceph is stable for production*; there are a large number of ceph
     > clusters deployed and running smoothly in PRODUCTION and countless
     > in testing / pre-production.
     >
     > Since you are facing problems with your ceph testing, it does not
     > mean CEPH is unstable.
     >
     > I would suggest putting some time into troubleshooting your problem.
     >
     > What I see from your logs --
     >
     >   1) you have 2 mons, that's a problem (either have 1 or have 3 to
     > form a quorum). Add 1 more monitor node

    Just to clarify this point a bit, one doesn't need an odd number of
    monitors in a ceph cluster to reach quorum.  This is a common
    misconception.

    The requirement to reach quorum is simply to have a majority of monitors
    able to talk to each other.  If one has 2 monitors and both are able to
    talk to each other they'll be able to form a quorum.

    Odd numbers are advised, however, because one can tolerate as many
    failures with less infrastructure. E.g.,

    - for n = 1, failure of 1 monitor means loss of quorum
    - for n = 2, failure of 1 monitor means loss of quorum
    - for n = 3, failure of 1 monitor is okay; failure of 2 monitors means
    loss of quorum
    - for n = 4, failure of 1 monitor is okay; failure of 2 monitors means
    loss of quorum
    - for n = 5, failure of 2 monitors is okay; failure of 3 monitors means
    loss of quorum
    - for n = 6, failure of 2 monitors is okay; failure of 3 monitors means
    loss of quorum

    etc.
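
    In other words, a quorum just needs a strict majority of the monitors
    up, i.e. floor(n/2) + 1 of them, so a cluster of n monitors tolerates
    floor((n-1)/2) failures. A one-liner to reproduce the pattern above,
    if you want to check it:

    $ for n in 1 2 3 4 5 6; do echo "n=$n: needs $((n/2 + 1)) up, tolerates $(( (n-1)/2 )) down"; done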

    So you can see how you don't get any benefit, from an availability
    perspective, from having 2, 4 or 6 monitors compared to having 1, 3
    or 5.  If your target, however, is replication, then 2 is better
    than 1.

        -Joao



    --
    Joao Eduardo Luis
    Software Engineer | http://inktank.com | http://ceph.com




--
Umar Draz
Network Architect


--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



