Re: what's the best way to stop an MDS?

+dev@xxxxxxx

On Wed, Jun 19, 2019 at 11:18 AM Rishabh Dave <ridave@xxxxxxxxxx> wrote:
>
> Hi all,
>
> I am working on a ceph-ansible playbook[1] that removes an MDS from an
> already deployed Ceph cluster. Going through documentation and
> ceph-ansible codebase I found out 3 ways to stop an MDS -
>
> * ceph mds fail <mds-name> && rm -rf /var/lib/ceph/mds/ceph-{id} [2]
> * systemctl stop ceph-mds@$HOSTNAME
> * ceph tell mds.x exit
>
> How do these 3 ways compare to each other? I ran these commands on a
> ceph-ansible-deployed cluster and all 3 had the very same effect. Is
> any one of these better than the rest?

The first one doesn't cause the MDS process to exit. I would suggest
the systemd approach, since systemd may restart a daemon that exits
normally (as it would with the third approach).
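
For example, a minimal sketch on the MDS host (assuming the systemd
instance name matches the MDS id, which is the hostname on a default
ceph-ansible deployment, as in your second command):

    # stop the daemon and keep it from starting again on reboot
    systemctl stop ceph-mds@<id>
    systemctl disable ceph-mds@<id>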

> What about "ceph mds rm" and "ceph mds rmfailed"? The first time I was

Those are dev commands not meant for this purpose.

> looking for various ways to stop an MDS, I tried "ceph mds fail
> <mds-name> && ceph mds rm <global-id>" and it did not work since "ceph
> mds rm" requires an MDS to inactive[3]. Is there a way to render an
> MDS inactive? I couldn't find one.
>
> I also tried "ceph mds fail <mds-name> && ceph mds rmfailed
> <mds-rank>" but this did not stop MDS. It only changed MDS's state to
> 'standby" -
>
> (teuth-venv) $ ./bin/ceph fs dump | grep -A 1 standby_count_wanted 2> /dev/null
> dumped fsmap epoch 4
> standby_count_wanted    0
> 4232:   [v2:192.168.0.217:6826/2113356090,v1:192.168.0.217:6827/2113356090]
> 'a' mds.0.3 up:active seq 4
> (teuth-venv) $ ./bin/ceph mds fail a 2> /dev/null && ./bin/ceph mds
> rmfailed --yes-i-really-mean-it 0 2> /dev/null && ./bin/ceph fs dump |
> grep -A 3 Standby 2> /dev/null
> dumped fsmap epoch 6
> Standby daemons:
>
> 4286:   [v2:192.168.0.217:6826/401505106,v1:192.168.0.217:6827/401505106]
> 'a' mds.-1.0 up:standby seq 1
> (teuth-venv) $
>
> Also, I find the usage of "remove" in this doc[2] ambiguous -- it can
> mean removing an MDS from the cluster by changing its state to standby,
> or it can mean killing/stopping it altogether. Reading [2], my
> impression was that it meant killing/stopping it, but "remove" is also
> used to describe the "ceph mds rm" and "ceph mds rmfailed" commands. Of
> these, at least "ceph mds rmfailed" does not stop the MDS. If I am not
> the only one who finds this ambiguous, I'll go ahead and change the
> docs accordingly.

[2] is not really useful documentation, unfortunately. The best way to
stop an MDS when you want to permanently remove the daemon is to just
have the service manager (systemd) stop it. The only other
consideration is whether you have a replacement MDS available to take
over (if the operator even wants that to happen).
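
As a rough sketch of that check and the cleanup afterwards (the data
directory path is an assumption based on a default deployment):

    # see whether a standby daemon is listed in the FSMap
    ceph fs dump | grep -i standby
    # once systemd has stopped the daemon, optionally remove its data directory
    rm -rf /var/lib/ceph/mds/ceph-<id>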

> [1] https://github.com/ceph/ceph-ansible/pull/4083
> [2] http://docs.ceph.com/docs/master/cephfs/add-remove-mds/
> [3] http://docs.ceph.com/docs/master/man/8/ceph/



-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


