Re: Help Removing Failed Cephadm Daemon(s) - MDS Deployment Issue

Hi Michael,

To clarify a bit further: "ceph orch rm" removes services and "ceph orch
daemon rm" removes daemons. In the command you ran

[ceph: root@osd16 /]# ceph orch rm "mds.cephmon03.local osd16.local osd17.local osd18.local.onl26.drymjr"

the name you've given is a daemon name (note the ".drymjr" at the end). To
remove the service, you must run "ceph orch rm" with the service name exactly
as it appears in the "ceph orch ls" output (whereas what you see in "ceph
orch ps" output are daemon names). Otherwise cephadm won't be able to find
the service and therefore can't delete it.

Furthermore, as long as the service is still present, cephadm will keep
attempting to place mds daemons until the placement matches the service spec.
For this service the spec is just count:2, so it will try to make sure there
are always 2 mds daemons placed, and it will therefore replace the daemons
you removed with the "ceph orch daemon rm . . ." command. You can stop
cephadm from doing this for a specific service by setting the unmanaged field
to True (see
https://docs.ceph.com/en/latest/cephadm/services/#ceph.deployment.service_spec.ServiceSpec.unmanaged),
which would let you remove the daemons without them being replaced, but it
won't get rid of the service itself, so I'd recommend continuing to try the
"ceph orch rm" command. If the service name you give matches what is shown in
the "ceph orch ls" output, it should remove the service properly.

- Adam

On Thu, Jan 20, 2022 at 12:09 PM Poat, Michael <mpoat@xxxxxxx> wrote:

> Hello Adam et al.,
>
>
>
> Thank you for the reply and suggestions. I will work on deploying the
> additional services using .yaml and the instructions you suggested. First I
> need to get rid of these two stuck daemons. The 'ceph orch rm' command
> fails to remove the service. If I use 'ceph orch daemon rm', the daemon
> gets removed but reappears shortly afterwards on another host. Adding
> --force doesn't change the outcome either.
>
> Initial state; notice one daemon is running on cephmon03 and the other on
> osd26:
>
> [ceph: root@osd16 /]# ceph orch ps | grep error
>
> mds.cephmon03.local osd16.local osd17.local osd18.local.cephmon03.talhqb  cephmon03.local  error  3m ago  36m  <unknown>  docker.io/ceph/ceph:v15  <unknown>  <unknown>
>
> mds.cephmon03.local osd16.local osd17.local osd18.local.osd26.drymjr  osd26.local  error  3m ago  46m  <unknown>  docker.io/ceph/ceph:v15  <unknown>  <unknown>
>
>
>
> Trying Adam’s suggestion:
>
> [ceph: root@osd16 /]# ceph orch rm "mds.cephmon03.local osd16.local
> osd17.local osd18.local.onl26.drymjr"
>
> Failed to remove service. <mds.cephmon03.local osd16.local osd17.local
> osd18.local.osd26.drymjr> was not found.
>
>
>
> Service gets removed:
>
> [ceph: root@osd16 /]# ceph orch daemon rm "mds.cephmon03.local
> osd16.local osd17.local osd18.local.osd26.drymjr"
>
> Removed mds.cephmon03.local osd16.local osd17.local
> osd18.local.osd26.drymjr from host 'osd26.local'
>
>
>
> Ceph will stay in this state with only 1 failed daemon for a while; if I
> remove the 2nd one, they both come back:
>
> [ceph: root@osd16 /]# ceph orch ps | grep error
>
> mds.cephmon03.local osd16.local osd17.local osd18.local.cephmon03.talhqb  cephmon03.local  error  4m ago  37m  <unknown>  docker.io/ceph/ceph:v15  <unknown>  <unknown>
>
>
>
> Removing the 2nd daemon:
>
> [ceph: root@osd16 /]# ceph orch daemon rm "mds.cephmon03.local
> osd16.local osd17.local osd18.local.cephmon03.talhqb"
>
> Removed mds.cephmon03.local osd16.local osd17.local
> osd18.local.cephmon03.talhqb from host 'cephmon03.local'
>
>
>
> After a few minutes... notice one daemon is on cephmon03 and the other on
> osd30. This seems random:
>
> [ceph: root@osd16 /]# ceph orch ps | grep error
>
> mds.cephmon03.local osd16.local osd17.local osd18.local.cephmon03.pwtvcw  cephmon03.local  error  53s ago  12m  <unknown>  docker.io/ceph/ceph:v15  <unknown>  <unknown>
>
> mds.cephmon03.local osd16.local osd17.local osd18.local.osd30.ythzbh  osd30.local  error  40s ago  15m  <unknown>  docker.io/ceph/ceph:v15  <unknown>  <unknown>
>
>
>
> Any further suggestions would be helpful.
>
>
>
> Thanks,
> -Michael
>
>
>
> *From:* Adam King <adking@xxxxxxxxxx>
> *Sent:* Wednesday, January 19, 2022 3:37 PM
> *To:* Poat, Michael <mpoat@xxxxxxx>
> *Cc:* ceph-users <ceph-users@xxxxxxx>
> *Subject:* Re:  Help Removing Failed Cephadm Daemon(s) - MDS
> Deployment Issue
>
>
>
> Hello Michael,
>
>
>
> If you're trying to remove all the mds daemons in this mds service
> "cephmon03.local osd16.local osd17.local osd18.local", I think the command
> would be "ceph orch rm "mds.cephmon03.local osd16.local osd17.local
> osd18.local"" (note the quotes around that mds.cephmon . . . name, since
> cephadm took it as the service id rather than the placement you intended;
> I'd check "ceph orch apply mds -h" to see the order it takes its args).
> Also, as for getting the mds or any other daemons where you want them, I'd
> recommend taking a look at
> https://docs.ceph.com/en/pacific/cephadm/services/#service-specification
> and the subsequent
> https://docs.ceph.com/en/pacific/cephadm/services/#daemon-placement
> (it's pacific docs rather than octopus but most of it should still apply).
> Basically, cephadm works in a declarative fashion: if you tell it to apply
> mds to host1 and then tell it to apply mds to host2, it will ONLY put the
> mds on host2 in the end, as the second command overwrites the first. If you
> want the mds on both host1 and host2, you need to say so in the placement
> up front. The easiest way to do that is to create a yaml spec and apply it
> as outlined in the linked documentation (it can be done with the cli
> directly as well, but it's trickier to format the placement correctly, so I
> recommend yaml).
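>
> For example, a minimal spec along these lines (saved to a file such as
> mds.yaml and applied with "ceph orch apply -i mds.yaml") should pin the
> cephfs mds daemons to specific hosts. I'm guessing at the exact hostnames
> here, so substitute whatever "ceph orch host ls" shows for the nodes you
> actually want:
>
> service_type: mds
> service_id: cephfs
> placement:
>   hosts:
>     - cephmon01.local
>     - cephmon02.local
>     - cephmon03.local
>
> Applying that should update the placement of the existing mds.cephfs
> service from count:3 to those three hosts.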
>
>
>
> - Adam
>
>
>
> On Wed, Jan 19, 2022 at 2:58 PM Poat, Michael <mpoat@xxxxxxx> wrote:
>
> Hello,
>
> I am running Ceph Octopus in a production cluster on CentOS 7, used only
> for CephFS. I used cephadm for deployment w/ docker-ce. I have 3 (intended)
> MDS/MON/MGR nodes + 30 storage nodes with 4 OSDs each (120 OSDs total).
>
> I have a few issues with Ceph Octopus that have led me to where I am now.
> I will try to be specific and avoid taking you far down the rabbit hole.
>
> When I deployed the cluster, the 3 intended MDS/MON/MGR nodes (named
> cephmon01, cephmon02, cephmon03) were not available, so 3 of the OSD nodes
> took their place running these services (names: osd16, osd17, osd18).
> osd{16..18} are running 4xOSD/1xMDS/1xMON/1xMGR, while the remaining OSD
> nodes are running 4xOSD only. In an effort to migrate the MDS/MON/MGR
> services, specifically the MDS services, off the 3 OSD nodes and onto the
> 'cephmon{01..03}' nodes, I tried deploying the MDS service onto the 3
> additional nodes using the "ceph orch apply mds *<hostname>*" command, but
> nothing happens. I believe this was due to the default of 3 MDS daemons,
> similar to what is described at
> https://docs.ceph.com/en/octopus/cephadm/install/#deploy-additional-monitors-optional.
> In a test cluster I found that if I simply increase the number of MDS
> daemons using 'ceph orch apply mds 6', it will deploy 3 additional MDS
> daemons on 3 random OSD nodes and not on my cephmon{01..03} nodes, hence
> the reason I tried to deploy the service on a specific node. I didn't try
> to increase the number of MDS daemons this way on my production cluster.
>
> Additionally, I found that the host labels in 'ceph orch host ls' don't do
> anything by themselves, so going that route by adding labels to hosts
> turned out to be a façade.
>
> THE ISSUE: What I tried was a command like this (the same approach worked
> for deploying additional MGR daemons):
>
> ceph orch apply mds "cephmon03.local osd16.local osd17.local osd18.local"
>
> The idea was to deploy the MDS service on 1 new machine and 3 existing
> nodes... however, this failed miserably and I am now stuck with the
> following:
>
> [ceph: root@osd16 /]# ceph -s
>   cluster:
>     id:     1111111111111111111111111111111111
>     health: HEALTH_WARN
>             2 failed cephadm daemon(s)
>
>   services:
>     mon: 5 daemons, quorum osd16.local osd17,osd18,osd19,osd20 (age 15h)
>     mgr: cephmon03.xjyrgm(active, since 5w), standbys: osd18.voydwg,
> osdl17.tqyxxn, osd16.nkzhik
>     mds: cephfs:1 {0=cephfs.osd16.dlircu=up:active} 2 up:standby
>     osd: 120 osds: 120 up (since 2h), 120 in (since 2w)
>
> [ceph: root@osd16 /]# ceph orch ls mds
> NAME                                                      RUNNING  REFRESHED  AGE  PLACEMENT  IMAGE NAME               IMAGE ID
> mds.cephfs                                                    3/3  3m ago     5M   count:3    docker.io/ceph/ceph:v15  2cf504fded39
> mds.cephmon03.local osd16.local osd17.local osd18.local      0/2  3m ago     5w   count:2    docker.io/ceph/ceph:v15  <unknown>
>
> [ceph: root@osd16 /]# ceph orch ps
> NAME                                                                    HOST             STATUS  REFRESHED  AGE  VERSION    IMAGE NAME               IMAGE ID   CONTAINER ID
> ...
> mds.cephmon03.local osd16.local osd17.local osd18.local.osd14.pdgelh   osd14.local      error   4m ago     6d   <unknown>  docker.io/ceph/ceph:v15  <unknown>  <unknown>
> mds.cephmon03.local osd16.local osd17.local osd18.local.osd26.zdlarv   osd26.local      error   3m ago     6d   <unknown>  docker.io/ceph/ceph:v15  <unknown>  <unknown>
> ...
>
> If I try something like ceph orch daemon rm 'mds.cephmon03.local
> osd16.local osd17.local osd18.local.osd14.pdgelh' (with or without
> --force), the daemon will just reappear on a different host when I check
> back with ceph orch ps.
>
> For starters, any idea how to undo this failed MDS deployment? It looks
> like, instead of adding additional nodes to the existing mds.cephfs
> service, it created a 2nd, parallel MDS deployment.
>
>
> Thanks,
> -Michael
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



