Help Removing Failed Cephadm Daemon(s) - MDS Deployment Issue

"Poat, Michael" <mpoat@xxxxxxx> · Wed, 19 Jan 2022 19:58:01 +0000

Hello,

I am running Ceph Octopus in production on a CentOS 7 based cluster, used only for CephFS. I deployed it with Cephadm on docker-ce. I have 3 (intended) MDS/MON/MGR nodes plus 30 storage nodes with 4 OSDs each (120 OSDs total).

I have a few issues with Ceph Octopus that have led me to where I am now. I will try to be specific and avoid taking you far down the rabbit hole.

When I deployed the cluster, the 3 intended MDS/MON/MGR nodes (named cephmon01, cephmon02, cephmon03) were not available, so 3 of the OSD nodes (named osd16, osd17, osd18) took their place to run these services. osd{16..18} are running 4xOSD/1xMDS/1xMON/1xMGR, while the remaining osd{01..30} nodes are running 4xOSD only.

In an effort to migrate the MDS/MON/MGR services, specifically the MDS services, off the 3 OSD nodes and onto the cephmon{01..03} nodes, I tried deploying the MDS service onto the 3 additional nodes using the "ceph orch apply mds <hostname>" command, but nothing happened. I believe this was due to the limit of 3 MDSs, similar to the limit described at https://docs.ceph.com/en/octopus/cephadm/install/#deploy-additional-monitors-optional. In a test cluster I found that if I simply increase the number of MDS servers using 'ceph orch apply mds 6', it deploys 3 additional MDS services on 3 random OSD nodes and not on my cephmon{01..03} nodes, which is why I tried to deploy the service on a specific node. I did not try to increase the number of MDSs this way on my production cluster.
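
For reference, the Octopus cephadm documentation gives the MDS form as ceph orch apply mds <fs-name> --placement="<num-daemons> [<host1> ...]", so the first argument after "mds" is read as a file system name rather than as a host or a count. Below is a minimal sketch of that form. It only prints the command and does not contact the cluster. The file system name cephfs and the .local host names for cephmon01 and cephmon02 are assumptions, and mds_apply_cmd is a hypothetical helper, not part of Ceph.

```shell
#!/bin/sh
# Sketch only: builds and prints a "ceph orch apply mds" command line.
# Nothing is executed against the cluster.

# mds_apply_cmd <fs-name> <count> <host>...
# Hypothetical helper: the file system name comes first, and the daemon
# count plus the host list go inside a single --placement string.
mds_apply_cmd() {
    fs_name="$1"
    count="$2"
    shift 2
    # "$*" joins the remaining arguments (the hosts) with single spaces.
    printf 'ceph orch apply mds %s --placement="%s %s"\n' "$fs_name" "$count" "$*"
}

# Assumed names: file system "cephfs", hosts as they would appear in
# "ceph orch host ls".
mds_apply_cmd cephfs 3 cephmon01.local cephmon02.local cephmon03.local
```

Printing the command first makes it easy to check the quoting before pasting it into the cephadm shell, since the whole placement has to reach ceph as one argument.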

Additionally, I found that the host labels shown in 'ceph orch host ls' don't do anything, so going that route by adding labels to hosts appears to be a façade.
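
One possible explanation, going by the orchestrator documentation, is that labels are free-form and have no effect until a service placement refers to them with a "label:<name>" placement. Below is a minimal sketch of that pattern. It only prints the commands. The label name mds, the file system name cephfs, and the host names are assumptions, and label_placement_cmds is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: prints the commands for label-based placement.
# Nothing is executed against the cluster.

# label_placement_cmds <label> <fs-name> <host>...
# Hypothetical helper: one "host label add" per host, then one
# "apply mds" whose placement selects hosts by that label.
label_placement_cmds() {
    label="$1"
    fs_name="$2"
    shift 2
    for host in "$@"; do
        printf 'ceph orch host label add %s %s\n' "$host" "$label"
    done
    printf 'ceph orch apply mds %s --placement="label:%s"\n' "$fs_name" "$label"
}

# Assumed label "mds" and assumed host names.
label_placement_cmds mds cephfs cephmon01.local cephmon02.local cephmon03.local
```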

THE ISSUE: What I tried was a command like this (a similar form worked when deploying additional MGR servers):

ceph orch apply mds "cephmon03.local osd16.local osd17.local osd18.local"

The idea was to deploy the MDS service on 1 new machine and the 3 existing nodes. However, this failed miserably and I am now stuck with the following:

[ceph: root@osd16 /]# ceph -s
  cluster:
    id:     1111111111111111111111111111111111
    health: HEALTH_WARN
            2 failed cephadm daemon(s)

  services:
    mon: 5 daemons, quorum osd16.local osd17,osd18,osd19,osd20 (age 15h)
    mgr: cephmon03.xjyrgm(active, since 5w), standbys: osd18.voydwg, osdl17.tqyxxn, osd16.nkzhik
    mds: cephfs:1 {0=cephfs.osd16.dlircu=up:active} 2 up:standby
    osd: 120 osds: 120 up (since 2h), 120 in (since 2w)

[ceph: root@osd16 /]# ceph orch ls mds
NAME                                                     RUNNING  REFRESHED  AGE  PLACEMENT  IMAGE NAME               IMAGE ID
mds.cephfs                                                   3/3  3m ago     5M   count:3    docker.io/ceph/ceph:v15  2cf504fded39
mds.cephmon03.local osd16.local osd17.local osd18.local      0/2  3m ago     5w   count:2    docker.io/ceph/ceph:v15  <unknown>

[ceph: root@osd16 /]# ceph orch ps
NAME                                                                  HOST             STATUS        REFRESHED  AGE  VERSION    IMAGE NAME                            IMAGE ID      CONTAINER ID
...
mds.cephmon03.local osd16.local osd17.local osd18.local.osd14.pdgelh  osd14.local      error         4m ago     6d   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>
mds.cephmon03.local osd16.local osd17.local osd18.local.osd26.zdlarv  osd26.local      error         3m ago     6d   <unknown>  docker.io/ceph/ceph:v15               <unknown>     <unknown>
...

If I try to do something like ceph orch daemon rm 'mds.cephmon03.local osd16.local osd17.local osd18.local.osd14.pdgelh' (with or without --force), the daemon just reappears on a different host when I check back with ceph orch ps.

For starters, any idea how to remove the failed MDS deployment? It looks like, instead of adding nodes to the existing mds.cephfs service, the command created a second, parallel MDS deployment.
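
A possible reading of the behaviour above: ceph orch daemon rm removes a single daemon, while the service created by the apply command still requests count:2, so cephadm schedules a replacement elsewhere. If that is right, the service itself would need to be removed with ceph orch rm. Below is a minimal sketch that only prints the commands. It assumes ceph orch rm accepts the service name exactly as ceph orch ls prints it, spaces included, which is untested here. remove_service_cmds is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: prints the commands to drop a stray service and then
# re-check the MDS services. Nothing is executed against the cluster.

# remove_service_cmds <service-name>
# Hypothetical helper: the name is single-quoted because it contains
# spaces and must reach ceph as one argument.
remove_service_cmds() {
    service_name="$1"
    printf "ceph orch rm '%s'\n" "$service_name"
    printf 'ceph orch ls mds\n'
}

# Service name copied from the NAME column of "ceph orch ls mds" above.
remove_service_cmds 'mds.cephmon03.local osd16.local osd17.local osd18.local'
```

If the removal works, the second command should list only mds.cephfs, and the two errored daemons should disappear from ceph orch ps after the next refresh.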

Thanks,
-Michael
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx