Re: [ext] Re: Rename / change host names set with `ceph orch host add`

Hey Adam,

thanks again for your help.

I finally got around to running your suggested procedure. It went mostly fine,
except that when I renamed the last host, I ended up with a rogue mon.
I assume a new mon was created on a different host while the last one was "out" of cephadm,
and the remaining mon on the last host is now not cleaned up by cephadm (maybe because it is considered legacy).
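In hindsight, I suppose pinning the mons to explicit hosts before removing/re-adding the last host might have avoided this. Just a sketch of what I have in mind, not something I actually ran (host names only as an example):

# check whether the mon service is placed by count or by hosts/label
ceph orch ls mon --export
# pin the mons to explicit hosts for the duration of the rename
ceph orch apply mon --placement="osd-mirror-1,osd-mirror-2,osd-mirror-3"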

I got the following warnings (relevant sections):
[WRN] CEPHADM_APPLY_SPEC_FAIL: Failed to apply 1 service(s): osd.all-available-devices
    osd.all-available-devices: host osd-mirror-1 `cephadm ceph-volume` failed: cephadm exited with an error code: 1, stderr:Inferring config /var/lib/ceph/7efa00f9-182f-40f4-9136-d51895db1f0b/mon.osd-mirror-1/config
ERROR: [Errno 2] No such file or directory: '/var/lib/ceph/7efa00f9-182f-40f4-9136-d51895db1f0b/mon.osd-mirror-1/config'
[WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
    host osd-mirror-1 `cephadm ceph-volume` failed: cephadm exited with an error code: 1, stderr:Inferring config /var/lib/ceph/7efa00f9-182f-40f4-9136-d51895db1f0b/mon.osd-mirror-1/config
ERROR: [Errno 2] No such file or directory: '/var/lib/ceph/7efa00f9-182f-40f4-9136-d51895db1f0b/mon.osd-mirror-1/config'

The mon is neither running nor required anymore. It is no longer listed via `systemctl`, `ceph orch ps`, or `ceph status`:
mon:           3 daemons, quorum osd-mirror-3,osd-mirror-2,osd-mirror-6 (age 3m)

But cephadm is still aware of the late daemon and apparently trying to use it. From `cephadm ls`:
    {
        "style": "legacy",
        "name": "mon.osd-mirror-1",
        "fsid": "7efa00f9-182f-40f4-9136-d51895db1f0b",
        "systemd_unit": "ceph-mon@osd-mirror-1",
        "enabled": false,
        "state": "unknown",
        "host_version": "15.2.14"
    },

Tried to remove it (which didn't help):
cephadm rm-daemon --name mon.osd-mirror-1 --fsid 7efa00f9-182f-40f4-9136-d51895db1f0b --force --force-delete-data

And then I figured out that cephadm is not supposed to remove legacy daemons:
https://tracker.ceph.com/issues/45976
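My assumption is that `cephadm ls` still picks the daemon up from leftover on-disk state on that host rather than from the cluster itself. So I suppose the next step is to look for remnants along these lines (the legacy data directory path is only my guess):

# see what cephadm still reports for the old mon
cephadm ls | grep -B 2 -A 6 'mon.osd-mirror-1'
# my guess: a leftover legacy data directory such as
ls -ld /var/lib/ceph/mon/ceph-osd-mirror-1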

Also tried some manual removal without success:
0|0[root@osd-mirror-1 ~]# service ceph -a stop mon.osd-mirror-1
The service command supports only basic LSB actions (start, stop, restart, try-restart, reload, force-reload, status). For other actions, please try to use systemctl.
0|0[root@osd-mirror-1 ~]# ceph mon remove osd-mirror-1
mon.osd-mirror-1 does not exist or has already been removed
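Since the `service` wrapper only supports the basic LSB actions, I suppose the systemd equivalents (using the unit name from the `cephadm ls` output above) would be something like:

systemctl stop ceph-mon@osd-mirror-1
systemctl disable ceph-mon@osd-mirror-1
systemctl status ceph-mon@osd-mirror-1

But given the unit is already disabled and in state "unknown", I doubt this alone removes whatever cephadm has recorded.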

What other options do I have to remove this daemon?
I.e., how do I remove the leftover information cephadm keeps that makes it think the mon is still available?

Thanks again for all your help.

Best, Mathias

On 5/20/2022 5:16 PM, Adam King wrote:
To clarify a bit, "ceph orch host rm <hostname> --force" won't actually touch any of the daemons on the host. It just stops cephadm from managing the host, i.e. it won't add or remove daemons on it. If you remove the host and then re-add it with the new host name, nothing should actually happen to the daemons there.

The only possible exception is if you have services whose placement uses count and one of the daemons from that service is on the host being temporarily removed; it's possible cephadm could try to deploy that daemon on another host in the interim. However, OSDs are never placed like that, so there would never be any need for flags like noout or nobackfill. The worst case would be it moving a mon or mgr around. If you make sure all the important services are deployed by label, explicit hosts etc. (just not count), then there should be no risk of any daemons moving at all, and this is a pretty safe operation.
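As a rough example (the label name is arbitrary here, adjust to your setup), switching a count-based mon service to label-based placement would look something like this:

# label the hosts that should run mons, using the names shown by "ceph orch host ls"
ceph orch host label add osd-mirror-2 mon
ceph orch host label add osd-mirror-3 mon
ceph orch host label add osd-mirror-6 mon
# then place mons by that label instead of by count
ceph orch apply mon --placement="label:mon"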

On Fri, May 20, 2022 at 3:36 AM Kuhring, Mathias <mathias.kuhring@xxxxxxxxxxxxxx> wrote:

Hey Adam,

thanks for your fast reply.

That's a bit more invasive and risky than I was hoping for.
But if this is the only way, I guess we need to do this.

Would it be advisable to put some maintenance flags like noout, nobackfill, norebalance?
And maybe stop the ceph target on the host I'm re-adding to pause all daemons?

Best, Mathias

On 5/19/2022 8:14 PM, Adam King wrote:
cephadm just takes the hostname given in the "ceph orch host add" command and assumes it won't change. The FQDN names (or whatever "ceph orch host ls" shows in any scenario) come from whatever input was given in those commands. Cephadm will even try to verify that the hostname matches what is given when adding the host. As for where it is stored, we keep that info in the mon key store and it isn't meant to be manually updated (ceph config-key get mgr/cephadm/inventory). That said, there have occasionally been people running into issues related to a mismatch between an FQDN and a shortname.

There's no built-in command for changing a hostname because of the expectation that it won't change. However, you should be able to fix this by removing and re-adding the host, e.g. "ceph orch host rm osd-mirror-1.our.domain.org" followed by "ceph orch host add osd-mirror-1 172.16.62.22 --labels rgw --labels osd". If you're on a late enough version that it requires you to drain the host before it will remove it (that was some pacific dot release, I don't remember which one), you can pass --force to the host rm command. Generally it's not a good idea to remove hosts from cephadm's control while there are still cephadm-deployed daemons on them, but this is a special case. Anyway, removing and re-adding the host is the only (reasonable) way I can remember to change what it has stored for the hostname.
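Roughly, the whole sequence for one host would then be (names, address and labels taken from your "ceph orch host ls" output):

# remove the host under its old (FQDN) name -- the daemons stay in place
ceph orch host rm osd-mirror-1.our.domain.org    # add --force if your version requires draining first
# re-add it under the bare hostname with the same address and labels
ceph orch host add osd-mirror-1 172.16.62.22 --labels rgw --labels osd
# confirm the new name is what cephadm now has on record
ceph orch host ls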

Let me know if that doesn't work,
 - Adam King

On Thu, May 19, 2022 at 1:41 PM Kuhring, Mathias <mathias.kuhring@xxxxxxxxxxxxxx> wrote:
Dear ceph users,

one of our clusters is complaining about plenty of stray hosts and daemons. Pretty much all of them.

[WRN] CEPHADM_STRAY_HOST: 6 stray host(s) with 280 daemon(s) not managed by cephadm
    stray host osd-mirror-1 has 47 stray daemons: ['mgr.osd-mirror-1.ltmyyh', 'mon.osd-mirror-1', 'osd.1', ...]
    stray host osd-mirror-2 has 46 stray daemons: ['mon.osd-mirror-2', 'osd.0', ...]
    stray host osd-mirror-3 has 48 stray daemons: ['cephfs-mirror.osd-mirror-3.qzcuvv', 'mgr.osd-mirror-3', 'mon.osd-mirror-3', 'osd.101', ...]
    stray host osd-mirror-4 has 47 stray daemons: ['mds.cephfs.osd-mirror-4.omjlxu', 'mgr.osd-mirror-4', 'osd.103', ...]
    stray host osd-mirror-5 has 46 stray daemons: ['mgr.osd-mirror-5', 'osd.139', ...]
    stray host osd-mirror-6 has 46 stray daemons: ['mds.cephfs.osd-mirror-6.hobjsy', 'osd.141', ...]

It all seems to boil down to the host names from `ceph orch host ls` not matching the other configurations.

ceph orch host ls
HOST                         ADDR          LABELS       STATUS
osd-mirror-1.our.domain.org  172.16.62.22  rgw osd
osd-mirror-2.our.domain.org  172.16.62.23  rgw osd
osd-mirror-3.our.domain.org  172.16.62.24  rgw osd
osd-mirror-4.our.domain.org  172.16.62.25  rgw mds osd
osd-mirror-5.our.domain.org  172.16.62.32  rgw osd
osd-mirror-6.our.domain.org  172.16.62.33  rgw mds osd

hostname
osd-mirror-6

hostname -f
osd-mirror-6.our.domain.org

0|0[root@osd-mirror-6 ~]# ceph mon metadata | grep "\"hostname\""
         "hostname": "osd-mirror-1",
         "hostname": "osd-mirror-3",
         "hostname": "osd-mirror-2",

0|1[root@osd-mirror-6 ~]# ceph mgr metadata | grep "\"hostname\""
         "hostname": "osd-mirror-1",
         "hostname": "osd-mirror-3",
         "hostname": "osd-mirror-4",
         "hostname": "osd-mirror-5",


The documentation states that "cephadm demands that the name of host given via `ceph orch host add` equals the output of `hostname` on remote hosts":

https://docs.ceph.com/en/latest/cephadm/host-management/#fully-qualified-domain-names-vs-bare-host-names

https://docs.ceph.com/en/octopus/cephadm/concepts/?#fully-qualified-domain-names-vs-bare-host-names

But it seems our cluster wasn't set up like this.

How can I now change the host names which were assigned when adding the hosts with `ceph orch host add HOSTNAME`?

I can't seem to find any documentation on changing the host names which
are listed by `ceph orch host ls`.
All I can find is related to changing the actual name of the host in the
system.
The CRUSH map also just contains the bare host names.
So, where are these FQDN names actually registered?

Thank you for help.

Best regards,
Mathias

--
Mathias Kuhring

Dr. rer. nat.
Bioinformatician
HPC & Core Unit Bioinformatics
Berlin Institute of Health at Charité (BIH)

E-Mail:  mathias.kuhring@xxxxxxxxxxxxxx
Mobile: +49 172 3475576
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



