Re: [ext] Re: Rename / change host names set with `ceph orch host add`

Cephadm actually builds the list of daemons on the host by looking at
subdirectories in /var/lib/ceph/. "cephadm:v1" type daemons correspond to
directories within /var/lib/ceph/<fsid> while "legacy" daemons correspond
to directories of format /var/lib/ceph/<daemon-type>-<daemon-id> where
<daemon-type> is one of "mon", "osd", "mds", "mgr". So, in this case, I'm
guessing that host has a directory like "/var/lib/ceph/mon-osd-mirror-1".
To "remove" the daemon, you should just have to remove the directory.
Additionally, I'll add that the "Inferring config" issue itself is tracked in
https://tracker.ceph.com/issues/54571 and should be resolved as of 16.2.9 and
17.2.1, so hopefully removing these legacy daemon dirs won't be necessary in
the future.
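
For example, roughly (just a sketch; the exact directory name is my guess from
the error output below, so confirm what is actually on the host before
deleting anything):

# on osd-mirror-1: check which legacy daemons cephadm thinks exist
cephadm ls

# inspect the suspected stale legacy mon directory first
ls -l /var/lib/ceph/mon-osd-mirror-1

# if it only holds leftovers of the removed mon, delete it
rm -rf /var/lib/ceph/mon-osd-mirror-1

# on the next refresh the "legacy" entry should disappear from cephadm ls
cephadm ls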

Thanks,
  - Adam King

On Thu, Jun 23, 2022 at 6:42 AM Kuhring, Mathias <
mathias.kuhring@xxxxxxxxxxxxxx> wrote:

> Hey Adam,
>
> thanks again for your help.
>
> I finally got around to executing your suggested procedure. It went mostly
> fine.
> Except when I renamed the last host, I ended up with a rogue mon.
> I assume a new mon was created on a different host while the last one was
> "out" of cephadm.
> And the remaining mon on the last host is now not cleaned up by cephadm
> (maybe due to being legacy).
>
> I got the following warnings (relevant sections):
> [WRN] CEPHADM_APPLY_SPEC_FAIL: Failed to apply 1 service(s):
> osd.all-available-devices
>     osd.all-available-devices: host osd-mirror-1 `cephadm ceph-volume`
> failed: cephadm exited with an error code: 1, stderr:Inferring config
> /var/lib/ceph/7efa00f9-182f-40f4-9136-d51895db1f0b/mon.osd-mirror-1/config
> ERROR: [Errno 2] No such file or directory:
> '/var/lib/ceph/7efa00f9-182f-40f4-9136-d51895db1f0b/mon.osd-mirror-1/config'
> [WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
>     host osd-mirror-1 `cephadm ceph-volume` failed: cephadm exited with an
> error code: 1, stderr:Inferring config
> /var/lib/ceph/7efa00f9-182f-40f4-9136-d51895db1f0b/mon.osd-mirror-1/config
> ERROR: [Errno 2] No such file or directory:
> '/var/lib/ceph/7efa00f9-182f-40f4-9136-d51895db1f0b/mon.osd-mirror-1/config'
>
> The mon is not running and not required anymore. It's no longer listed via
> `systemctl`, `ceph orch ps`, or `ceph status`:
> mon:           3 daemons, quorum osd-mirror-3,osd-mirror-2,osd-mirror-6
> (age 3m)
>
> But cephadm is still aware of the late daemon and trying to use (?) it. From
> `cephadm ls`:
>     {
>         "style": "legacy",
>         "name": "mon.osd-mirror-1",
>         "fsid": "7efa00f9-182f-40f4-9136-d51895db1f0b",
>         "systemd_unit": "ceph-mon@osd-mirror-1",
>         "enabled": false,
>         "state": "unknown",
>         "host_version": "15.2.14"
>     },
>
> Tried to remove it (which didn't help):
> cephadm rm-daemon --name mon.osd-mirror-1 --fsid
> 7efa00f9-182f-40f4-9136-d51895db1f0b --force --force-delete-data
>
> And then figured out cephadm is not supposed to remove legacy daemons:
> https://tracker.ceph.com/issues/45976
>
> Also tried some manual removal without success:
> 0|0[root@osd-mirror-1 ~]# service ceph -a stop mon.osd-mirror-1
> The service command supports only basic LSB actions (start, stop, restart,
> try-restart, reload, force-reload, status). For other actions, please try
> to use systemctl.
> 0|0[root@osd-mirror-1 ~]# ceph mon remove osd-mirror-1
> mon.osd-mirror-1 does not exist or has already been removed
>
> What other options do I have to remove this daemon?
> I.e., the rest of the information cephadm keeps, which results in it thinking
> the mon is still available?
>
> Thanks again for all your help.
>
> Best, Mathias
> On 5/20/2022 5:16 PM, Adam King wrote:
>
> To clarify a bit, "ceph orch host rm <hostname> --force" won't actually
> touch any of the daemons on the host. It just stops cephadm from managing
> the host. I.e. it won't add/remove daemons on the host. If you remove the
> host then re-add it with the new host name nothing should actually happen
> to the daemons there. The only possible exception is if you have services
> whose placement uses count and one of the daemons from that service is on
> the host being temporarily removed. It's possible it could try to deploy
> that daemon on another host in the interim. However, OSDs are never placed
> like that, so there would never be any need for flags like noout or
> nobackfill. The worst case would be it moving a mon or mgr around. If you
> make sure all the important services are deployed by label, explicit hosts,
> etc. (just not count), then there should be no risk of any daemons moving at
> all, and this is a pretty safe operation.
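>
> For illustration, a label-based mon placement would look roughly like this (a
> sketch only; the label name and file name are just examples):
>
> ceph orch host label add osd-mirror-1 mon    # repeat for each mon host
>
> # mon.yaml
> service_type: mon
> placement:
>   label: mon
>
> ceph orch apply -i mon.yaml
>
> With a label (or explicit hosts) instead of count, temporarily removing a
> host just shrinks the candidate set, so cephadm shouldn't try to place a
> replacement mon on another host while it's out.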
>
> On Fri, May 20, 2022 at 3:36 AM Kuhring, Mathias <
> mathias.kuhring@xxxxxxxxxxxxxx> wrote:
>
>> Hey Adam,
>>
>> thanks for your fast reply.
>>
>> That's a bit more invasive and risky than I was hoping for.
>> But if this is the only way, I guess we need to do this.
>>
>> Would it be advisable to set some maintenance flags like noout,
>> nobackfill, or norebalance?
>> And maybe stop the ceph target on the host I'm re-adding to pause all
>> daemons?
>>
>> Best, Mathias
>> On 5/19/2022 8:14 PM, Adam King wrote:
>>
>> cephadm just takes the hostname given in the "ceph orch host add"
>> commands and assumes it won't change. The FQDN names (or whatever "ceph
>> orch host ls" shows in any scenario) are from whatever input was given in
>> those commands. Cephadm will even try to verify the hostname matches what
>> is given when adding the host. As for where it is stored, we keep that info
>> in the mon key store and it isn't meant to be manually updated (ceph
>> config-key get mgr/cephadm/inventory). That said, people have occasionally
>> run into issues related to a mismatch between an FQDN and a short name.
>> There's no built-in command for changing a hostname because of the
>> expectation that it won't change. However, you should be able to fix this by
>> removing and re-adding the host, e.g. "ceph orch host rm
>> osd-mirror-1.our.domain.org" followed by "ceph orch host add
>> osd-mirror-1 172.16.62.22 --labels rgw --labels osd". If you're on a late
>> enough version that it requests you drain the host before it can be removed
>> (it was some pacific dot release, don't remember which one), you can pass
>> --force to the host rm command. Generally it's not a good idea to remove
>> hosts from cephadm's control while there are still cephadm-deployed daemons
>> on them like that, but this is a special case. Anyway, removing and
>> re-adding the host is the only (reasonable) way I can remember to change
>> what it has stored for the hostname.
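>>
>> Roughly, the full sequence would be something like this (just a sketch using
>> the names from your `ceph orch host ls` output; add --force to the rm if
>> your version asks you to drain the host first):
>>
>> # see what cephadm currently has stored for the host inventory
>> ceph config-key get mgr/cephadm/inventory
>>
>> # drop the FQDN entry, then re-add the host under its bare name
>> ceph orch host rm osd-mirror-1.our.domain.org
>> ceph orch host add osd-mirror-1 172.16.62.22 --labels rgw --labels osd
>>
>> # the HOST column should now match `hostname` on the node itself
>> ceph orch host ls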
>>
>> Let me know if that doesn't work,
>>  - Adam King
>>
>> On Thu, May 19, 2022 at 1:41 PM Kuhring, Mathias <
>> mathias.kuhring@xxxxxxxxxxxxxx> wrote:
>>
>>> Dear ceph users,
>>>
>>> one of our clusters is complaining about plenty of stray hosts and
>>> daemons. Pretty much all of them.
>>>
>>> [WRN] CEPHADM_STRAY_HOST: 6 stray host(s) with 280 daemon(s) not managed
>>> by cephadm
>>>      stray host osd-mirror-1 has 47 stray daemons:
>>> ['mgr.osd-mirror-1.ltmyyh', 'mon.osd-mirror-1', 'osd.1', ...]
>>>      stray host osd-mirror-2 has 46 stray daemons: ['mon.osd-mirror-2',
>>> 'osd.0', ...]
>>>      stray host osd-mirror-3 has 48 stray daemons:
>>> ['cephfs-mirror.osd-mirror-3.qzcuvv', 'mgr.osd-mirror-3',
>>> 'mon.osd-mirror-3', 'osd.101', ...]
>>>      stray host osd-mirror-4 has 47 stray daemons:
>>> ['mds.cephfs.osd-mirror-4.omjlxu', 'mgr.osd-mirror-4', 'osd.103', ...]
>>>      stray host osd-mirror-5 has 46 stray daemons: ['mgr.osd-mirror-5',
>>> 'osd.139', ...]
>>>      stray host osd-mirror-6 has 46 stray daemons:
>>> ['mds.cephfs.osd-mirror-6.hobjsy', 'osd.141', ...]
>>>
>>> It all seems to boil down to the host names from `ceph orch host ls` not
>>> matching other configurations.
>>>
>>> ceph orch host ls
>>> HOST                                ADDR          LABELS STATUS
>>> osd-mirror-1.our.domain.org  172.16.62.22  rgw osd
>>> osd-mirror-2.our.domain.org  172.16.62.23  rgw osd
>>> osd-mirror-3.our.domain.org  172.16.62.24  rgw osd
>>> osd-mirror-4.our.domain.org  172.16.62.25  rgw mds osd
>>> osd-mirror-5.our.domain.org  172.16.62.32  rgw osd
>>> osd-mirror-6.our.domain.org  172.16.62.33  rgw mds osd
>>>
>>> hostname
>>> osd-mirror-6
>>>
>>> hostname -f
>>> osd-mirror-6.our.domain.org
>>>
>>> 0|0[root@osd-mirror-6 ~]# ceph mon metadata | grep "\"hostname\""
>>>          "hostname": "osd-mirror-1",
>>>          "hostname": "osd-mirror-3",
>>>          "hostname": "osd-mirror-2",
>>>
>>> 0|1[root@osd-mirror-6 ~]# ceph mgr metadata | grep "\"hostname\""
>>>          "hostname": "osd-mirror-1",
>>>          "hostname": "osd-mirror-3",
>>>          "hostname": "osd-mirror-4",
>>>          "hostname": "osd-mirror-5",
>>>
>>>
>>> The documentation states that "cephadm demands that the name of the host
>>> given via `ceph orch host add` equals the output of `hostname` on remote
>>> hosts".
>>>
>>>
>>> https://docs.ceph.com/en/latest/cephadm/host-management/#fully-qualified-domain-names-vs-bare-host-names
>>>
>>>
>>> https://docs.ceph.com/en/octopus/cephadm/concepts/?#fully-qualified-domain-names-vs-bare-host-names
>>>
>>> But it seems our cluster wasn't set up like this.
>>>
>>> How can I now change the host names which were assigned when adding the
>>> hosts with `ceph orch host add HOSTNAME`?
>>>
>>> I can't seem to find any documentation on changing the host names which
>>> are listed by `ceph orch host ls`.
>>> All I can find is related to changing the actual name of the host in the
>>> system.
>>> The crush map also just contains the bare host names.
>>> So, where are these FQDN names actually registered?
>>>
>>> Thank you for help.
>>>
>>> Best regards,
>>> Mathias
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>
>> --
>> Mathias Kuhring
>>
>> Dr. rer. nat.
>> Bioinformatician
>> HPC & Core Unit Bioinformatics
>> Berlin Institute of Health at Charité (BIH)
>>
>> E-Mail:  mathias.kuhring@xxxxxxxxxxxxxx
>> Mobile: +49 172 3475576
>>
>> --
> Mathias Kuhring
>
> Dr. rer. nat.
> Bioinformatician
> HPC & Core Unit Bioinformatics
> Berlin Institute of Health at Charité (BIH)
>
> E-Mail:  mathias.kuhring@xxxxxxxxxxxxxx
> Mobile: +49 172 3475576
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



