Re: osd removal leaves 'stray daemon'

Hi Tim,

I see what you're referring to, but it doesn't apply here, since there are actually **no** stray daemons, that is, **no** ghost processes on any host trying to start.

Here we're talking about unexpected behavior, most likely a bug.
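
For reference, a quick way to verify the absence of such processes on each host (using osd.224 from the thread below as the example id) is something like:

# list the daemons cephadm knows about on this host
cephadm ls | grep osd.224

# look for any systemd unit still trying to start the OSD
systemctl list-units --all 'ceph*osd*'

Neither turns up anything for the removed OSD on any of our hosts.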

Regards,
Frédéric.

----- On 7 Nov 24, at 21:08, Tim Holloway timh@xxxxxxxxxxxxx wrote:

> You can get this sort of behaviour because different Ceph subsystems get
> their information from different places instead of consulting a single
> authoritative source (see the examples after the list below).
> 
> Specifically, Ceph may look directly at:
> 
> A) Its configuration database
> 
> B) Systemd units running on the OSD host
> 
> C) Containers running Ceph daemons.
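> 
> For illustration, here are the kinds of commands one might use to poke
> at each of those sources (a sketch, not an exhaustive list):
> 
> # A) the configuration/cluster database
> ceph config dump
> ceph osd tree
> 
> # B) systemd units on the OSD host
> systemctl list-units --all 'ceph*'
> 
> # C) containers running Ceph daemons
> cephadm ls
> podman ps    # or 'docker ps', depending on the container runtime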
> 
> The problem is especially likely if you've managed to end up running the
> same OSD number under both the legacy layout (/var/lib/ceph/osd/ceph-x)
> and the cephadm-managed layout (/var/lib/ceph/{fsid}/osd.x).
> 
> If you have a dual-defined OSD, the cleanest approach seems to be to
> stop Ceph on the bad machine and manually delete the
> /var/lib/ceph/osd/ceph-x directory. You may also need to delete the
> systemd unit file for that OSD.
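> 
> A minimal sketch of that cleanup, using osd.224 from this thread and
> the usual legacy unit naming (adjust the id and paths to your setup):
> 
> systemctl stop ceph-osd@224
> systemctl disable ceph-osd@224
> rm -rf /var/lib/ceph/osd/ceph-224
> 
> It's worth re-checking 'ceph osd tree' afterwards before touching
> anything else.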
> 
> You cannot delete the systemd unit for a cephadm-managed OSD: it's
> dynamically created when the system comes up and will simply re-create
> itself. That is why it's easier to purge the artefacts of a legacy OSD.
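> 
> For a cephadm-managed OSD, the unit is an instance of a generated
> template, named something like (fsid placeholder, not a literal name):
> 
> systemctl status ceph-<fsid>@osd.224.service
> 
> so there is no standalone unit file to remove in the first place.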
> 
>    Tim
> 
> 
> On 11/7/24 10:28, Frédéric Nass wrote:
>> Hi,
>>
>> We're encountering this unexpected behavior as well. This tracker [1] was
>> created 4 months ago.
>>
>> Regards,
>> Frédéric.
>>
>> [1] https://tracker.ceph.com/issues/67018
>>
>> ----- On 6 Dec 22, at 8:41, Holger Naundorf naundorf@xxxxxxxxxxxxxx wrote:
>>
>>> Hello,
>>> a mgr failover did not change the situation - the osd still shows up
>>> in 'ceph node ls'. I assume this is more or less 'working as
>>> intended', since I asked for the OSD to be kept in the CRUSH map to be
>>> replaced later - but as we are still not very experienced with Ceph
>>> here, I wanted to get some input from other sites.
>>>
>>> Regards,
>>> Holger
>>>
>>> On 30.11.22 16:28, Adam King wrote:
>>>> I typically don't see this when I do OSD replacement. If you do a mgr
>>>> failover ("ceph mgr fail") and wait a few minutes does this still show
>>>> up? The stray daemon/host warning is roughly equivalent to comparing the
>>>> daemons in `ceph node ls` and `ceph orch ps` and seeing if there's
>>>> anything in the former but not the latter. Sometimes I have seen that
>>>> the mgr has some out-of-date info in the node ls, and a failover will
>>>> refresh it.
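>>>>
>>>> Roughly, the comparison and the refresh amount to something like:
>>>>
>>>> ceph node ls    # what the mgr thinks runs on each host
>>>> ceph orch ps    # what cephadm is actually managing
>>>> ceph mgr fail   # force a failover to refresh stale mgr state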
>>>>
>>>> On Fri, Nov 25, 2022 at 6:07 AM Holger Naundorf
>>>> <naundorf@xxxxxxxxxxxxxx> wrote:
>>>>
>>>>      Hello,
>>>>      I have a question about osd removal/replacement:
>>>>
>>>>      I just removed an osd where the disk was still running but had read
>>>>      errors, leading to failed deep scrubs. As the intent is to replace
>>>>      this as soon as we manage to get a spare, I removed it with the
>>>>      '--replace' flag:
>>>>
>>>>      # ceph orch osd rm 224 --replace
>>>>
>>>>      After all placement groups were evacuated, I now have 1 osd
>>>>      down/out, showing as 'destroyed':
>>>>
>>>>      # ceph osd tree
>>>>      ID   CLASS  WEIGHT      TYPE NAME        STATUS     REWEIGHT  PRI-AFF
>>>>      (...)
>>>>      214    hdd    14.55269          osd.214         up   1.00000  1.00000
>>>>      224    hdd    14.55269          osd.224  destroyed         0  1.00000
>>>>      234    hdd    14.55269          osd.234         up   1.00000  1.00000
>>>>      (...)
>>>>
>>>>      All as expected - but now the health check complains that the
>>>>      (destroyed) osd is not managed:
>>>>
>>>>      # ceph health detail
>>>>      HEALTH_WARN 1 stray daemon(s) not managed by cephadm
>>>>      [WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
>>>>            stray daemon osd.224 on host ceph19 not managed by cephadm
>>>>
>>>>      Is this expected behaviour, so that I have to live with the yellow
>>>>      check until we get a replacement disk and recreate the osd, or did
>>>>      something not finish correctly?
>>>>
>>>>      Regards,
>>>>      Holger
>>>>
>>>>      --
>>>>      Dr. Holger Naundorf
>>>>      Christian-Albrechts-Universität zu Kiel
>>>>      Rechenzentrum / HPC / Server und Storage
>>>>      Tel: +49 431 880-1990
>>>>      Fax:  +49 431 880-1523
>>>>      naundorf@xxxxxxxxxxxxxx
>>>>
>>> --
>>> Dr. Holger Naundorf
>>> Christian-Albrechts-Universität zu Kiel
>>> Rechenzentrum / HPC / Server und Storage
>>> Tel: +49 431 880-1990
>>> Fax:  +49 431 880-1523
>>> naundorf@xxxxxxxxxxxxxx
>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



