Re: osd removal leaves 'stray daemon'

Tim Holloway <timh@xxxxxxxxxxxxx> · Thu, 7 Nov 2024 15:08:42 -0500

You can get this sort of behaviour because different Ceph subsystems get 
their information from different places rather than from a single 
authoritative source.

Specifically, Ceph may look directly at:

A) Its configuration database

B) Systemd units running on the OSD host

C) Containers running ceph modules.

The problem is especially likely if you've managed to end up running the 
same OSD number as both legacy (/var/lib/ceph/osd/ceph-x) and managed 
(/var/lib/ceph/{fsid}/osd.x).

If you have a dual-defined OSD, the cleanest approach seems to be to 
stop ceph on the bad machine and manually delete the legacy 
/var/lib/ceph/osd/ceph-x directory. You may also need to delete the 
systemd unit file for that legacy OSD.

You cannot delete the systemd unit for managed OSDs. It's dynamically 
created when the system comes up and will simply re-create itself, which 
is why it's easier to purge the artefacts of a legacy OSD.
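
To check a host for a dual-defined OSD before deleting anything, something like the following could help. It is only a sketch: it assumes legacy OSDs live under <root>/osd/ceph-<id> and managed ones under <root>/<fsid>/osd.<id>, and the function names are made up for illustration. Adjust the patterns if your layout differs.

```python
import re
from pathlib import Path
from typing import List, Set

# Assumed directory layouts (adjust if your hosts differ):
#   legacy:  <root>/osd/ceph-<id>
#   managed: <root>/<fsid>/osd.<id>
LEGACY_RE = re.compile(r"^ceph-(\d+)$")
MANAGED_RE = re.compile(r"^osd\.(\d+)$")
FSID_RE = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$")


def legacy_osd_ids(root: Path) -> Set[int]:
    """OSD ids that have a legacy data directory under <root>/osd."""
    osd_dir = root / "osd"
    if not osd_dir.is_dir():
        return set()
    ids = set()
    for entry in osd_dir.iterdir():
        match = LEGACY_RE.match(entry.name)
        if entry.is_dir() and match:
            ids.add(int(match.group(1)))
    return ids


def managed_osd_ids(root: Path) -> Set[int]:
    """OSD ids that have a directory under any <root>/<fsid>/."""
    if not root.is_dir():
        return set()
    ids = set()
    for cluster_dir in root.iterdir():
        if not (cluster_dir.is_dir() and FSID_RE.match(cluster_dir.name)):
            continue
        for entry in cluster_dir.iterdir():
            match = MANAGED_RE.match(entry.name)
            if entry.is_dir() and match:
                ids.add(int(match.group(1)))
    return ids


def dual_defined_osds(root: Path) -> List[int]:
    """OSD ids present in both the legacy and the managed layout."""
    return sorted(legacy_osd_ids(root) & managed_osd_ids(root))


if __name__ == "__main__":
    for osd_id in dual_defined_osds(Path("/var/lib/ceph")):
        print(f"osd.{osd_id} is defined as both legacy and managed")
```

The script only reads directory names and never deletes anything, so the actual cleanup stays a manual decision.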

   Tim

On 11/7/24 10:28, Frédéric Nass wrote:
Hi,

We're encountering this unexpected behavior as well. This tracker [1] was created 4 months ago.

Regards,
Frédéric.

[1] https://tracker.ceph.com/issues/67018

----- On 6 Dec 22, at 8:41, Holger Naundorf naundorf@xxxxxxxxxxxxxx wrote:

Hello,
a mgr failover did not change the situation: the OSD still shows up in
'ceph node ls'. I assume that this is more or less 'working as
intended', as I did ask for the OSD to be kept in the CRUSH map to be
replaced later, but as we are still not so experienced with Ceph here I
wanted to get some input from other sites.

Regards,
Holger

On 30.11.22 16:28, Adam King wrote:
I typically don't see this when I do OSD replacement. If you do a mgr
failover ("ceph mgr fail") and wait a few minutes, does this still show
up? The stray daemon/host warning is roughly equivalent to comparing the
daemons in `ceph node ls` and `ceph orch ps` and seeing if there's
anything in the former but not the latter. Sometimes I have seen that the
mgr has some out-of-date info in the node ls output, and a failover will
refresh it.
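
The comparison described above can be sketched as a set difference. This is an illustration only, not the actual cephadm check: it assumes `ceph node ls` JSON maps daemon type to host to a list of ids, and that `ceph orch ps --format json` yields entries with `daemon_type`, `daemon_id` and `hostname` fields. The function names and the sample data are made up; only osd.224 on ceph19 comes from this thread.

```python
from typing import Iterable, List, Mapping, Set, Tuple

Daemon = Tuple[str, str, str]  # (daemon_type, daemon_id, hostname)


def daemons_from_node_ls(node_ls: Mapping[str, Mapping[str, Iterable]]) -> Set[Daemon]:
    """Flatten {"osd": {"ceph19": [224]}} into (type, id, host) tuples."""
    return {
        (dtype, str(daemon_id), host)
        for dtype, hosts in node_ls.items()
        for host, ids in hosts.items()
        for daemon_id in ids
    }


def daemons_from_orch_ps(orch_ps: Iterable[Mapping[str, object]]) -> Set[Daemon]:
    """Reduce orchestrator entries to the same (type, id, host) tuples."""
    return {
        (str(d["daemon_type"]), str(d["daemon_id"]), str(d["hostname"]))
        for d in orch_ps
    }


def stray_daemons(node_ls, orch_ps) -> List[Daemon]:
    """Daemons present in the node ls data but missing from orch ps."""
    return sorted(daemons_from_node_ls(node_ls) - daemons_from_orch_ps(orch_ps))


# Illustrative input: osd.223 is a hypothetical healthy neighbour.
node_ls = {"osd": {"ceph19": [223, 224]}}
orch_ps = [{"daemon_type": "osd", "daemon_id": "223", "hostname": "ceph19"}]

print(stray_daemons(node_ls, orch_ps))  # [('osd', '224', 'ceph19')]
```

If the list is still non-empty a few minutes after a failover, the entry is probably not just stale mgr data.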

On Fri, Nov 25, 2022 at 6:07 AM Holger Naundorf <naundorf@xxxxxxxxxxxxxx> wrote:

     Hello,
     I have a question about osd removal/replacement:

     I just removed an OSD whose disk was still running but had read
     errors, leading to failed deep scrubs. As the intent is to replace
     it as soon as we manage to get a spare, I removed it with the
     '--replace' flag:

     # ceph orch osd rm 224 --replace

     Now that all placement groups have been evacuated, I have 1 OSD
     down/out and showing as 'destroyed':

     # ceph osd tree
     ID   CLASS  WEIGHT      TYPE NAME        STATUS     REWEIGHT  PRI-AFF
     (...)
     214    hdd    14.55269          osd.214         up   1.00000  1.00000
     224    hdd    14.55269          osd.224  destroyed         0  1.00000
     234    hdd    14.55269          osd.234         up   1.00000  1.00000
     (...)

     All as expected - but now the health check complains that the
     (destroyed) osd is not managed:

     # ceph health detail
     HEALTH_WARN 1 stray daemon(s) not managed by cephadm
     [WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
           stray daemon osd.224 on host ceph19 not managed by cephadm

     Is this expected behaviour, meaning I have to live with the yellow
     health check until we get a replacement disk and recreate the OSD,
     or did something not finish correctly?
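
One way to check whether a stray warning is explained by an OSD that was destroyed for later replacement is to match the two outputs above against each other. The sketch below only parses the plain-text output as pasted in this message; the function names are made up, and the text format of these commands may differ between Ceph releases.

```python
import re
from typing import List, Set, Tuple

STRAY_RE = re.compile(r"stray daemon (\S+) on host (\S+) not managed by cephadm")


def parse_stray_daemons(health_detail: str) -> List[Tuple[str, str]]:
    """Return (daemon_name, host) pairs from `ceph health detail` text."""
    return STRAY_RE.findall(health_detail)


def destroyed_osds(osd_tree: str) -> Set[str]:
    """Return names such as 'osd.224' on lines whose status is 'destroyed'."""
    names = set()
    for line in osd_tree.splitlines():
        fields = line.split()
        if "destroyed" in fields:
            names.update(f for f in fields if f.startswith("osd."))
    return names


def strays_that_are_destroyed(health_detail: str, osd_tree: str) -> List[Tuple[str, str]]:
    """Stray daemons that are also listed as destroyed OSDs."""
    gone = destroyed_osds(osd_tree)
    return [(name, host) for name, host in parse_stray_daemons(health_detail) if name in gone]


# Sample text copied from the outputs in this message.
HEALTH = """HEALTH_WARN 1 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
      stray daemon osd.224 on host ceph19 not managed by cephadm
"""
TREE = """214    hdd    14.55269          osd.214         up   1.00000  1.00000
224    hdd    14.55269          osd.224  destroyed         0  1.00000
234    hdd    14.55269          osd.234         up   1.00000  1.00000
"""

print(strays_that_are_destroyed(HEALTH, TREE))  # [('osd.224', 'ceph19')]
```

A stray daemon that does not appear in this list would point to a different cause than the pending replacement.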

     Regards,
     Holger

     --
     Dr. Holger Naundorf
     Christian-Albrechts-Universität zu Kiel
     Rechenzentrum / HPC / Server und Storage
     Tel: +49 431 880-1990
     Fax:  +49 431 880-1523
     naundorf@xxxxxxxxxxxxxx
     _______________________________________________
     ceph-users mailing list -- ceph-users@xxxxxxx
     To unsubscribe send an email to ceph-users-leave@xxxxxxx
