I admit I don't follow what the exact problem is, but I wanted to point
out that as long as there are ANY OSD metadata files on a machine, some
(but not always all) ceph commands will consider there to be an OSD there.
To completely eradicate an OSD, I believe that (per Eugen Block) you
also have to set the OSD's CRUSH weight to 0.
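For illustration, zeroing the weight is a one-liner (a sketch; osd.224
is just the example ID borrowed from the thread below):

# ceph osd crush reweight osd.224 0

That stops data from mapping to it without removing the entry from the
CRUSH map itself.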
I had a heck of a time flushing out crud a few months back. I learned a
lot, whether I wanted to or not.
Tim
On 11/7/24 16:57, Frédéric Nass wrote:
Hi Tim,
I see what you're referring to, but it doesn't apply here, since there are actually **no** stray daemons, that is, **no** ghost processes on any host trying to start.
Here we're talking about unexpected behavior, most likely a bug.
Regards,
Frédéric.
----- On 7 Nov 24, at 21:08, Tim Holloway timh@xxxxxxxxxxxxx wrote:
You can get this sort of behaviour because different Ceph subsystems get
their information from different places instead of from a single
authoritative source.
Specifically, Ceph may look directly at:
A) Its configuration database
B) Systemd units running on the OSD host
C) Containers running ceph modules.
The problem is especially likely if you've managed to end up running the
same OSD number both as a legacy OSD (/var/lib/ceph/osd.x) and as a
managed one (/var/lib/ceph/{fsid}/ceph).
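A quick way to spot a dual definition is to list both locations (a
sketch following the path conventions above; exact layouts vary by
release):

# ls -d /var/lib/ceph/osd*
# ls -d /var/lib/ceph/*/osd*

If the same OSD number shows up in both listings, it's defined twice.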
If you have a dual-defined OSD, the cleanest approach seems to be to
stop ceph on the bad machine and manually delete the /var/lib/ceph/osd.x
directory. You may need to delete a systemd unit file for that OSD.
You cannot delete the systemd unit for managed OSDs: it's dynamically
created when the system comes up and will simply re-create itself, which
is why it's easier to purge the artefacts of a legacy OSD.
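As a sketch of that legacy cleanup (assuming a package-based install
with classic ceph-osd@ units; osd.224 is again just the example ID, and
double-check the data path before the rm - on many installs the legacy
data dir sits under /var/lib/ceph/osd/ rather than directly in
/var/lib/ceph/):

# systemctl stop ceph-osd@224
# systemctl disable ceph-osd@224
# rm -rf /var/lib/ceph/osd.224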
Tim
On 11/7/24 10:28, Frédéric Nass wrote:
Hi,
We're encountering this unexpected behavior as well. This tracker [1] was
created 4 months ago.
Regards,
Frédéric.
[1] https://tracker.ceph.com/issues/67018
----- On 6 Dec 22, at 8:41, Holger Naundorf naundorf@xxxxxxxxxxxxxx wrote:
Hello,
a mgr failover did not change the situation - the osd still shows up in
'ceph node ls'. I assume this is more or less 'working as intended',
since I did ask for the OSD to be kept in the CRUSH map to be replaced
later - but as we are still not so experienced with Ceph here, I wanted
to get some input from other sites.
Regards,
Holger
On 30.11.22 16:28, Adam King wrote:
I typically don't see this when I do OSD replacement. If you do a mgr
failover ("ceph mgr fail") and wait a few minutes, does this still show
up? The stray daemon/host warning is roughly equivalent to comparing the
daemons in `ceph node ls` and `ceph orch ps` and seeing if there's
anything in the former but not the latter. Sometimes I have seen that
the mgr will have some out-of-date info in the node ls and a failover
will refresh it.
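To reproduce that comparison by hand (a sketch; plain invocations,
nothing version-specific):

# ceph mgr fail
# ceph node ls
# ceph orch ps

Anything that appears in the node ls output but not in orch ps is what
gets reported as stray.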
On Fri, Nov 25, 2022 at 6:07 AM Holger Naundorf <naundorf@xxxxxxxxxxxxxx> wrote:
Hello,
I have a question about osd removal/replacement:
I just removed an osd where the disk was still running but had read
errors, leading to failed deep scrubs. As the intent is to replace it
as soon as we manage to get a spare, I removed it with the '--replace'
flag:
# ceph orch osd rm 224 --replace
After all placement groups were evacuated, I now have 1 osd down/out
and showing as 'destroyed':
# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME  STATUS     REWEIGHT  PRI-AFF
(...)
214  hdd    14.55269  osd.214    up         1.00000   1.00000
224  hdd    14.55269  osd.224    destroyed  0         1.00000
234  hdd    14.55269  osd.234    up         1.00000   1.00000
(...)
All as expected - but now the health check complains that the
(destroyed) osd is not managed:
# ceph health detail
HEALTH_WARN 1 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
stray daemon osd.224 on host ceph19 not managed by cephadm
Is this expected behaviour - do I have to live with the yellow check
until we get a replacement disk and recreate the osd - or did something
not finish correctly?
Regards,
Holger
--
Dr. Holger Naundorf
Christian-Albrechts-Universität zu Kiel
Rechenzentrum / HPC / Server und Storage
Tel: +49 431 880-1990
Fax: +49 431 880-1523
naundorf@xxxxxxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx