Re: Stray monitor

I count five sources that Ceph can query to report, display, or
control its resources.

1. The /etc/ceph/ceph.conf file. Mostly supplanted by the Ceph
configuration database.

2. The Ceph configuration database. A nameless key/value store internal
to a Ceph cluster. It's distributed (no fixed location) and accessed by
Ceph commands and APIs.

3. Legacy Ceph resources. Stuff found under a host's /var/lib/ceph
directory.

4. Managed Ceph resources. Stuff found under a host's
/var/lib/ceph/{fsid} directory.

5. The live machine state of Ceph. Since this can vary not only from
host to host but also from service to service, I don't think it is
considered an authoritative source of information.
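For what it's worth, the five sources above can each be inspected
directly. Here's a rough sketch (these commands need a live cluster,
and {fsid} stands for your cluster's actual fsid):

```shell
# 1. The flat config file (mostly superseded by the config database)
cat /etc/ceph/ceph.conf

# 2. The centralized configuration database, held by the monitors
ceph config dump

# 3. Legacy resources: daemon directories directly under /var/lib/ceph
ls /var/lib/ceph/

# 4. Managed resources: daemon directories under the cluster fsid
ls /var/lib/ceph/{fsid}/
cephadm ls          # what cephadm itself knows about this host

# 5. Live machine state, as reported by the running daemons
ceph status
ceph orch ps
```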

Compounding this is that current releases of Ceph can all too easily
end up in a "forbidden" state where you may have, for example, a legacy
osd.6 and a managed osd.6 on the same host. In such a case, the system
is generally operable but functionally corrupt, and it ideally should
be corrected by removing the redundant resource.
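One rough way to spot such a duplicate on a given host is to compare
what cephadm manages against any legacy daemon directories (a sketch
only; osd.6 and {fsid} are illustrative):

```shell
# cephadm-managed daemons on this host (live under /var/lib/ceph/{fsid})
cephadm ls

# legacy OSD data directories, if any survive from a pre-cephadm install
ls -d /var/lib/ceph/osd/ceph-* 2>/dev/null
```

If the same OSD id shows up in both places, you have the duplicate
condition described above.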

The real issue is that depending on which Ceph interface you're
querying (or which one "ceph health" is querying!), you don't always
get your answer from a single authoritative source, so you'll get
conflicting results and annoying error reports. The "stray daemon"
condition is an especially egregious example of this: it can arise not
only from a false detection by one of the above sources but also, I
think, from "dead" daemons still being referenced in CRUSH.
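When chasing one of these down, the usual places to look, and the usual
cleanup commands, go something like this (hedged sketch; osd.6, mon X,
and the host name are hypothetical examples, not taken from your
cluster):

```shell
# What cephadm thinks is running vs. what the mons report
ceph orch ps
ceph node ls

# Drop a dead OSD's lingering reference from the CRUSH map
ceph osd crush remove osd.6

# Drop a dead monitor from the monmap
ceph mon remove X

# Remove an unreachable host that cephadm still remembers
ceph orch host rm myhost --offline --force
```

Obviously, only remove things you are certain are really dead.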

You might want to search this list's archives for the "phantom host"
postings I made around this past June, because I was absolutely
plagued with them. Eugen Block eventually helped me purge them all.

   Regards,
      Tim

On Sat, 2024-11-16 at 21:42 +0100, Jakub Daniel wrote:
> Hello,
> 
> I'm pretty new to Ceph deployment. I have set up my first CephFS
> cluster using cephadm. Initially, I deployed Ceph in 3 VirtualBox
> instances that I called cephfs-cluster-node-{0, 1, 2} just to test
> things. Later, I added 5 more real hardware nodes. Then I decided
> I'd remove the VirtualBox machines, so I drained the OSDs and
> removed the hosts. Suddenly, ceph status detail started reporting
> 
> HEALTH_WARN 1 stray host(s) with 1 daemon(s) not managed by cephadm
> [WRN] CEPHADM_STRAY_HOST: 1 stray host(s) with 1 daemon(s) not
> managed by cephadm
>     stray host cephfs-cluster-node-2 has 1 stray daemons: ['mon.X']
> 
> The host cephfs-cluster-node-2 is no longer listed among the hosts;
> it is (and has been for tens of hours) offline (powered down). The
> mon.X doesn't even belong to that node; it is one of the real
> hardware nodes. I am unaware of mon.X ever running on
> cephfs-cluster-node-2 (I never noticed it among the systemd units).
> 
> How does cephadm shell -- ceph status detail come to the conclusion
> that there is something stray? How can I address this?
> 
> Thank you for any insights
> Jakub
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx