Re: Stray monitor

I have found these two threads:

https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/VHZ7IJ7PAL7L2INLSHNVYY7V7ZCXD46G/#TSWERUMAEEGZPSYXG6PSS4YMRXPP3L63

https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/NG5QVRTVCLLYNLK56CSYLIPE4WBFXS5U/#HJDBAJFX27KATC4WV2MKGLVGLN2HTWWD

but I didn't exactly figure out what to do. I have since found remnants of
the removed cephfs-cluster-node-{0,1,2} hosts in the CRUSH map buckets,
which I removed with no effect on ceph health detail.
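
For completeness, the removal was done with something along these lines
(I'm reconstructing the exact commands from memory, so treat this as a
sketch rather than a transcript):

ceph osd crush rm cephfs-cluster-node-0
ceph osd crush rm cephfs-cluster-node-1
ceph osd crush rm cephfs-cluster-node-2
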
I also found that the Ceph dashboard lists the non-existent
cephfs-cluster-node-2 among the hosts (while ceph orch host ls does not).
On top of that, ceph device ls-by-host cephfs-cluster-node-2 lists an entry
whose device name coincides with a drive on the live host X, with mon.X in
the daemons column. Meanwhile, ceph device ls-by-host X lists, among other
things, the exact same entry, except that its DEV column shows the actual
device nvme0n1, whereas for cephfs-cluster-node-2 that column was empty:

root@X:~# cephadm shell -- ceph device ls-by-host cephfs-cluster-node-2
DEVICE                                   DEV  DAEMONS  EXPECTED FAILURE
Samsung_SSD_970_PRO_1TB_S462NF0M310269L       mon.X

root@X:~# cephadm shell -- ceph device ls-by-host X
DEVICE                                     DEV      DAEMONS  EXPECTED FAILURE
Samsung_SSD_850_EVO_250GB_S2R6NX0J423123P  sdb      osd.3
Samsung_SSD_860_EVO_1TB_S3Z9NY0M431048H             mon.Y
Samsung_SSD_970_PRO_1TB_S462NF0M310269L    nvme0n1  mon.X

This output is confusing, since it also lists mon.Y even though I asked
about host X.
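
In case it helps, my next step (assuming the stray-host warning is driven
by the hostname recorded in the daemon metadata, which I have not
confirmed) will be to compare the output of

ceph orch host ls
ceph mon metadata X
ceph health detail

to see whether mon.X still reports cephfs-cluster-node-2 as its hostname
anywhere.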

I will continue investigating. If anyone has any hints on what to try or
where to look, I would be very grateful.

Jakub

On Sun, 17 Nov 2024, 15:34 Tim Holloway, <timh@xxxxxxxxxxxxx> wrote:

> I think I can count 5 sources that Ceph can query to
> report/display/control its resources.
>
> 1. The /etc/ceph/ceph.conf file. Mostly supplanted by the Ceph
> configuration database.
>
> 2. The Ceph configuration database. A nameless key/value store internal
> to the Ceph cluster. It's distributed (no fixed location) and accessed
> through Ceph commands and APIs.
>
> 3. Legacy Ceph resources. Stuff found directly under a host's
> /var/lib/ceph directory.
>
> 4. Managed Ceph resources. Stuff found under a host's
> /var/lib/ceph/{fsid} directory.
>
> 5. The live machine state of Ceph. Since this can vary not only from
> host to host but also from service to service, I don't think it is
> considered an authoritative source of information. (A rough way to peek
> at each of these sources is sketched just below.)
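>
> For example (command names from memory, so please double-check before
> relying on them), each of the five can be inspected roughly like this:
>
> cat /etc/ceph/ceph.conf            # 1: the legacy config file
> ceph config dump                   # 2: the configuration database
> ls /var/lib/ceph                   # 3: legacy resources on this host
> ls /var/lib/ceph/$(ceph fsid)      # 4: cephadm-managed resources
> ceph orch ps --refresh             # 5: (approximately) the live state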
>
> Compounding this is that current releases of Ceph can all too easily
> end up in a "forbidden" state where you may have, for example, a legacy
> OSD.6 and a managed OSD.6 on the same host. In such a case, the system
> is generally operable but functionally corrupt, and it should ideally
> be corrected by removing the redundant resource.
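>
> One rough way to spot that situation (run directly on the suspect host;
> a sketch, not a definitive diagnostic) is
>
> cephadm ls | grep -E '"name"|"style"'
>
> which should show each daemon on the host along with whether cephadm
> considers it legacy or managed.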
>
> The real issue is that depending on which Ceph interface you're querying
> (or "ceph health" is querying!), you don't always get your answer from
> a single authoritative source, so you'll get conflicting results and
> annoying error reports. The "stray daemon" condition is an especially
> egregious example of this: it can arise not only from a false detection
> by one of the above sources, but also, I think, from "dead" daemons
> still being referenced in CRUSH.
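>
> If CRUSH is the culprit, something along these lines (again, just a
> sketch) should reveal any leftover host buckets:
>
> ceph osd crush tree
>
> An empty leftover bucket can usually be dropped with
> ceph osd crush rm <bucket-name>.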
>
> You might want to run through this list's history for the "phantom host"
> postings I made back around this past June, because I was absolutely
> plagued with them. Eugen Block helped me eventually purge them all.
>
>    Regards,
>       Tim
>
> On Sat, 2024-11-16 at 21:42 +0100, Jakub Daniel wrote:
> > Hello,
> >
> > I'm pretty new to Ceph deployment. I have set up my first CephFS
> > cluster using cephadm. Initially, I deployed Ceph in 3 VirtualBox
> > instances that I called cephfs-cluster-node-{0, 1, 2} just to test
> > things. Later, I added 5 more real hardware nodes. Eventually I
> > decided to remove the VirtualBox nodes, so I drained the OSDs and
> > removed the hosts. Suddenly, ceph status detail started reporting
> >
> > HEALTH_WARN 1 stray host(s) with 1 daemon(s) not managed by cephadm
> > [WRN] CEPHADM_STRAY_HOST: 1 stray host(s) with 1 daemon(s) not managed by cephadm
> >     stray host cephfs-cluster-node-2 has 1 stray daemons: ['mon.X']
> >
> > The host cephfs-cluster-node-2 is no longer listed among the hosts,
> > and it has been offline (powered down) for tens of hours. The mon.X
> > daemon doesn't even belong to that node; it belongs to one of the
> > real hardware nodes. I am unaware of mon.X ever having run on
> > cephfs-cluster-node-2 (I never noticed it among its systemd units).
> >
> > Where does cephadm shell -- ceph status detail come to the conclusion
> > that there is something stray? How can I address this?
> >
> > Thank you for any insights
> > Jakub
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


