Hi,
just to be safe, did you fail the mgr? If not, try 'ceph mgr fail' and
see if it still reports that information. It sounds like you didn't
clean up your virtual MON after you drained the OSDs. Or how exactly
did you drain that host? If you run 'ceph orch host drain {host}' the
orchestrator will remove all daemons, not only OSDs. I assume there
are still some entries in the ceph config-key database, which is where
the 'device ls-by-host' output comes from, but I haven't verified that.
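Something like this should be enough to rule out stale mgr state (just
a quick sketch, no arguments needed since failing the active mgr is
sufficient):

  ceph mgr fail        # fail the active mgr so its cached state is rebuilt
  ceph health detail   # check whether the stray host is still reported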
Is this key present?
ceph config-key get mgr/cephadm/host.cephfs-cluster-node-2 | jq
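You could also list everything the orchestrator has stored for that
host; the grep pattern is just an example, and I haven't tested
whether removing the key on its own is enough:

  ceph config-key ls | grep cephfs-cluster-node-2             # any leftover keys for the old host?
  ceph config-key rm mgr/cephadm/host.cephfs-cluster-node-2   # only if the host is really gone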
Is the VM still reachable? If it is, check the MON directory under
/var/lib/ceph/{FSID}/mon.X and remove it. You can also first check
with 'cephadm ls --no-detail' whether cephadm believes a MON daemon is
still present there. Remove the directory so cephadm forgets about it.
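Roughly like this on the VM ({FSID} and mon.X as above, adjust to what
you actually find):

  cephadm ls --no-detail              # does cephadm still list a mon daemon?
  ls /var/lib/ceph/{FSID}/            # look for a leftover mon.X directory
  rm -rf /var/lib/ceph/{FSID}/mon.X   # remove it

After that, another 'ceph mgr fail' should make the warning go away.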
Quoting Jakub Daniel <jakub.daniel@xxxxxxxxx>:
I have found these two threads
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/VHZ7IJ7PAL7L2INLSHNVYY7V7ZCXD46G/#TSWERUMAEEGZPSYXG6PSS4YMRXPP3L63
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/NG5QVRTVCLLYNLK56CSYLIPE4WBFXS5U/#HJDBAJFX27KATC4WV2MKGLVGLN2HTWWD
but I didn't exactly figure out what to do. I've since found remnants of
the removed cephfs-cluster-node-{0,1,2} in crush map buckets, which I
removed with no effect on health detail. I also found that the ceph
dashboard lists the non-existent cephfs-cluster-node-2 among the ceph
hosts (while orch host ls doesn't). On the other hand, device ls-by-host
cephfs-cluster-node-2 lists an entry whose device name coincides with a
drive on the live host X, with the daemons column listing mon.X.
Meanwhile, device ls-by-host X lists (among other things) the exact same
entry, with the difference that the dev column shows the actual device
nvme0n1, whereas for cephfs-cluster-node-2 that column was empty:
root@X:~# cephadm shell -- ceph device ls-by-host cephfs-cluster-node-2
DEVICE                                      DEV      DAEMONS  EXPECTED FAILURE
Samsung_SSD_970_PRO_1TB_S462NF0M310269L              mon.X
root@X:~# cephadm shell -- ceph device ls-by-host X
DEVICE                                      DEV      DAEMONS  EXPECTED FAILURE
Samsung_SSD_850_EVO_250GB_S2R6NX0J423123P   sdb      osd.3
Samsung_SSD_860_EVO_1TB_S3Z9NY0M431048H              mon.Y
Samsung_SSD_970_PRO_1TB_S462NF0M310269L     nvme0n1  mon.X
This output is confusing since it also lists mon.Y when asked for host X.
I will continue investigating. If anyone has any hints on what to try or
where to look, I would be very grateful.
Jakub
On Sun, 17 Nov 2024, 15:34 Tim Holloway, <timh@xxxxxxxxxxxxx> wrote:
I think I can count 5 sources that Ceph can query to
report/display/control its resources.
1. The /etc/ceph/ceph.conf file. Mostly supplanted by the Ceph
configuration database.
2. The Ceph configuration database. A nameless key/value store internal
to a Ceph cluster. It's distributed (no fixed location) and accessed via
Ceph commands and APIs.
3. Legacy Ceph resources. Stuff found under a host's /var/lib/ceph
directory.
4. Managed Ceph resources. Stuff found under a host's
/var/lib/ceph/{fsid} directory.
5. The live machine state of Ceph. Since this can vary not only from
host to host but also from service to service, I don't think it's
considered an authoritative source of information.
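For what it's worth, each of these can be inspected directly; the
commands below are only a sketch using the usual default paths, so
adjust them to your deployment:

  cat /etc/ceph/ceph.conf           # 1. the local config file
  ceph config dump                  # 2. the cluster configuration database
  ls /var/lib/ceph/                 # 3. legacy daemon directories (e.g. osd/ceph-6)
  ls /var/lib/ceph/<fsid>/          # 4. cephadm-managed daemon directories (e.g. osd.6)
  cephadm ls                        # 5. what cephadm thinks is deployed on this host
  systemctl list-units 'ceph*'      #    and what is actually running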
Compounding this is that current releases of Ceph can all too easily
end up in a "forbidden" state where you may have, for example, a legacy
OSD.6 and a managed OSD.6 on the same host. In such a case, the system
is generally operable but functionally corrupt, and it should ideally be
corrected to remove the redundant resource.
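A quick way to spot that kind of duplication is to compare the two
directory layouts on the host (the paths below are the usual defaults):

  ls -d /var/lib/ceph/osd/ceph-*    # legacy-style OSD directories
  ls -d /var/lib/ceph/*/osd.*       # cephadm-managed OSD directories (under the fsid)

Any OSD id that shows up in both listings is a candidate for cleanup.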
The real issue is that depending on what Ceph interface you're querying
(or "ceph health" is querying!), you don't always get your answer from
a single authoritative source, so you'll get conflicting results and
annoying error reports. The "stray daemon" condition is an especially
egregious example of this, and it's not only possible because of a
false detection from one of the above sources, but also, I think can
come from "dead" daemons being referenced in CRUSH.
You might want to run through this list's history for "phantom host"
postings made by me back around this past June because I was absolutely
plagued with them. Eugen Block helped me eventually purge them all.
Regards,
Tim
On Sat, 2024-11-16 at 21:42 +0100, Jakub Daniel wrote:
Hello,
I'm pretty new to ceph deployment. I have set up my first cephfs
cluster using cephadm. Initially, I deployed ceph in 3 VirtualBox
instances that I called cephfs-cluster-node-{0, 1, 2} just to test
things. Later, I added 5 more real hardware nodes, and eventually I
decided to remove the VirtualBox instances, so I drained the OSDs and
removed the hosts. Suddenly, ceph status detail started reporting
HEALTH_WARN 1 stray host(s) with 1 daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_HOST: 1 stray host(s) with 1 daemon(s) not managed by cephadm
    stray host cephfs-cluster-node-2 has 1 stray daemons: ['mon.X']
The cephfs-cluster-node-2 host is no longer listed among the hosts, and
it has been offline (powered down) for tens of hours. The mon.X daemon
doesn't even belong to that node; it runs on one of the real hardware
nodes. I am unaware of mon.X ever running on cephfs-cluster-node-2 (I
never noticed it among the systemd units).
How does cephadm shell -- ceph status detail come to the conclusion
that there is something stray? How can I address this?
Thank you for any insights
Jakub
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx