Hi, thank you Eugen and Tim.

> did you fail the mgr?

I think I didn't.

> Or how exactly did you drain that host?

```
cephadm shell -- ceph orch host drain cephfs-cluster-node-2
cephadm shell -- ceph orch host rm cephfs-cluster-node-2
```

> `ceph config-key get mgr/cephadm/host.cephfs-cluster-node-2 | jq`

Error ENOENT:

> Is the VM still reachable?

No. It wasn't reachable for the majority of the time the warning was displayed.

> Remove the directory so cephadm forgets about it.

What directory do you mean?

I have since made the VM reachable again and removed it again a few times. I
even tried removing the mon.X that was running on X (and Ceph knew about it)
while Ceph was complaining that mon.X was running on cephfs-cluster-node-2. I
disabled and re-enabled the stray warnings. I found some custom config
pertaining to cephfs-cluster-node-0 (which `health detail` doesn't report as
stray), namely: `osd host:cephfs-cluster-node-0 basic osd_memory_target
5829100748`. I tried enabling the mgr module `cli` to query `list_servers`,
which I believe backs what `health detail` reports, but Ceph refused to enable
that module. So I queried the restful module's `/server` endpoint instead. I
ran `grep -R cephfs-cluster-node-2` in /var/lib/ceph*/ and found a few logs
and binary files that matched, among them some of the OSD dirs. I tried to
drop one of the OSDs and re-add it later.

None of these things seemed to have any immediate effect. But suddenly, after
days of warning about a stray host with a daemon (that does not even belong to
that host), the warning disappeared. I am afraid I may have done some other
things out of desperation that I do not recall at the moment. I hope the stray
host + stray daemon warning does not return.

Thanks again!
Jakub

On Mon, 18 Nov 2024 at 16:11, Eugen Block <eblock@xxxxxx> wrote:

> Hi,
>
> just to be safe, did you fail the mgr? If not, try 'ceph mgr fail' and
> see if it still reports that information. It sounds like you didn't
> clean up your virtual MON after you drained the OSDs.
> Or how exactly did you drain that host? If you run 'ceph orch host
> drain {host}' the orchestrator will remove all daemons, not only OSDs.
> I assume that there are still some entries in the ceph config-key
> database, which is where the 'device ls-by-host' output comes from; I
> haven't verified that, though. Is this key present?
>
> ceph config-key get mgr/cephadm/host.cephfs-cluster-node-2 | jq
>
> Is the VM still reachable? If it is, check the MON directory under
> /var/lib/ceph/{FSID}/mon.X and remove it. You can also first check
> with 'cephadm ls --no-detail' whether cephadm believes there's a MON
> daemon located there. Remove the directory so cephadm forgets about it.
>
> Zitat von Jakub Daniel <jakub.daniel@xxxxxxxxx>:
>
> > I have found these two threads
> >
> > https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/VHZ7IJ7PAL7L2INLSHNVYY7V7ZCXD46G/#TSWERUMAEEGZPSYXG6PSS4YMRXPP3L63
> >
> > https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/NG5QVRTVCLLYNLK56CSYLIPE4WBFXS5U/#HJDBAJFX27KATC4WV2MKGLVGLN2HTWWD
> >
> > but I didn't exactly figure out what to do. I've since found remnants
> > of the removed cephfs-cluster-node-{0,1,2} in crush map buckets, which
> > I removed with no effect on health detail. I found out that the ceph
> > dashboard lists the non-existent cephfs-cluster-node-2 among ceph
> > hosts (while orch host ls doesn't). On the other hand, device
> > ls-by-host cephfs-cluster-node-2 lists an entry with a device name
> > coinciding with a drive on the live X host, with daemons listing the
> > mon.X.
> > Meanwhile, device ls-by-host X lists, among other things, the exact
> > same entry, with the difference that the dev column contains the
> > actual device nvme0n1, whereas with cephfs-cluster-node-2 the column
> > was empty:
> >
> > root@X:~# cephadm shell -- ceph device ls-by-host cephfs-cluster-node-2
> > DEVICE                                     DEV      DAEMONS  EXPECTED FAILURE
> > Samsung_SSD_970_PRO_1TB_S462NF0M310269L             mon.X
> >
> > root@X:~# cephadm shell -- ceph device ls-by-host X
> > DEVICE                                     DEV      DAEMONS  EXPECTED FAILURE
> > Samsung_SSD_850_EVO_250GB_S2R6NX0J423123P  sdb      osd.3
> > Samsung_SSD_860_EVO_1TB_S3Z9NY0M431048H             mon.Y
> > Samsung_SSD_970_PRO_1TB_S462NF0M310269L    nvme0n1  mon.X
> >
> > This output is confusing, since it also lists mon.Y when asked about
> > host X.
> >
> > I will continue investigating. If anyone has any hints on what to try
> > or where to look, I would be very grateful.
> >
> > Jakub
> >
> > On Sun, 17 Nov 2024, 15:34 Tim Holloway, <timh@xxxxxxxxxxxxx> wrote:
> >
> >> I think I can count 5 sources that Ceph can query to
> >> report/display/control its resources.
> >>
> >> 1. The /etc/ceph/ceph.conf file. Mostly supplanted by the Ceph
> >> configuration database.
> >>
> >> 2. The Ceph configuration database. A nameless key/value store
> >> internal to a Ceph filesystem. It's distributed (no fixed location)
> >> and accessed by Ceph commands and APIs.
> >>
> >> 3. Legacy Ceph resources: stuff found under a host's /var/lib/ceph
> >> directory.
> >>
> >> 4. Managed Ceph resources: stuff found under a host's
> >> /var/lib/ceph/{fsid} directory.
> >>
> >> 5. The live machine state of Ceph. Since this can vary not only from
> >> host to host but also from service to service, I don't think it is
> >> considered an authoritative source of information.
> >>
> >> Compounding this is that current releases of Ceph can all too easily
> >> end up in a "forbidden" state where you may have, for example, a
> >> legacy OSD.6 and a managed OSD.6 on the same host.
> >> In such a case, the system is generally operable but functionally
> >> corrupt, and it should ideally be corrected to remove the redundant
> >> resource.
> >>
> >> The real issue is that depending on which Ceph interface you're
> >> querying (or "ceph health" is querying!), you don't always get your
> >> answer from a single authoritative source, so you'll get conflicting
> >> results and annoying error reports. The "stray daemon" condition is
> >> an especially egregious example of this; it is not only possible
> >> because of a false detection from one of the above sources, but can
> >> also, I think, come from "dead" daemons being referenced in CRUSH.
> >>
> >> You might want to run through this list's history for the "phantom
> >> host" postings I made back around this past June, because I was
> >> absolutely plagued with them. Eugen Block helped me eventually purge
> >> them all.
> >>
> >> Regards,
> >> Tim
> >>
> >> On Sat, 2024-11-16 at 21:42 +0100, Jakub Daniel wrote:
> >>> Hello,
> >>>
> >>> I'm pretty new to ceph deployment. I have set up my first cephfs
> >>> cluster using cephadm. Initially, I deployed ceph in 3 virtualbox
> >>> instances that I called cephfs-cluster-node-{0,1,2}, just to test
> >>> things. Later, I added 5 more real hardware nodes. Later still, I
> >>> decided to remove the virtualboxes, so I drained the OSDs and
> >>> removed the hosts. Suddenly, ceph status detail started reporting
> >>>
> >>> HEALTH_WARN 1 stray host(s) with 1 daemon(s) not managed by cephadm
> >>> [WRN] CEPHADM_STRAY_HOST: 1 stray host(s) with 1 daemon(s) not managed by cephadm
> >>>     stray host cephfs-cluster-node-2 has 1 stray daemons: ['mon.X']
> >>>
> >>> The cephfs-cluster-node-2 host is no longer listed among the hosts;
> >>> it is (and has been for tens of hours) offline (powered down). The
> >>> mon.X doesn't even belong to that node; X is one of the real
> >>> hardware nodes.
I am > >>> unaware of > >>> mon.X ever running on cephfs-cluster-node-2 (never noticed it among > >>> systemd > >>> units). > >>> > >>> Where does cephadm shell -- ceph status detail come to the conclusion > >>> there > >>> is something stray? How can I address this? > >>> > >>> Thank you for any insights > >>> Jakub > >>> _______________________________________________ > >>> ceph-users mailing list -- ceph-users@xxxxxxx > >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx > >> > >> > > _______________________________________________ > > ceph-users mailing list -- ceph-users@xxxxxxx > > To unsubscribe send an email to ceph-users-leave@xxxxxxx > > _______________________________________________ > ceph-users mailing list -- ceph-users@xxxxxxx > To unsubscribe send an email to ceph-users-leave@xxxxxxx > _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx