Hi,
Thank you, Eugen and Tim.
> did you fail the mgr?
I think I didn't.
> Or how exactly did you drain that host?
```
cephadm shell -- ceph orch host drain cephfs-cluster-node-2
cephadm shell -- ceph orch host rm cephfs-cluster-node-2
```
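(For what it's worth, this is roughly how I would double-check that the drain actually finished; just a sketch, and the output may of course look different on other versions:)
```
# check whether any OSD removals are still pending
ceph orch osd rm status

# confirm the host and its daemons are gone from the orchestrator's view
ceph orch host ls
ceph orch ps cephfs-cluster-node-2
```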
> Is this key present?
```
ceph config-key get mgr/cephadm/host.cephfs-cluster-node-2 | jq
Error ENOENT:
```
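A related check that might be useful to others; just a sketch, the grep pattern is simply the hostname:
```
# look for any leftover cephadm config-key entries mentioning the removed host
ceph config-key ls | grep -i cephfs-cluster-node-2
```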
> Is the VM still reachable?
No. It wasn't reachable for the majority of the time the warning was displayed.
> Remove the directory so cephadm forgets about it.
What directory do you mean?
I have since made the VM reachable again and removed it again a few times.
I even tried removing the mon.X that was running on X (and ceph knew about
it) while ceph was complaining that mon.X was running on
cephfs-cluster-node-2.
I disabled and re-enabled the stray warnings.
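For the record, I believe these are the relevant cephadm options (just a sketch of the toggle, with option names as I understand them from the docs):
```
# silence the stray host/daemon warnings...
ceph config set mgr mgr/cephadm/warn_on_stray_hosts false
ceph config set mgr mgr/cephadm/warn_on_stray_daemons false

# ...and turn them back on
ceph config set mgr mgr/cephadm/warn_on_stray_hosts true
ceph config set mgr mgr/cephadm/warn_on_stray_daemons true
```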
I found some custom config pertaining to cephfs-cluster-node-0 (which
`health detail` doesn't report as stray), namely `osd
host:cephfs-cluster-node-0 basic osd_memory_target 5829100748`.
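In case anyone wants to do the same check, something along these lines should find and clear host-scoped options left over from removed hosts (a sketch; the `host:` mask syntax is my understanding of how that entry was created):
```
# list options still scoped to the old test hosts
ceph config dump | grep cephfs-cluster-node

# drop a host-masked option, e.g. the osd_memory_target entry above
ceph config rm osd/host:cephfs-cluster-node-0 osd_memory_target
```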
I tried enabling the mgr module `cli` to query `list_servers`, which I believe
is what `health detail` queries behind the scenes, but ceph refused to enable
that module. So I queried the restful module's `/server` endpoint instead.
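Roughly how one gets at `/server` (a sketch; the key name 'stray-debug' is an arbitrary example, and the port/mgr host will differ per cluster):
```
# the restful module needs a certificate and an API key first
ceph mgr module enable restful
ceph restful create-self-signed-cert
ceph restful create-key stray-debug        # prints an API key

# query the server/daemon inventory from the active mgr (default port 8003)
curl -k -u stray-debug:<api-key> https://<active-mgr>:8003/server
```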
I ran `grep -R cephfs-cluster-node-2` in /var/lib/ceph*/ and found a few
logs and binary files that matched, among them some of the osd dirs.
I tried to drop one of the osds and re-add it later.
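(Roughly along these lines; a sketch rather than the exact commands I used, and the placeholders need filling in:)
```
# remove the OSD via the orchestrator and wipe the device
ceph orch osd rm <osd-id> --zap
ceph orch osd rm status

# re-add it later, either by letting the service spec pick the disk up again
# or explicitly:
ceph orch daemon add osd <host>:<device>
```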
None of these things seemed to have any immediate effect.
But suddenly, after days of warnings about a stray host with a daemon (that
does not even belong to that host), the warning disappeared.
I am afraid I may have done some other things out of desperation that I do
not recall at the moment.
I hope the stray host + stray daemon warning does not return.
Thanks again!
Jakub
On Mon, 18 Nov 2024 at 16:11, Eugen Block <eblock@xxxxxx> wrote:
Hi,
just to be safe, did you fail the mgr? If not, try 'ceph mgr fail' and
see if it still reports that information. It sounds like you didn't
clean up your virtual MON after you drained the OSDs. Or how exactly
did you drain that host? If you run 'ceph orch host drain {host}' the
orchestrator will remove all daemons, not only OSDs. I assume that
there are still some entries in the ceph config-key database, which is where
the 'device ls-by-host' output comes from; I haven't verified that, though.
Is this key present?
ceph config-key get mgr/cephadm/host.cephfs-cluster-node-2 | jq
Is the VM still reachable? If it is, check the MON directory under
/var/lib/ceph/{FSID}/mon.X and remove it. You can also first check
with 'cephadm ls --no-detail' if cephadm believes that there's a MON
daemon located there. Remove the directory so cephadm forgets about it.
Quoting Jakub Daniel <jakub.daniel@xxxxxxxxx>:
> I have found these two threads
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/VHZ7IJ7PAL7L2INLSHNVYY7V7ZCXD46G/#TSWERUMAEEGZPSYXG6PSS4YMRXPP3L63
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/NG5QVRTVCLLYNLK56CSYLIPE4WBFXS5U/#HJDBAJFX27KATC4WV2MKGLVGLN2HTWWD
>
> but I didn't exactly figure out what to do. I've since found remnants of the
> removed cephfs-cluster-node-{0,1,2} in crush map buckets, which I removed
> with no effect on health detail. I found out that the ceph dashboard lists
> the non-existent cephfs-cluster-node-2 among ceph hosts (while orch host ls
> doesn't). On the other hand, device ls-by-host cephfs-cluster-node-2 lists
> an entry with a device name coinciding with a drive on the live host X, with
> daemons listing mon.X. Meanwhile, device ls-by-host X lists, among other
> things, the exact same entry, with the difference that in the dev column
> there is the actual device nvme0n1, whereas with cephfs-cluster-node-2 the
> column was empty.
>
> root@X:~# cephadm shell -- ceph device ls-by-host cephfs-cluster-node-2
> DEVICE                                      DEV      DAEMONS  EXPECTED FAILURE
> Samsung_SSD_970_PRO_1TB_S462NF0M310269L              mon.X
>
> root@X:~# cephadm shell -- ceph device ls-by-host X
> DEVICE                                      DEV      DAEMONS  EXPECTED FAILURE
> Samsung_SSD_850_EVO_250GB_S2R6NX0J423123P  sdb      osd.3
> Samsung_SSD_860_EVO_1TB_S3Z9NY0M431048H             mon.Y
> Samsung_SSD_970_PRO_1TB_S462NF0M310269L    nvme0n1  mon.X
>
> This output is confusing since it also lists mon.Y when asked for host X.
>
> I will continue investigating. If anyone has any hints on what to try or
> where to look, I would be very grateful.
>
> Jakub
>
> On Sun, 17 Nov 2024, 15:34 Tim Holloway, <timh@xxxxxxxxxxxxx> wrote:
>
>> I think I can count 5 sources that Ceph can query to
>> report/display/control its resources.
>>
>> 1. The /etc/ceph/ceph.conf file. Mostly supplanted by the Ceph
>> configuration database.
>>
>> 2. The ceph configuration database. A nameless key/value store internal
>> to a ceph filesystem. It's distributed (no fixed location), accessed by
>> Ceph commands and APIs.
>>
>> 3. Legacy Ceph resources. Stuff found under a host's /var/lib/ceph
>> directory.
>>
>> 4. Managed Ceph resources. Stuff found under a host's
>> /var/lib/ceph/{fsid} directory.
>>
>> 5. The live machine state of Ceph. Since this can vary not only from
>> host to host, but also from service to service, I don't think it is
>> considered to be an authoritative source of information.
>>
>> Compounding this is that current releases of Ceph can all too easily
>> end up in a "forbidden" state where you may have, for example, a legacy
>> OSD.6 and a managed OSD.6 on the same host. In such a case, the system is
>> generally operable but functionally corrupt, and it should ideally be
>> corrected to remove the redundant resource.
>>
>> The real issue is that, depending on which Ceph interface you're querying
>> (or "ceph health" is querying!), you don't always get your answer from
>> a single authoritative source, so you'll get conflicting results and
>> annoying error reports. The "stray daemon" condition is an especially
>> egregious example of this, and it's not only possible because of a
>> false detection from one of the above sources, but also, I think, can
>> come from "dead" daemons being referenced in CRUSH.
>>
>> You might want to run through this list's history for "phantom host"
>> postings made by me back around this past June, because I was absolutely
>> plagued with them. Eugen Block helped me eventually purge them all.
>>
>> Regards,
>> Tim
>>
>> On Sat, 2024-11-16 at 21:42 +0100, Jakub Daniel wrote:
>>> Hello,
>>>
>>> I'm pretty new to ceph deployment. I have set up my first cephfs
>>> cluster
>>> using cephadm. Initially, I deployed ceph in 3 virtualbox instances
>>> that I
>>> called cephfs-cluster-node-{0, 1, 2} just to test things. Later, I
>>> added 5
>>> more real hardware nodes. Later I decided I'd remove the
>>> virtualboxes, so I
>>> drained the osds and removed the hosts. Suddenly, ceph status detail
>>> started reporting
>>>
>>> HEALTH_WARN 1 stray host(s) with 1 daemon(s) not managed by cephadm
>>> [WRN]
>>> CEPHADM_STRAY_HOST: 1 stray host(s) with 1 daemon(s) not managed by
>>> cephadm
>>> stray host cephfs-cluster-node-2 has 1 stray daemons: ['mon.X']
>>>
>>> The cephfs-cluster-node-2 is no longer listed among hosts, it is (and
>>> has
>>> been for tens of hours) offline (powered down). The mon.X doesn't
>>> even
>>> belong to that node, it is one of the real hardware nodes. I am
>>> unaware of
>>> mon.X ever running on cephfs-cluster-node-2 (never noticed it among
>>> systemd
>>> units).
>>>
>>> Where does cephadm shell -- ceph status detail come to the conclusion
>>> there
>>> is something stray? How can I address this?
>>>
>>> Thank you for any insights
>>> Jakub
>>
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx