Hi Janek,

Have you tried looking into the orchestrator's decisions?

$ ceph config set mgr mgr/cephadm/log_to_cluster_level debug

then

$ ceph -W cephadm --watch-debug

or look into the active MGR's /var/log/ceph/$(ceph fsid)/ceph.cephadm.log

Regards,
Frédéric.

----- On 10 Jan 25, at 13:53, Janek Bevendorff janek.bevendorff@xxxxxxxxxxxxx wrote:

> Hi,
>
> I'm having a strange problem with the orchestrator. My cluster has the
> following OSD services configured based on certain attributes of the disks:
>
> NAME                 PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
> ...
> osd.osd-default-hdd          1351     2m ago     22m  label:osd;HOSTPREFIX*
> osd.osd-default-ssd             0     -          22m  label:osd;HOSTPREFIX*
> osd.osd-small-hdd              41     2m ago     22m  label:osd;HOSTPREFIX*
>
> These apply to three device types: large HDDs (8TB+), small HDDs
> (250G-7TB), and SSDs (1TB+). I did that with the following YAML definitions:
>
> service_type: osd
> service_id: osd-default-hdd
> service_name: osd.osd-default-hdd
> placement:
>   host_pattern: HOSTPREFIX*
>   label: osd
> spec:
>   crush_device_class: hdd
>   data_devices:
>     rotational: 1
>     size: '8T:'
>   filter_logic: AND
>   objectstore: bluestore
>   osds_per_device: 1
> ---
> service_type: osd
> service_id: osd-default-ssd
> service_name: osd.osd-default-ssd
> placement:
>   host_pattern: HOSTPREFIX*
>   label: osd
> spec:
>   crush_device_class: ssd
>   data_devices:
>     rotational: 0
>     size: '1T:'
>   filter_logic: AND
>   objectstore: bluestore
>   osds_per_device: 1
> ---
> service_type: osd
> service_id: osd-small-hdd
> service_name: osd.osd-small-hdd
> placement:
>   host_pattern: HOSTPREFIX*
>   label: osd
> spec:
>   crush_device_class: hdd-small
>   data_devices:
>     rotational: 1
>     size: 250G:7T
>   filter_logic: AND
>   objectstore: bluestore
>   osds_per_device: 1
>
> Previously, this worked perfectly, but as you can see in the summary
> above, the orchestrator has now suddenly started ignoring the device class
> and data_devices filters for SSDs and has incorrectly added all SSDs to the
> osd.osd-default-hdd service (weirdly enough, hdd-small still works).
>
> The affected devices still have the correct device class in the CRUSH
> tree, and the data placement also looks fine. The orchestrator
> service listing, however, is incorrect. I tried cleaning out and freshly
> redeploying one of the SSD OSDs, but the redeployed service still has
> the following in its unit.meta file:
>
> {
>     "service_name": "osd.osd-default-hdd",
>     "ports": [],
>     "ip": null,
>     "deployed_by": [
>         "quay.io/ceph/ceph@sha256:ac06cdca6f2512a763f1ace8553330e454152b82f95a2b6bf33c3f3ec2eeac77",
>         "quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906"
>     ],
>     "rank": null,
>     "rank_generation": null,
>     "extra_container_args": null,
>     "extra_entrypoint_args": null,
>     "memory_request": null,
>     "memory_limit": null
> }
>
> Any idea what might be causing this? I'm on Ceph 18.2.4 (an upgrade is
> planned, but I need to wait out some remapped PGs first).
>
> Janek

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
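
For readers hitting the same symptom: a minimal sketch of cephadm commands that can help confirm which OSD spec the orchestrator matches each device against. The file name osd-specs.yaml is only a placeholder for wherever the three definitions above are kept, and the exact output fields vary by release, so treat this as a starting point rather than a definitive procedure.

# Show the OSD specs exactly as the orchestrator has stored them
$ ceph orch ls --service-type osd --export

# List the devices the orchestrator sees on each host, including size and rotational flags
$ ceph orch device ls --wide

# Preview which devices each spec would claim, without actually deploying anything
$ ceph orch apply -i osd-specs.yaml --dry-run

# Check which service_name each running OSD daemon is currently attributed to
$ ceph orch ps --daemon-type osd --format yaml | grep -E 'daemon_id|service_name'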
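
Once the debug logging Frédéric describes is enabled, the orchestrator's decisions should appear in the cephadm log on the active MGR host. Something along these lines (the grep pattern is just an example built from the service names above) can narrow the output down to the affected services:

$ grep -iE 'osd-default-ssd|osd-default-hdd' /var/log/ceph/$(ceph fsid)/ceph.cephadm.log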