Hi,

I'm having a strange problem with the orchestrator. My cluster has the following OSD services configured based on certain attributes of the disks:
NAME                 PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
...
osd.osd-default-hdd         1351     2m ago     22m  label:osd;HOSTPREFIX*
osd.osd-default-ssd         0        -          22m  label:osd;HOSTPREFIX*
osd.osd-small-hdd           41       2m ago     22m  label:osd;HOSTPREFIX*

These apply to three device types: large HDDs (8TB+), small HDDs (250G-7TB), and SSDs (1TB+). I did that with the following YAML definition:
service_type: osd
service_id: osd-default-hdd
service_name: osd.osd-default-hdd
placement:
  host_pattern: HOSTPREFIX*
  label: osd
spec:
  crush_device_class: hdd
  data_devices:
    rotational: 1
    size: '8T:'
  filter_logic: AND
  objectstore: bluestore
  osds_per_device: 1
---
service_type: osd
service_id: osd-default-ssd
service_name: osd.osd-default-ssd
placement:
  host_pattern: HOSTPREFIX*
  label: osd
spec:
  crush_device_class: ssd
  data_devices:
    rotational: 0
    size: '1T:'
  filter_logic: AND
  objectstore: bluestore
  osds_per_device: 1
---
service_type: osd
service_id: osd-small-hdd
service_name: osd.osd-small-hdd
placement:
  host_pattern: HOSTPREFIX*
  label: osd
spec:
  crush_device_class: hdd-small
  data_devices:
    rotational: 1
    size: 250G:7T
  filter_logic: AND
  objectstore: bluestore
  osds_per_device: 1

Previously this worked perfectly, but as you can see in the summary above, the orchestrator has now suddenly started to ignore the device class and data_devices filters for SSDs and incorrectly added all SSDs to the osd.osd-default-hdd service (oddly enough, hdd-small still works).
The affected devices still have the correct device class in the CRUSH tree, and data placement also looks fine. The orchestrator's service listing, however, is wrong. I tried cleaning out and freshly redeploying one of the SSD OSDs (roughly the commands sketched at the end of this mail), but the redeployed daemon still has the following in its unit.meta file:
{ "service_name": "osd.osd-default-hdd", "ports": [], "ip": null, "deployed_by": [ "quay.io/ceph/ceph@sha256:ac06cdca6f2512a763f1ace8553330e454152b82f95a2b6bf33c3f3ec2eeac77", "quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906" ], "rank": null, "rank_generation": null, "extra_container_args": null, "extra_entrypoint_args": null, "memory_request": null, "memory_limit": null }Any idea what might be causing this? I'm on Ceph 18.2.4 (upgrade planned, but I need to wait out some remapped PGs first).
Janek