cephadm Failed to apply 1 service(s)

Hi,

We had a physical drive malfunction in one of our Ceph OSD hosts managed by
cephadm (Ceph 16.2.14). I have removed the drive from the system, and the
kernel no longer sees it:

ceph03 ~]# ls -al /dev/sde
ls: cannot access '/dev/sde': No such file or directory

I have removed the corresponding OSD from cephadm, the CRUSH map, etc.
(roughly the sequence sketched after the output below). For all intents and
purposes, that OSD and its block device no longer exist:

root@ceph01:/# ceph orch ps | grep osd.26
root@ceph01:/# ceph osd tree | grep osd.26
root@ceph01:/# ceph orch device ls | grep -E "ceph03.*sde"

None of the above commands return anything. Cephadm correctly sees 8
remaining OSDs on the host:

root@ceph01:/# ceph orch ls | grep ceph03_c
osd.ceph03_combined_osd                     8  33s ago    2y   ceph03
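
For reference, the removal was along these lines (a rough sketch; the exact
commands may have differed slightly):

# remove the daemon from cephadm management, then purge the OSD id
# (removes it from the CRUSH map, auth, and the OSD map)
ceph orch daemon rm osd.26 --force
ceph osd purge 26 --yes-i-really-mean-it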

Unfortunately, cephadm still appears to be trying to apply a spec to host
ceph03 that includes the now-missing disk:

RuntimeError: Failed command: /usr/bin/docker run --rm --ipc=host
--stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume
--privileged --group-add=disk --init -e CONTAINER_IMAGE=
quay.io/ceph/ceph@sha256:843f112990e6489362c625229c3ea3d90b8734bd5e14e0aeaf89942fbb980a8b
-e NODE_NAME=ceph03 -e CEPH_USE_RANDOM_NONCE=1 -e
CEPH_VOLUME_OSDSPEC_AFFINITY=ceph03_combined_osd -e
CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v
/var/run/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86:/var/run/ceph:z -v
/var/log/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86:/var/log/ceph:z -v
/var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/crash:/var/lib/ceph/crash:z
-v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
/run/lock/lvm:/run/lock/lvm -v /:/rootfs -v
/tmp/ceph-tmpc7b33pf0:/etc/ceph/ceph.conf:z -v
/tmp/ceph-tmpq45nkmd6:/var/lib/ceph/bootstrap-osd/ceph.keyring:z
quay.io/ceph/ceph@sha256:843f112990e6489362c625229c3ea3d90b8734bd5e14e0aeaf89942fbb980a8b
lvm batch --no-auto /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
/dev/sdg /dev/sdh /dev/sdi --db-devices /dev/nvme0n1 /dev/nvme1n1 --yes
--no-systemd

Note that the `lvm batch` call includes the missing drive, /dev/sde, and it
fails because that device no longer exists. Other than this cephadm
ceph-volume issue, the cluster is healthy.

How can I tell cephadm to stop trying to use /dev/sde without affecting the
other OSDs on the host?
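
For reference, this is how I understand the offending spec could be dumped
and re-applied after editing, though I am not sure whether that is the right
(or safe) approach, which is why I am asking. A sketch only; the exact
contents of our ceph03_combined_osd spec may differ:

# dump the OSD service spec(s) currently managed by cephadm
ceph orch ls osd --export > osd-specs.yaml

# edit osd-specs.yaml (e.g. set "unmanaged: true" for osd.ceph03_combined_osd,
# or list the surviving devices explicitly), then re-apply it
ceph orch apply -i osd-specs.yaml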

I would very much appreciate any advice or pointers.

Best regards,
Zakhar
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


