Re: cephadm Failed to apply 1 service(s)

Hi,

Sometimes the easiest fix is to fail over the mgr; have you tried that? If that doesn't help, can you share the drivegroup spec?
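A mgr failover doesn't touch the OSDs at all, it just makes a standby mgr take over:

ceph mgr fail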

ceph orch ls <your_osd_spec> --export
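
Just to illustrate what to look for in the exported spec (a sketch only, your real spec will differ), a filter-based spec looks roughly like this:

service_type: osd
service_id: ceph03_combined_osd
placement:
  hosts:
  - ceph03
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0

whereas a spec with pinned device paths has, under spec:, something like:

  data_devices:
    paths:
    - /dev/sda
    - /dev/sde
  db_devices:
    paths:
    - /dev/nvme0n1
    - /dev/nvme1n1

If /dev/sde is still listed explicitly, that would explain why ceph-volume keeps being handed the missing disk.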

Does it contain specific device paths or something? Does 'cephadm ls' on that node show any traces of the previous OSD?
I'd probably also check a few things, like:

cephadm ceph-volume inventory
ceph device ls-by-host <host>
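
If those outputs are long, filtering for the old OSD id or the device is enough, e.g.:

cephadm ls | grep -i osd
cephadm ceph-volume inventory | grep sde

Refreshing the orchestrator's device cache may also be worth a try:

ceph orch device ls --refresh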

Regards,
Eugen

Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:

Hi,

We had a physical drive malfunction in one of our Ceph OSD hosts managed by
cephadm (Ceph 16.2.14). I have removed the drive from the system, and the
kernel no longer sees it:

ceph03 ~]# ls -al /dev/sde
ls: cannot access '/dev/sde': No such file or directory

I have removed the corresponding OSD from cephadm, crush map, etc. For all
intents and purposes that OSD and its block device no longer exist:

root@ceph01:/# ceph orch ps | grep osd.26
root@ceph01:/# ceph osd tree | grep 26
root@ceph01:/# ceph orch device ls | grep -E "ceph03.*sde"

None of the above commands return anything. Cephadm correctly sees 8
remaining OSDs on the host:

root@ceph01:/# ceph orch ls | grep ceph03_c
osd.ceph03_combined_osd                     8  33s ago    2y   ceph03

Unfortunately, cephadm still appears to be trying to apply a spec to host ceph03
that includes the now-missing disk:

RuntimeError: Failed command: /usr/bin/docker run --rm --ipc=host
--stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume
--privileged --group-add=disk --init -e CONTAINER_IMAGE=
quay.io/ceph/ceph@sha256:843f112990e6489362c625229c3ea3d90b8734bd5e14e0aeaf89942fbb980a8b
-e NODE_NAME=ceph03 -e CEPH_USE_RANDOM_NONCE=1 -e
CEPH_VOLUME_OSDSPEC_AFFINITY=ceph03_combined_osd -e
CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v
/var/run/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86:/var/run/ceph:z -v
/var/log/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86:/var/log/ceph:z -v
/var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/crash:/var/lib/ceph/crash:z
-v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
/run/lock/lvm:/run/lock/lvm -v /:/rootfs -v
/tmp/ceph-tmpc7b33pf0:/etc/ceph/ceph.conf:z -v
/tmp/ceph-tmpq45nkmd6:/var/lib/ceph/bootstrap-osd/ceph.keyring:z
quay.io/ceph/ceph@sha256:843f112990e6489362c625229c3ea3d90b8734bd5e14e0aeaf89942fbb980a8b
lvm batch --no-auto /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
/dev/sdg /dev/sdh /dev/sdi --db-devices /dev/nvme0n1 /dev/nvme1n1 --yes
--no-systemd

Note that `lvm batch` includes the missing drive, /dev/sde. This fails
because the drive no longer exists. Other than this cephadm ceph-volume
issue, the cluster is healthy. How can I tell cephadm to stop trying to use
/dev/sde, which no longer exists, without affecting the other OSDs on the
host?

I would very much appreciate any advice or pointers.

Best regards,
Zakhar


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


