Many thanks for your help, Eugen! Things are back to normal now :-)

/Z

On Fri, 16 Feb 2024 at 14:52, Eugen Block <eblock@xxxxxx> wrote:

> Sure, you can save the drivegroup spec in a file, edit it according to
> your requirements (not sure if having device paths in there makes
> sense though) and apply it:
>
> ceph orch apply -i new-drivegroup.yml
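>
> For example, roughly along these lines (untested here; if the export
> contains more than one OSD spec, trim the file down to just the
> ceph03_combined_osd spec before re-applying):
>
> # dump the current OSD spec(s) into a file
> ceph orch ls osd --export > new-drivegroup.yml
> # edit the file and delete the "- /dev/sde" entry under data_devices/paths,
> # then re-apply it
> ceph orch apply -i new-drivegroup.yml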
>
> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>
> > Many thanks for your response, Eugen!
> >
> > I tried to fail the mgr twice; unfortunately, that had no effect on the
> > issue. Neither `cephadm ceph-volume inventory` nor `ceph device
> > ls-by-host ceph03` have the failed drive on the list.
> >
> > Though your assumption is correct, the spec appears to explicitly
> > include the failed drive:
> >
> > ---
> > service_type: osd
> > service_id: ceph03_combined_osd
> > service_name: osd.ceph03_combined_osd
> > placement:
> >   hosts:
> >   - ceph03
> > spec:
> >   data_devices:
> >     paths:
> >     ...
> >     - /dev/sde
> >     ...
> >   db_devices:
> >     paths:
> >     - /dev/nvme0n1
> >     - /dev/nvme1n1
> >   filter_logic: AND
> >   objectstore: bluestore
> > ---
> >
> > Do you know the best way to remove the device from the spec?
> >
> > /Z
> >
> > On Fri, 16 Feb 2024 at 14:10, Eugen Block <eblock@xxxxxx> wrote:
> >
> >> Hi,
> >>
> >> sometimes the easiest fix is to fail over the mgr, have you tried that?
> >> If that didn't work, can you share the drivegroup spec?
> >>
> >> ceph orch ls <your_osd_spec> --export
> >>
> >> Does it contain specific device paths or something? Does 'cephadm ls'
> >> on that node show any traces of the previous OSD?
> >> I'd probably try to check some things like
> >>
> >> cephadm ceph-volume inventory
> >> ceph device ls-by-host <host>
> >>
> >> Regards,
> >> Eugen
> >>
> >> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >>
> >> > Hi,
> >> >
> >> > We had a physical drive malfunction in one of our Ceph OSD hosts
> >> > managed by cephadm (Ceph 16.2.14). I have removed the drive from the
> >> > system, and the kernel no longer sees it:
> >> >
> >> > ceph03 ~]# ls -al /dev/sde
> >> > ls: cannot access '/dev/sde': No such file or directory
> >> >
> >> > I have removed the corresponding OSD from cephadm, the crush map, etc.
> >> > For all intents and purposes that OSD and its block device no longer
> >> > exist:
> >> >
> >> > root@ceph01:/# ceph orch ps | grep osd.26
> >> > root@ceph01:/# ceph osd tree | grep 26
> >> > root@ceph01:/# ceph orch device ls | grep -E "ceph03.*sde"
> >> >
> >> > None of the above commands return anything. Cephadm correctly sees 8
> >> > remaining OSDs on the host:
> >> >
> >> > root@ceph01:/# ceph orch ls | grep ceph03_c
> >> > osd.ceph03_combined_osd    8  33s ago  2y  ceph03
> >> >
> >> > Unfortunately, cephadm appears to be trying to apply a spec to host
> >> > ceph03 including the disk that is now missing:
> >> >
> >> > RuntimeError: Failed command: /usr/bin/docker run --rm --ipc=host
> >> > --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume
> >> > --privileged --group-add=disk --init -e CONTAINER_IMAGE=
> >> > quay.io/ceph/ceph@sha256:843f112990e6489362c625229c3ea3d90b8734bd5e14e0aeaf89942fbb980a8b
> >> > -e NODE_NAME=ceph03 -e CEPH_USE_RANDOM_NONCE=1 -e
> >> > CEPH_VOLUME_OSDSPEC_AFFINITY=ceph03_combined_osd -e
> >> > CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v
> >> > /var/run/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86:/var/run/ceph:z -v
> >> > /var/log/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86:/var/log/ceph:z -v
> >> > /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/crash:/var/lib/ceph/crash:z
> >> > -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm
> >> > -v /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v
> >> > /tmp/ceph-tmpc7b33pf0:/etc/ceph/ceph.conf:z -v
> >> > /tmp/ceph-tmpq45nkmd6:/var/lib/ceph/bootstrap-osd/ceph.keyring:z
> >> > quay.io/ceph/ceph@sha256:843f112990e6489362c625229c3ea3d90b8734bd5e14e0aeaf89942fbb980a8b
> >> > lvm batch --no-auto /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
> >> > /dev/sdf /dev/sdg /dev/sdh /dev/sdi --db-devices /dev/nvme0n1
> >> > /dev/nvme1n1 --yes --no-systemd
> >> >
> >> > Note that `lvm batch` includes the missing drive, /dev/sde. This fails
> >> > because the drive no longer exists. Other than this cephadm
> >> > ceph-volume thingy, the cluster is healthy. How can I tell cephadm
> >> > that it should stop trying to use /dev/sde, which no longer exists,
> >> > without affecting other OSDs on the host?
> >> >
> >> > I would very much appreciate any advice or pointers.
> >> >
> >> > Best regards,
> >> > Zakhar

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx