Many thanks for your help, Eugen! Things are back to normal now :-)

/Z

On Fri, 16 Feb 2024 at 14:52, Eugen Block <eblock@xxxxxx> wrote:

> Sure, you can save the drivegroup spec in a file, edit it according to
> your requirements (not sure if having device paths in there makes
> sense though) and apply it:
>
> ceph orch apply -i new-drivegroup.yml
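>
> For example, roughly along these lines (untested here; if the export
> contains more than one OSD spec, trim the file down to just the
> ceph03_combined_osd spec before re-applying):
>
> # dump the current OSD spec(s) into a file
> ceph orch ls osd --export > new-drivegroup.yml
> # edit the file and delete the "- /dev/sde" entry under data_devices/paths,
> # then re-apply it
> ceph orch apply -i new-drivegroup.yml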
>
> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>
> > Many thanks for your response, Eugen!
> >
> > I tried to fail the mgr twice; unfortunately, that had no effect on the
> > issue. Neither `cephadm ceph-volume inventory` nor `ceph device
> > ls-by-host ceph03` have the failed drive on the list.
> >
> > Though your assumption is correct, the spec appears to explicitly
> > include the failed drive:
> >
> > ---
> > service_type: osd
> > service_id: ceph03_combined_osd
> > service_name: osd.ceph03_combined_osd
> > placement:
> >   hosts:
> >   - ceph03
> > spec:
> >   data_devices:
> >     paths:
> >     ...
> >     - /dev/sde
> >     ...
> >   db_devices:
> >     paths:
> >     - /dev/nvme0n1
> >     - /dev/nvme1n1
> >   filter_logic: AND
> >   objectstore: bluestore
> > ---
> >
> > Do you know the best way to remove the device from the spec?
> >
> > /Z
> >
> > On Fri, 16 Feb 2024 at 14:10, Eugen Block <eblock@xxxxxx> wrote:
> >
> >> Hi,
> >>
> >> sometimes the easiest fix is to fail over the mgr, have you tried that?
> >> If that didn't work, can you share the drivegroup spec?
> >>
> >> ceph orch ls <your_osd_spec> --export
> >>
> >> Does it contain specific device paths or something? Does 'cephadm ls'
> >> on that node show any traces of the previous OSD?
> >> I'd probably try to check some things like
> >>
> >> cephadm ceph-volume inventory
> >> ceph device ls-by-host <host>
> >>
> >> Regards,
> >> Eugen
> >>
> >> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >>
> >> > Hi,
> >> >
> >> > We had a physical drive malfunction in one of our Ceph OSD hosts
> >> > managed by cephadm (Ceph 16.2.14). I have removed the drive from the
> >> > system, and the kernel no longer sees it:
> >> >
> >> > ceph03 ~]# ls -al /dev/sde
> >> > ls: cannot access '/dev/sde': No such file or directory
> >> >
> >> > I have removed the corresponding OSD from cephadm, the crush map, etc.
> >> > For all intents and purposes that OSD and its block device no longer
> >> > exist:
> >> >
> >> > root@ceph01:/# ceph orch ps | grep osd.26
> >> > root@ceph01:/# ceph osd tree | grep 26
> >> > root@ceph01:/# ceph orch device ls | grep -E "ceph03.*sde"
> >> >
> >> > None of the above commands return anything. Cephadm correctly sees 8
> >> > remaining OSDs on the host:
> >> >
> >> > root@ceph01:/# ceph orch ls | grep ceph03_c
> >> > osd.ceph03_combined_osd    8  33s ago  2y  ceph03
> >> >
> >> > Unfortunately, cephadm appears to be trying to apply a spec to host
> >> > ceph03 including the disk that is now missing:
> >> >
> >> > RuntimeError: Failed command: /usr/bin/docker run --rm --ipc=host
> >> > --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume
> >> > --privileged --group-add=disk --init -e CONTAINER_IMAGE=
> >> > quay.io/ceph/ceph@sha256:843f112990e6489362c625229c3ea3d90b8734bd5e14e0aeaf89942fbb980a8b
> >> > -e NODE_NAME=ceph03 -e CEPH_USE_RANDOM_NONCE=1 -e
> >> > CEPH_VOLUME_OSDSPEC_AFFINITY=ceph03_combined_osd -e
> >> > CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v
> >> > /var/run/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86:/var/run/ceph:z -v
> >> > /var/log/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86:/var/log/ceph:z -v
> >> > /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/crash:/var/lib/ceph/crash:z
> >> > -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm
> >> > -v /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v
> >> > /tmp/ceph-tmpc7b33pf0:/etc/ceph/ceph.conf:z -v
> >> > /tmp/ceph-tmpq45nkmd6:/var/lib/ceph/bootstrap-osd/ceph.keyring:z
> >> > quay.io/ceph/ceph@sha256:843f112990e6489362c625229c3ea3d90b8734bd5e14e0aeaf89942fbb980a8b
> >> > lvm batch --no-auto /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
> >> > /dev/sdf /dev/sdg /dev/sdh /dev/sdi --db-devices /dev/nvme0n1
> >> > /dev/nvme1n1 --yes --no-systemd
> >> >
> >> > Note that `lvm batch` includes the missing drive, /dev/sde. This fails
> >> > because the drive no longer exists. Other than this cephadm
> >> > ceph-volume thingy, the cluster is healthy. How can I tell cephadm
> >> > that it should stop trying to use /dev/sde, which no longer exists,
> >> > without affecting other OSDs on the host?
> >> >
> >> > I would very much appreciate any advice or pointers.
> >> >
> >> > Best regards,
> >> > Zakhar

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx