in the "ceph orch device ls --format json-pretty" output, in the blob for that specific device, is the "ceph_device" field set? There was a bug where it wouldn't be set at all (https://tracker.ceph.com/issues/57100) and it would make it so you couldn't use a device serving as a db device for any further OSDs, unless the device was fully cleaned out (so it is no longer serving as a db device). The "ceph_device" field is meant to be our way of knowing "yes there are LVM partitions here, but they're our partitions for ceph stuff, so we can still use the device" and without it (or with it just being broken, as in the tracker) redeploying OSDs that used the device for its DB wasn't working as we don't know if those LVs imply its our device or has LVs for some other purpose. I had thought this was fixed already in 16.2.13 but it sounds too similar to what you're seeing not to consider it. On Tue, Jul 18, 2023 at 10:53 AM Luis Domingues <luis.domingues@xxxxxxxxx> wrote: > Hi, > > We are running a ceph cluster managed with cephadm v16.2.13. Recently we > needed to change a disk, and we replaced it with: > > ceph orch osd rm 37 --replace. > > It worked fine, the disk was drained and the OSD marked as destroy. > > However, after changing the disk, no OSD was created. Looking to the db > device, the partition for db for OSD 37 was still there. So we destroyed it > using: > ceph-volume lvm zap --osd-id=37 --destroy. > > But we still have no OSD redeployed. > Here we have our spec: > > --- > service_type: osd > service_id: osd-hdd > placement: > label: osds > spec: > data_devices: > rotational: 1 > encrypted: true > db_devices: > size: '1TB:2TB' db_slots: 12 > > And the disk looks good: > > HOST PATH TYPE DEVICE ID SIZE AVAILABLE REFRESHED REJECT REASONS > node05 /dev/nvme2n1 ssd SAMSUNG MZPLJ1T6HBJR-00007_S55JNG0R600357 1600G > 12m ago LVM detected, locked > > node05 /dev/sdk hdd SEAGATE_ST10000NM0206_ZA21G2170000C7240KPF 10.0T Yes > 12m ago > > And VG on db_device looks to have enough space: > ceph-33b06f1a-f6f6-57cf-9ca8-6e4aa81caae0 1 11 0 wz--n- <1.46t 173.91g > > If I remove the db_devices and db_slots from the specs, and do a dry run, > the orchestrator seems to see the new disk as available: > > ceph orch apply -i osd_specs.yml --dry-run > WARNING! Dry-Runs are snapshots of a certain point in time and are bound > to the current inventory setup. If any of these conditions change, the > preview will be invalid. Please make sure to have a minimal > timeframe between planning and applying the specs. > #################### > SERVICESPEC PREVIEWS > #################### > +---------+------+--------+-------------+ > |SERVICE |NAME |ADD_TO |REMOVE_FROM | > +---------+------+--------+-------------+ > +---------+------+--------+-------------+ > ################ > OSDSPEC PREVIEWS > ################ > +---------+---------+-------------------------+----------+----+-----+ > |SERVICE |NAME |HOST |DATA |DB |WAL | > +---------+---------+-------------------------+----------+----+-----+ > |osd |osd-hdd |node05 |/dev/sdk |- |- | > +---------+---------+-------------------------+----------+----+-----+ > > But as soon as I add db_devices back, the orchestrator is happy as it is, > like there is nothing to do: > > ceph orch apply -i osd_specs.yml --dry-run > WARNING! Dry-Runs are snapshots of a certain point in time and are bound > to the current inventory setup. If any of these conditions change, the > preview will be invalid. Please make sure to have a minimal > timeframe between planning and applying the specs. 
> ####################
> SERVICESPEC PREVIEWS
> ####################
> +---------+------+--------+-------------+
> |SERVICE |NAME |ADD_TO |REMOVE_FROM |
> +---------+------+--------+-------------+
> +---------+------+--------+-------------+
> ################
> OSDSPEC PREVIEWS
> ################
> +---------+------+------+------+----+-----+
> |SERVICE |NAME |HOST |DATA |DB |WAL |
> +---------+------+------+------+----+-----+
>
> I do not know why ceph will not use this disk, and I do not know where to
> look. It seems logs are not saying anything. And the weirdest thing,
> another disk was replaced on the same machine, and it went without any
> issues.
>
> Luis Domingues
> Proton AG
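
For reference, here's the rough sketch I mentioned above for checking that field. It's just a quick helper, nothing official: it shells out to "ceph orch device ls --format json" (the same data as json-pretty) and then walks whatever JSON comes back looking for the blob whose "path" matches the device, rather than assuming the exact nesting, which I haven't double-checked against 16.2.13. The /dev/nvme2n1 path is taken from your device listing; swap in whichever db device you're checking.

import json
import subprocess

# The db device from your "ceph orch device ls" output; adjust as needed.
DEVICE = "/dev/nvme2n1"

# Same data as the json-pretty variant, just easier to parse.
raw = subprocess.run(
    ["ceph", "orch", "device", "ls", "--format", "json"],
    check=True, capture_output=True, text=True,
).stdout

def walk(node):
    # Yield every dict nested anywhere in the structure, so we don't have to
    # assume exactly how hosts/devices are laid out in this release.
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)

for blob in walk(json.loads(raw)):
    if blob.get("path") == DEVICE:
        if "ceph_device" in blob:
            print(f"{DEVICE}: ceph_device = {blob['ceph_device']}")
        else:
            print(f"{DEVICE}: ceph_device field is missing entirely")

If that comes back missing (or false) even though the only LVs on the device are ceph's own db LVs, that would line up with the tracker issue above.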