> When looking at the very verbose cephadm logs, it seemed that cephadm was
> just skipping my node, with a message saying that a node was already part
> of another spec.

If you have it, would you mind sharing what this message was? I'm still not
totally sure what happened here.

On Wed, Jul 19, 2023 at 10:15 AM Luis Domingues <luis.domingues@xxxxxxxxx> wrote:

> So good news, I was not hit by the bug you mention in this thread.
>
> What happened (apparently; I have not tried to replicate it yet) is that
> I had another OSD (let's call it OSD.1) using the db device, but it was
> part of an old spec (let's call it spec-a). And the OSD (OSD.2) I removed
> should be detected as part of spec-b. The difference between them was just
> the name and the placement, using labels instead of hostname.
>
> When looking at the very verbose cephadm logs, it seemed that cephadm was
> just skipping my node, with a message saying that a node was already part
> of another spec.
>
> I purged OSD.1 with --replace and --zap, and once the disks were empty and
> ready to go, cephadm just added back OSD.1 and OSD.2 using the db_device as
> specified.
>
> I do not know if this is the intended behavior, or if I was just lucky,
> but all my OSDs are back in the cluster.
>
> Luis Domingues
> Proton AG
>
>
> ------- Original Message -------
> On Tuesday, July 18th, 2023 at 18:32, Luis Domingues <luis.domingues@xxxxxxxxx> wrote:
>
> > That part looks quite good:
> >
> > "available": false,
> > "ceph_device": true,
> > "created": "2023-07-18T16:01:16.715487Z",
> > "device_id": "SAMSUNG MZPLJ1T6HBJR-00007_S55JNG0R600354",
> > "human_readable_type": "ssd",
> > "lsm_data": {},
> > "lvs": [
> >   {
> >     "cluster_fsid": "11b47c57-5e7f-44c0-8b19-ddd801a89435",
> >     "cluster_name": "ceph",
> >     "db_uuid": "CUMgp7-Uscn-ASLo-bh14-7Sxe-80GE-EcywDb",
> >     "name": "osd-block-db-5cb8edda-30f9-539f-b4c5-dbe420927911",
> >     "osd_fsid": "089894cf-1782-4a3a-8ac0-9dd043f80c71",
> >     "osd_id": "7",
> >     "osdspec_affinity": "",
> >     "type": "db"
> >   },
> >   {
> >
> > I forgot to mention that the cluster was initially deployed with
> > ceph-ansible and adopted by cephadm.
> >
> > Luis Domingues
> > Proton AG
> >
> >
> > ------- Original Message -------
> > On Tuesday, July 18th, 2023 at 18:15, Adam King adking@xxxxxxxxxx wrote:
> >
> > > in the "ceph orch device ls --format json-pretty" output, in the blob for
> > > that specific device, is the "ceph_device" field set? There was a bug where
> > > it wouldn't be set at all (https://tracker.ceph.com/issues/57100) and it
> > > would make it so you couldn't use a device serving as a db device for any
> > > further OSDs, unless the device was fully cleaned out (so it is no longer
> > > serving as a db device). The "ceph_device" field is meant to be our way of
> > > knowing "yes there are LVM partitions here, but they're our partitions for
> > > ceph stuff, so we can still use the device" and without it (or with it just
> > > being broken, as in the tracker) redeploying OSDs that used the device for
> > > its DB wasn't working as we don't know if those LVs imply it's our device or
> > > has LVs for some other purpose.
> > > I had thought this was fixed already in
> > > 16.2.13 but it sounds too similar to what you're seeing not to consider it.
> > >
> > > On Tue, Jul 18, 2023 at 10:53 AM Luis Domingues luis.domingues@xxxxxxxxx
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > We are running a ceph cluster managed with cephadm v16.2.13. Recently we
> > > > needed to change a disk, and we replaced it with:
> > > >
> > > > ceph orch osd rm 37 --replace
> > > >
> > > > It worked fine, the disk was drained and the OSD marked as destroyed.
> > > >
> > > > However, after changing the disk, no OSD was created. Looking at the db
> > > > device, the partition for db for OSD 37 was still there. So we destroyed it
> > > > using:
> > > > ceph-volume lvm zap --osd-id=37 --destroy
> > > >
> > > > But we still have no OSD redeployed.
> > > > Here we have our spec:
> > > >
> > > > ---
> > > > service_type: osd
> > > > service_id: osd-hdd
> > > > placement:
> > > >   label: osds
> > > > spec:
> > > >   data_devices:
> > > >     rotational: 1
> > > >   encrypted: true
> > > >   db_devices:
> > > >     size: '1TB:2TB'
> > > >   db_slots: 12
> > > >
> > > > And the disk looks good:
> > > >
> > > > HOST    PATH          TYPE  DEVICE ID                                   SIZE   AVAILABLE  REFRESHED  REJECT REASONS
> > > > node05  /dev/nvme2n1  ssd   SAMSUNG MZPLJ1T6HBJR-00007_S55JNG0R600357   1600G             12m ago    LVM detected, locked
> > > > node05  /dev/sdk      hdd   SEAGATE_ST10000NM0206_ZA21G2170000C7240KPF  10.0T  Yes        12m ago
> > > >
> > > > And VG on db_device looks to have enough space:
> > > > ceph-33b06f1a-f6f6-57cf-9ca8-6e4aa81caae0 1 11 0 wz--n- <1.46t 173.91g
> > > >
> > > > If I remove the db_devices and db_slots from the specs, and do a dry run,
> > > > the orchestrator seems to see the new disk as available:
> > > >
> > > > ceph orch apply -i osd_specs.yml --dry-run
> > > > WARNING! Dry-Runs are snapshots of a certain point in time and are bound
> > > > to the current inventory setup.
> > > > If any of these conditions change, the
> > > > preview will be invalid. Please make sure to have a minimal
> > > > timeframe between planning and applying the specs.
> > > > ####################
> > > > SERVICESPEC PREVIEWS
> > > > ####################
> > > > +---------+------+--------+-------------+
> > > > |SERVICE  |NAME  |ADD_TO  |REMOVE_FROM  |
> > > > +---------+------+--------+-------------+
> > > > +---------+------+--------+-------------+
> > > > ################
> > > > OSDSPEC PREVIEWS
> > > > ################
> > > > +---------+---------+-------------------------+----------+----+-----+
> > > > |SERVICE  |NAME     |HOST                     |DATA      |DB  |WAL  |
> > > > +---------+---------+-------------------------+----------+----+-----+
> > > > |osd      |osd-hdd  |node05                   |/dev/sdk  |-   |-    |
> > > > +---------+---------+-------------------------+----------+----+-----+
> > > >
> > > > But as soon as I add db_devices back, the orchestrator is happy as it is,
> > > > like there is nothing to do:
> > > >
> > > > ceph orch apply -i osd_specs.yml --dry-run
> > > > WARNING! Dry-Runs are snapshots of a certain point in time and are bound
> > > > to the current inventory setup. If any of these conditions change, the
> > > > preview will be invalid. Please make sure to have a minimal
> > > > timeframe between planning and applying the specs.
> > > > ####################
> > > > SERVICESPEC PREVIEWS
> > > > ####################
> > > > +---------+------+--------+-------------+
> > > > |SERVICE  |NAME  |ADD_TO  |REMOVE_FROM  |
> > > > +---------+------+--------+-------------+
> > > > +---------+------+--------+-------------+
> > > > ################
> > > > OSDSPEC PREVIEWS
> > > > ################
> > > > +---------+------+------+------+----+-----+
> > > > |SERVICE  |NAME  |HOST  |DATA  |DB  |WAL  |
> > > > +---------+------+------+------+----+-----+
> > > >
> > > > I do not know why ceph will not use this disk, and I do not know where to
> > > > look. It seems logs are not saying anything.
> > > > And the weirdest thing,
> > > > another disk was replaced on the same machine, and it went without any
> > > > issues.
> > > >
> > > > Luis Domingues
> > > > Proton AG
> > > > _______________________________________________
> > > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
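
For anyone hitting the same symptom: the two things checked in this thread
(whether "ceph_device" is set on the db device, and which spec its db LVs are
tied to) can both be read from a single device blob of the
"ceph orch device ls --format json-pretty" output. Below is a minimal Python
sketch, not an official tool. The helper name summarize_db_device is made up
for illustration, the sample blob is a trimmed copy of the fields quoted in
this thread, and pulling one device blob out of the full command output is
left out because the outer layout of that output is not shown here.

```python
import json


def summarize_db_device(device: dict) -> dict:
    """Summarize the fields discussed in this thread for one device blob.

    `device` is assumed to be a single device entry, shaped like the
    fragment quoted in the thread (keys: available, ceph_device, lvs, ...).
    """
    lvs = device.get("lvs") or []
    # Only LVs that serve as a DB for some OSD are of interest here.
    db_lvs = [lv for lv in lvs if lv.get("type") == "db"]
    return {
        # The tracker issue mentioned above was about this field being absent.
        "ceph_device_present": "ceph_device" in device,
        "ceph_device": device.get("ceph_device"),
        "available": device.get("available"),
        # OSD ids whose DB lives on this device.
        "db_osd_ids": sorted(lv.get("osd_id", "") for lv in db_lvs),
        # Distinct spec names recorded on those LVs ("" means none recorded).
        "spec_affinities": sorted({lv.get("osdspec_affinity", "") for lv in db_lvs}),
    }


# Trimmed copy of the blob quoted in the thread (only the fields used above).
sample = json.loads("""
{
  "available": false,
  "ceph_device": true,
  "device_id": "SAMSUNG MZPLJ1T6HBJR-00007_S55JNG0R600354",
  "lvs": [
    {"osd_id": "7", "osdspec_affinity": "", "type": "db"}
  ]
}
""")

summary = summarize_db_device(sample)
print(json.dumps(summary, indent=2))
```

More than one distinct value in "spec_affinities" would match the situation
Luis described, where OSDs sharing the db device belonged to different specs.
Whether that is what makes cephadm skip the host is not confirmed in this
thread.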