Mishap after disk replacement, db and block split into separate OSDs in ceph-volume

We had a faulty disk which was causing many errors, and since the
replacement took a while we had to stop Ceph from using the OSD in the
meantime. However, I think we must have done that wrong: after the disk
replacement, our ceph orch seems to have picked up /dev/sdp and
automatically added a new OSD (588), without a separate DB device (perhaps
because the DB slot was still taken by the old osd.31? I'm not sure).
This led to a state where osd.31 of course wouldn't start, and some actions
were attempted to clear it out, which might have just caused more harm.
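
In hindsight, I suspect what we should have done before pulling the disk was
to stop cephadm from auto-creating OSDs from the spec. A sketch of what I
mean (assuming the spec lives in a file like osd_spec.yaml; the filename is
my guess):

# add at the top level of the OSD spec so cephadm stops creating OSDs from it:
unmanaged: true

# then re-apply the spec:
ceph orch apply -i osd_spec.yaml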

Long story short, we are currently in an odd position: "ceph-volume lvm
list" still shows osd.31, but with only a [db] section:
====== osd.31 ======

  [db]          /dev/ceph-1b309b1e-a4a6-4861-b16c-7c06ecde1a3d/osd-db-fb09a714-f955-4418-99f2-6bccd8c6220e

      block device              /dev/ceph-48f7dbd8-4a7c-4f7e-8962-104e756ae864/osd-block-33538b36-52b3-421d-bf66-6c729a057707
      block uuid                bykFYi-z8T6-OWXp-i1OB-H7CE-uLDm-Td6QTI
      cephx lockbox secret
      cluster fsid              5406fed0-d52b-11ec-beff-7ed30a54847b
      cluster name              ceph
      crush device class        None
      db device                 /dev/ceph-1b309b1e-a4a6-4861-b16c-7c06ecde1a3d/osd-db-fb09a714-f955-4418-99f2-6bccd8c6220e
      db uuid                   Vy3aOA-qseQ-RIDT-741e-z7o0-y376-kKTXRE
      encrypted                 0
      osd fsid                  33538b36-52b3-421d-bf66-6c729a057707
      osd id                    31
      osdspec affinity          osd_spec
      type                      db
      vdo                       0
      devices                   /dev/nvme0n1
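
For what it's worth, the same mapping can be cross-checked straight from the
LVM tags, since ceph-volume stores its metadata there as far as I
understand; something like this should show which LVs still claim osd id 31:

lvs -o lv_name,vg_name,lv_tags | grep ceph.osd_id=31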

and a separate extra osd.588 (which is running) that has taken only the
[block] device:

====== osd.588 ======

  [block]       /dev/ceph-f63ef837-3b18-47a4-be55-d5c2c0db8927/osd-block-58b33b8f-9623-46b3-a86a-3061602a76b5

      block device              /dev/ceph-f63ef837-3b18-47a4-be55-d5c2c0db8927/osd-block-58b33b8f-9623-46b3-a86a-3061602a76b5
      block uuid                KYHzBq-zgJJ-Nw93-j7Jx-Oz5i-BMuU-ndtTCH
      cephx lockbox secret
      cluster fsid              5406fed0-d52b-11ec-beff-7ed30a54847b
      cluster name              ceph
      crush device class
      encrypted                 0
      osd fsid                  58b33b8f-9623-46b3-a86a-3061602a76b5
      osd id                    588
      osdspec affinity          all-available-devices
      type                      block
      vdo                       0
      devices                   /dev/sdp

I figured the best action was to clear out both of these faulty OSDs via the
orchestrator ("ceph orch osd rm XX"), but osd.31 isn't recognized:

[ceph: root@mimer-osd01 /]# ceph orch osd rm 31
Unable to find OSDs: ['31']

Removing 588 is recognized. Should I attempt to clear out osd.31 from
ceph-volume manually? I'd really like to get back to a situation where I
have osd.31 with an osd fsid that matches across both devices, using
/dev/sdp and /dev/nvme0n1, but I'm really afraid of just breaking things
even more.
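
For concreteness, the sequence I have been considering, but have not run, is
roughly this (a sketch only; the LV path is the leftover DB from the listing
above, and I'm assuming id 31 still exists in the osd map):

# 1) remove the accidental OSD and wipe /dev/sdp:
ceph orch osd rm 588 --zap
ceph orch osd rm status    # watch progress

# 2) clear the stale cluster-side entries (crush, auth, osd map) for osd.31:
ceph osd purge 31 --yes-i-really-mean-it

# 3) wipe the leftover DB LV so the slot on the NVMe can be reused:
ceph-volume lvm zap --destroy /dev/ceph-1b309b1e-a4a6-4861-b16c-7c06ecde1a3d/osd-db-fb09a714-f955-4418-99f2-6bccd8c6220e

Is this sane, or is there a better-supported path?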

From what I can see from files lying around, the OSD spec we have is
simply:
placement:
  host_pattern: "mimer-osd01"
service_id: osd_spec
service_type: osd
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
in case this matters.
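
If the cleanup above works, would explicitly re-adding the disk pair be the
way back? My untested guess, based on my reading of the cephadm docs (and I
realise the new OSD may not get id 31 back unless the old id is freed
first):

ceph orch daemon add osd mimer-osd01:data_devices=/dev/sdp,db_devices=/dev/nvme0n1

Or perhaps, with the spec above still applied, cephadm would simply redeploy
on its own once the devices are clean. I appreciate any help or guidance.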

Best regards, Mikael
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


