Re: The journey to CephFS metadata pool’s recovery

Have you tried to hexdump the actual NVMe devices instead of the testdisk images?

`hexdump -C -n 4096 /dev/nvmexxx` should show the LVM LABELONE header with the PV UUID.
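For reference, LVM normally writes its PV label in the second 512-byte sector, so on a healthy device the dump should contain something along these lines (purely illustrative; xx stands for the CRC bytes and the UUID will be your own):

00000200  4c 41 42 45 4c 4f 4e 45  01 00 00 00 00 00 00 00  |LABELONE........|
00000210  xx xx xx xx 20 00 00 00  4c 56 4d 32 20 30 30 31  |.... ...LVM2 001|
00000220  <the PV UUID as 32 ASCII characters, without dashes>

If the first 4 KiB are all zeros instead, the PV label itself is gone.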

and 

`hexdump -C -n 4096 /dev/ceph-454751de-44ab-4aa6-b3ae-50abc22250b3/osd-block-b7745d63-0bf8-4ba4-9274-e034f1c15d7b` should show the bluestore metadata, AKA the 'label', that you can also get with `ceph-bluestore-tool show-label`.
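When that label is intact, `ceph-bluestore-tool show-label --dev <path-to-the-LV>` prints a small JSON document, roughly of this shape (values are placeholders and a real OSD shows more fields):

{
    "/dev/ceph-xxxxxxx/osd-block-xxxxxxxxx": {
        "osd_uuid": "<normally the same uuid as in the osd-block-<uuid> LV name>",
        "size": <device size in bytes>,
        "btime": "<creation time>",
        "description": "main"
    }
}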

If nothing shows up, then you may need to rewrite the label as described by Igor here [1]. Thing is, you'd have to make sure you're dealing with the right disk, meaning the two NVMes were not swapped.
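Before attempting any label rewrite, it's probably worth saving the current first 4 KiB of both the LV and the raw NVMe so you can roll back; something like this (paths are placeholders):

dd if=/dev/ceph-xxxxxxx/osd-block-xxxxxxxxx of=/root/osd-block-head.bin bs=4096 count=1
dd if=/dev/nvmexxx of=/root/nvme-head.bin bs=4096 count=1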

To help with that part, you could run `strings /dev/nvmexxx | less` and try to match the VG information with the OSD config files, if you're able to.
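For instance, something like the following (device path and minimum string length are just examples) should surface any embedded LVM metadata, which you can then compare with the copies under /etc/lvm/backup/ and /etc/lvm/archive/:

strings -n 8 /dev/nvmexxx | grep -E 'LVM2|ceph|osd-block' | sort -u | less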

Hope that helps. 

Regards, 
Frédéric. 

[1] https://www.spinics.net/lists/ceph-users/msg81813.html 

----- On 3 Sep 24, at 11:56, Marco Faggian <m@xxxxxxxxxxxxxxxx> wrote:

> Hi Frédéric,
> Thanks a lot for the pointers!

> So, using testdisk I’ve created images of both LVs. I’ve looked at the
> hexdump and it’s filled with 0x00 up to offset 00a00000.
> Then, out of curiosity, I compared them and they’re identical up to byte 12726273.

> Also, unfortunately, the issue is that ceph-bluestore-tool show-label and
> dumpe2fs -h /dev/ceph-3.. both error out in the same way:

> unable to read label for
> /dev/ceph-454751de-44ab-4aa6-b3ae-50abc22250b3/osd-block-b7745d63-0bf8-4ba4-9274-e034f1c15d7b:
> 2024-09-03T11:27:22.938+0200 7f2d89f19a00 -1
> bluestore(/dev/ceph-454751de-44ab-4aa6-b3ae-50abc22250b3/osd-block-b7745d63-0bf8-4ba4-9274-e034f1c15d7b)
> _read_bdev_label unable to decode label at offset 102: void
> bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&)
> decode past end of struct encoding: Malformed input [buffer:3]

> (2) No such file or directory

> Unfortunately the thread on the tracker doesn’t seem to point to a solution,
> even though the issue seems to be identical.

> The hexdump explains why it’s not finding the label; might it be that the LV is
> not correctly mapped?

> Basically, the question here is: is there a way to recover the data of an OSD on
> an LV, if it was purged (ceph osd purge) before the cluster had a chance to replicate it
> (after ceph osd out)?

> Thanks for your time!
> fm

>> On 3 Sep 2024, at 10:35, Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> wrote:

>> Hi Marco,

>> Have you checked the output of:

>> dd if=/dev/ceph-xxxxxxx/osd-block-xxxxxxxxx of=/tmp/foo bs=4K count=2
>> hexdump -C /tmp/foo

>> and:

>> /usr/bin/ceph-bluestore-tool show-label --log-level=30 --dev /dev/nvmexxx -l
>> /var/log/ceph/ceph-volume.log

>> to see if it's aligned with the OSD's metadata.

>> You may also want to check this discussion [1] and this tracker [2] for useful
>> commands.

>> Regards,
>> Frédéric

>> [1] https://marc.info/?l=ceph-users&m=171395775626007&w=2
>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1755956

>> ----- On 2 Sep 24, at 11:26, m@xxxxxxxxxxxxxxxx wrote:

>>> FYI: Also posted in the L1Techs forum:
>>> https://forum.level1techs.com/t/recover-bluestore-osd-in-ceph-cluster/215715.

>>> ## The epic intro
>>> Through self-inflicted pain, I’m writing here to ask for volunteers in the
>>> journey of recovering the lost partitions housing the CephFS metadata pool.

>>> ## The setup
>>> 1 proxmox host (I know)
>>> 1 replication rule only for NVMes (2x OSD)
>>> 1 replication rule only for HDDs (8x OSD)
>>> Each with a failure domain of osd.
>>> Each OSD configured to use the bluestore backend on an LVM volume.
>>> No backup (I know, I know).

>>> ## The cause (me)
>>> Long story short: I needed the PCIe lanes and decided to remove the two NVMes
>>> that were hosting the CephFS metadata pool and the .mgr pool. I proceeded to
>>> remove the two OSDs (out and destroy).
>>> This is where I goofed: I didn’t change the replication rule to the HDDs’ one,
>>> so the cluster never moved the PGs stored on the NVMes to the HDDs.

>>> ## What I’ve done until now
>>> 1. Re-seated the NVMes in their original place.
>>> 2. Found out that the LVM didn’t have the OSDs’ labels applied.
>>> 3. Forced the backed-up LVM config onto the two NVMes (thanks to the holy entity
>>> that thought archiving the LVM config was a good thing, it paid off).
>>> 4. Tried ceph-volume lvm activate 8 <id>, only to find out that it’s unable to decode
>>> the label at offset 102 in the LV for that OSD.

>>> ## Wishes
>>> 1. Does anyone know a way to recover what I feel is a lost partition, given that
>>> the “file system” is Ceph’s bluestore?
>>> 2. If it is lost, is there a way to know how the partition was nuked? And
>>> possibly find a way to reverse that process.

>>> ## Closing statement
>>> Eternal reminder: If you don’t want to lose it, back it up.
>>> Thanks for your time, and to the kind souls who are willing to die on this hill
>>> with me, or come up victorious!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



