Re: The journey to CephFS metadata pool’s recovery

Marco Faggian <m@xxxxxxxxxxxxxxxx> · Tue, 03 Sep 2024 09:56:48 +0000 (UTC)

Hi Frédéric,

Thanks a lot for the pointers!

Using testdisk, I created images of both LVs. The hexdump shows nothing but 0x00 up to offset 0x00a00000 (10 MiB).
Out of curiosity I also compared the two images: they are identical up to byte 12726273.
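
Both checks (where the zero fill ends, and where the two images first differ) can be repeated without reading the hexdump by eye. What follows is a minimal sketch, assuming a POSIX shell with cmp and sed; the function names and the paths in the usage comment are made up for illustration. Note that cmp counts bytes from 1 while hexdump offsets start at 0, so the sketch subtracts 1.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of any Ceph tool.

# Print the 0-based offset of the first non-zero byte in IMAGE ($1).
# Prints nothing if the image is empty or contains only zeros.
first_nonzero_offset() {
    # Compare the image against the endless zero stream in /dev/zero.
    # cmp reports "differ: byte N" (or "char N"), with N counted from 1.
    n=$(LC_ALL=C cmp - /dev/zero < "$1" 2>/dev/null |
        sed -n 's/^.*differ: [a-z]* \([0-9][0-9]*\),.*$/\1/p')
    if [ -n "$n" ]; then echo $(( n - 1 )); fi
}

# Print the 0-based offset of the first byte that differs between two images.
# Prints nothing if they are identical or one is a prefix of the other.
first_difference_offset() {
    n=$(LC_ALL=C cmp "$1" "$2" 2>/dev/null |
        sed -n 's/^.*differ: [a-z]* \([0-9][0-9]*\),.*$/\1/p')
    if [ -n "$n" ]; then echo $(( n - 1 )); fi
}

# Example (hypothetical paths):
#   first_nonzero_offset /tmp/osd-a.img
#   first_difference_offset /tmp/osd-a.img /tmp/osd-b.img
```

Running these against image files rather than the LVs themselves keeps the check read-only.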

Unfortunately, both ceph-bluestore-tool show-label and dumpe2fs -h /dev/ceph-3.. fail in the same way:
unable to read label for /dev/ceph-454751de-44ab-4aa6-b3ae-50abc22250b3/osd-block-b7745d63-0bf8-4ba4-9274-e034f1c15d7b: 2024-09-03T11:27:22.938+0200 7f2d89f19a00 -1 bluestore(/dev/ceph-454751de-44ab-4aa6-b3ae-50abc22250b3/osd-block-b7745d63-0bf8-4ba4-9274-e034f1c15d7b) _read_bdev_label unable to decode label at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input [buffer:3]
(2) No such file or directory

Unfortunately the thread on the tracker doesn’t seem to point to a solution, even though the issue seems to be identical.

The hexdump explains why the label is not found. Could it be that the LV is not mapped correctly?
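
One way to probe the mapping question is to search an image (or the raw NVMe, opened read-only) for the text that BlueStore places at the start of its label. This sketch assumes that the label begins with the literal string "bluestore block device" and that on a healthy OSD it sits at offset 0 of the LV; both assumptions are worth confirming against a working OSD first. It also assumes GNU grep, since -b and -o are extensions, and the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch only: the function name is illustrative, not part of any Ceph tool.

# Print the 0-based byte offset of every "bluestore block device" string
# found in IMAGE ($1), which may be an image file or a block device.
find_bluestore_labels() {
    # tr swaps NUL for newline one-for-one, so byte offsets are unchanged
    # while long zero-filled runs no longer look like one huge line to grep.
    # -a: treat binary as text, -b: print byte offsets,
    # -o: report the offset of the match itself rather than of its line.
    LC_ALL=C tr '\000' '\n' < "$1" |
        LC_ALL=C grep -a -b -o 'bluestore block device' |
        cut -d: -f1
}

# Example (hypothetical path):
#   find_bluestore_labels /tmp/osd-a.img
```

If the signature turns up at a non-zero offset, or only on the raw device, that would suggest the restored LVM metadata places the LV at the wrong start. If it turns up nowhere, the label area was probably overwritten.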

Basically, the question is: is there a way to recover the data of an OSD on an LV if the OSD was removed with ceph osd purge (after ceph osd out) before the cluster had a chance to replicate its data?

Thanks for your time!
fm

> On 3 Sep 2024, at 10:35, Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
> 
> Hi Marco,
> 
> Have you checked the output of:
> 
> dd if=/dev/ceph-xxxxxxx/osd-block-xxxxxxxxx of=/tmp/foo bs=4K count=2
> hexdump -C /tmp/foo
> 
> and:
> 
> /usr/bin/ceph-bluestore-tool show-label --log-level=30 --dev /dev/nvmexxx -l /var/log/ceph/ceph-volume.log
> 
> to see if it's aligned with the OSD's metadata.
> 
> You may also want to check this discussion [1] and this tracker [2] for useful commands.
> 
> Regards,
> Frédéric
> 
> [1] https://u44306567.ct.sendgrid.net/ls/click?upn=u001.OhTBFyWWodn3bJkXMXGAv-2BZHpSeLvUgnUfQYY8aOtHqL2EkJ0tRYReC17tg7ZWuqCh7Q4Ai4neLPkzw5-2BDSn2w-3D-3DyCXo_vcBilSDAMc7vbwkuA7HwLqSanoCbWsmbmPjAUA74GgGklME7KqAcKOJTqQrs9ibRxgHi7iIpd1RwoELwFal1hOyRF4WG89ePOrnAYKRApi0p5N6sTCgiyRAEQt2ObOQot-2Fyscfzyz6MieMVaGAEJ7VCllNYmcv9dNUkiMfhvlx8e7p9oaFQEfbyNIRmbFq8o76uCYY9tW-2F5F55CKIrHa8A-3D-3D
> [2] https://u44306567.ct.sendgrid.net/ls/click?upn=u001.OhTBFyWWodn3bJkXMXGAv52B3EPWQYPsrY2jPMbn-2FFFEy1wfXUqafqHSMMXybus8cbCx9aWnQcNpmdEOSnGhdQ-3D-3Dr3JK_vcBilSDAMc7vbwkuA7HwLqSanoCbWsmbmPjAUA74GgGklME7KqAcKOJTqQrs9ibRIU-2BxbmHkiB4l9WOiEMOcHYPjTp3d-2FBlL26YghDgjD-2FItN27RfmloYmoyeuHiEJFxE85O0EnGGeGnqcgcadlejVEXr5zIprHsqfoSVIun-2BuZCtzmJnBSulY0xvNSmawxfPYz2Vz4eZMTv187omUV2Hw-3D-3D
> 
> ----- On 2 Sep 24, at 11:26, m@xxxxxxxxxxxxxxxx wrote:
> 
>>> FYI: Also posted in L1Techs forum:
>>> https://u44306567.ct.sendgrid.net/ls/click?upn=u001.OhTBFyWWodn3bJkXMXGAv0dbxLESmgzlCH2uV9FGVgYCqnBgx79OK77bKwb3msPoq1C21WCYHtV76m8lwbTKROtyZySr4X3JxJ-2Fl9Hx-2B5Ehx3Jpf1XQvGULbY87yY4Yxcmsh_vcBilSDAMc7vbwkuA7HwLqSanoCbWsmbmPjAUA74GgGklME7KqAcKOJTqQrs9ibRXbE6IL1j-2By3MA8z-2BLkOHzDleiC4tcse50yEqUVe-2B2kDcpdPz8c8Luz4LtGyv4Wl-2BLVN9BHHuesPqKN-2FL-2FRTTUmF6276Ye1uTExB3hMm111DKh5YKMAJIHppByp4igBs3XvUGRMj-2BkX5sSbLWR819KQ-3D-3D
>> 
>> ## The epic intro
>> After some self-inflicted pain, I’m writing here to ask for volunteers to join the
>> journey of recovering the lost partitions that housed the CephFS metadata pool.
>> 
>> ## The setup
>> 1 proxmox host (I know)
>> 1 replication rule only for NVMes (2x OSD)
>> 1 replication rule only for HDDs (8x OSD)
>> Each with the failure domain set to osd.
>> Each OSD configured to use the bluestore backend on an LVM logical volume.
>> No backup (I know, I know).
>> 
>> ## The cause (me)
>> Long story short: I needed the PCIe lanes and decided to remove the two NVMes
>> that were hosting the CephFS metadata pool and the .mgr pool. I proceeded to
>> remove the two OSDs (out and destroy).
>> This is where I goofed: I didn’t switch the replication rule to the HDD one,
>> so the cluster never moved the PGs stored on the NVMes to the HDDs.
>> 
>> ## What I’ve done until now
>> 1. Re-seated the NVMes to their original place.
>> 2. Found out that the LVs didn’t have the OSD’s labels applied.
>> 3. Forced the backed-up LVM config onto the two NVMes (thanks to the holy entity
>> that thought that archiving LVM config was a good thing, it paid off).
>> 4. Tried ceph-volume lvm activate 8 <id>, only to find that it’s unable to decode
>> the label at offset 102 in the LV for that OSD.
>> 
>> ## Wishes
>> 1. Does anyone know a way to recover what I feel is a lost partition, given that
>> the “file system” is Ceph’s bluestore?
>> 2. If it is lost, is there a way to find out how the partition was nuked, and
>> possibly a way to reverse that process?
>> 
>> ## Closing statement
>> Eternal reminder: If you don’t want to lose it, back it up.
>> Thanks for your time, to the kind souls that are willing to die on this hill
>> with me, or come up victorious!
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> 
