Re: Recover Data from Deleted RBD Volume

All RBD images use a backing RADOS object to facilitate mapping
between the external image name and the internal image id.  For v1
images this object would be named "<image name>.rbd" and for v2 images
 this object would be named "rbd_id.<image name>". You would need to
find this deleted object first in order to start figuring out the
internal image id.

For example, if image "test" was a v1 RBD image in the pool rbd, you would run:
# ceph osd map rbd test.rbd
osdmap e9 pool 'rbd' (0) object 'test.rbd' -> pg 0.9a2f7478 (0.0)
-> up ([0,2,1], p0) acting ([0,2,1], p0)

In this example, the object would be placed in PG 0.0 on OSDs 0, 2, and
1. Since this object still exists for me, I can locate it on the OSD
disk:
# find /var/lib/ceph/osd/ceph-0/current/0.0_head/ -name "test.rbd*"
/var/lib/ceph/osd/ceph-0/current/0.0_head/test.rbd__head_9A2F7478__0

In your case, you would need to search for deleted files whose names
contain your image name within the appropriate PG directories on the
associated OSDs.
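
Since "ceph osd map" is a pure CRUSH calculation, it works even for
objects that no longer exist, so you can still compute where a deleted
header object would have lived. A minimal sketch, assuming the deleted
image in this thread was a v2 image named
"volume-a490aa0c-6957-4ea2-bb5b-e4054d3765ad" in the "volumes" pool:

# ceph osd map volumes rbd_id.volume-a490aa0c-6957-4ea2-bb5b-e4054d3765ad

The PG and OSDs printed by that command tell you which PG directories
on which OSDs to search.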

If you can recover the v1 RBD header (e.g. test.rbd), you can dump its
contents and extract the block name prefix. In my example, if I
hexdump my v1 header object, I see:
# hexdump -C /var/lib/ceph/osd/ceph-0/current/0.0_head/test.rbd__head_9A2F7478__0
00000000  3c 3c 3c 20 52 61 64 6f  73 20 42 6c 6f 63 6b 20  |<<< Rados Block |
00000010  44 65 76 69 63 65 20 49  6d 61 67 65 20 3e 3e 3e  |Device Image >>>|
00000020  0a 00 00 00 00 00 00 00  72 62 2e 30 2e 31 30 31  |........rb.0.101|
00000030  30 2e 37 34 62 30 64 63  35 31 00 00 00 00 00 00  |0.74b0dc51......|
00000040  52 42 44 00 30 30 31 2e  30 30 35 00 16 00 00 00  |RBD.001.005.....|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000070
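
If you would rather not read the prefix out of the hexdump by eye,
here is a small sketch for extracting it, based only on the layout
visible above (a 40-byte banner followed by a 24-byte, NUL-padded
block name field):

# dd if=test.rbd__head_9A2F7478__0 bs=1 skip=40 count=24 2>/dev/null | tr -d '\0'
rb.0.1010.74b0dc51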

All the data blocks associated with this image will be named with a
"rb.0.1010.74b0dc51." prefix followed by a 12-character, zero-padded
hexadecimal number representing the object's offset within the image.
For example, assuming the default 4MB object size,
"rb.0.1010.74b0dc51.0000000000cf" (0xcf = 207) would represent the
offset 828MB through 832MB within the image.
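
As a sketch, you could pre-generate the full list of candidate object
names for this prefix. The object count below (128000, i.e. a 500GB
image at 4MB per object) is an assumption based on the image size
mentioned later in this thread, so adjust it to your image:

# for i in $(seq 0 127999); do printf 'rb.0.1010.74b0dc51.%012x\n' "$i"; done > /tmp/object_names.txt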

For each one of these object names, you would need to run the "ceph
osd map" command to determine where these data blocks would have lived
and then attempt to undelete them.
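
For example, continuing from the name list sketched above (the file
paths are illustrative; substitute your own pool name for "rbd"):

# while read -r obj; do ceph osd map rbd "$obj"; done < /tmp/object_names.txt > /tmp/object_placement.txt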

Unfortunately, for v2 RBD images, this image name to image id mapping
is stored in the LevelDB database within the OSDs and I don't know,
offhand, how to attempt to recover deleted values from there.
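
For what it's worth, the place that mapping normally lives for
still-existing v2 images is the omap of the pool's "rbd_directory"
object, which you can inspect like this (it will not, however, show
entries for images that have already been deleted):

# rados -p volumes listomapvals rbd_directory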


On Mon, Aug 8, 2016 at 4:39 PM, Georgios Dimitrakakis
<giorgis@xxxxxxxxxxxx> wrote:
> Dear David (and all),
>
> the data are considered very critical, hence all this effort to recover
> them.
>
> Although the cluster hasn't been fully stopped, all user actions have. I
> mean services are running but users are not able to read/write/delete.
>
> The deleted image was the exact same size as the example (500GB) but it
> wasn't the only one deleted today. Our user was trying to do a "massive"
> cleanup by deleting 11 volumes and unfortunately one of them was very
> important.
>
> Let's assume that I "dd" all the drives; what further actions should I take to
> recover the files? Could you please elaborate a bit more on the phrase "If
> you've never deleted any other rbd images and assuming you can recover data
> with names, you may be able to find the rbd objects"??
>
> Do you mean that if I know the file names I can go through and check for
> them? How?
> Do I have to know *all* file names, or can I find all the data that
> exist by searching for a few of them?
>
> Thanks a lot for taking the time to answer my questions!
>
> All the best,
>
> G.
>
>> I don't think there's a way of getting the prefix from the cluster at
>> this point.
>>
>> If the deleted image was a similar size to the example you've given,
>> you will likely have had objects on every OSD. If this data is
>> absolutely critical you need to stop your cluster immediately or make
>> copies of all the drives with something like dd. If you've never
>> deleted any other rbd images and assuming you can recover data with
>> names, you may be able to find the rbd objects.
>>
>> On Mon, Aug 8, 2016 at 7:28 PM, Georgios Dimitrakakis  wrote:
>>
>>>>> Hi,
>>>>>
>>>>> On 08.08.2016 10:50, Georgios Dimitrakakis wrote:
>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 08.08.2016 09:58, Georgios Dimitrakakis wrote:
>>>>>>>
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> I would like your help with an emergency issue but first
>>>>>>>> let me describe our environment.
>>>>>>>>
>>>>>>>> Our environment consists of 2 OSD nodes with 10x 2TB HDDs
>>>>>>>> each and 3 MON nodes (2 of them are the OSD nodes as well)
>>>>>>>> all with ceph version 0.80.9
>>>>>>>> (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
>>>>>>>>
>>>>>>>> This environment provides RBD volumes to an OpenStack
>>>>>>>> Icehouse installation.
>>>>>>>>
>>>>>>>> Although not a state-of-the-art environment, it is working
>>>>>>>> well and within our expectations.
>>>>>>>>
>>>>>>>> The issue now is that one of our users accidentally
>>>>>>>> deleted one of the volumes without keeping its data first!
>>>>>>>>
>>>>>>>> Is there any way (since the data are considered critical
>>>>>>>> and very important) to recover them from CEPH?
>>>>>>>
>>>>>>>
>>>>>>> Short answer: no
>>>>>>>
>>>>>>> Long answer: no, but....
>>>>>>>
>>>>>>> Consider the way Ceph stores data... each RBD is striped
>>>>>>> into chunks
>>>>>>> (RADOS objects with 4MB size by default); the chunks are
>>>>>>> distributed
>>>>>>> among the OSDs with the configured number of replicates
>>>>>>> (probably two
>>>>>>> in your case since you use 2 OSD hosts). RBD uses thin
>>>>>>> provisioning,
>>>>>>> so chunks are allocated upon first write access.
>>>>>>> If an RBD is deleted all of its chunks are deleted on the
>>>>>>> corresponding OSDs. If you want to recover a deleted RBD,
>>>>>>> you need to
>>>>>>> recover all individual chunks. Whether this is possible
>>>>>>> depends on
>>>>>>> your filesystem and whether the space of a former chunk is
>>>>>>> already
>>>>>>> assigned to other RADOS objects. The RADOS object names are
>>>>>>> composed
>>>>>>> of the RBD name and the offset position of the chunk, so if
>>>>>>> an
>>>>>>> undelete mechanism exists for the OSDs filesystem, you have
>>>>>>> to be
>>>>>>> able to recover files by their filename, otherwise you might
>>>>>>> end up
>>>>>>> mixing the content of various deleted RBDs. Due to the thin
>>>>>>> provisioning there might be some chunks missing (e.g. never
>>>>>>> allocated
>>>>>>> before).
>>>>>>>
>>>>>>> Given the fact that
>>>>>>> - you probably use XFS on the OSDs since it is the
>>>>>>> preferred
>>>>>>> filesystem for OSDs (there is RDR-XFS, but I've never had to
>>>>>>> use it)
>>>>>>> - you would need to stop the complete ceph cluster
>>>>>>> (recovery tools do
>>>>>>> not work on mounted filesystems)
>>>>>>> - your cluster has been in use after the RBD was deleted
>>>>>>> and thus
>>>>>>> parts of its former space might already have been
>>>>>>> overwritten
>>>>>>> (replication might help you here, since there are two OSDs
>>>>>>> to try)
>>>>>>> - XFS undelete does not work well on fragmented files (and
>>>>>>> OSDs tend
>>>>>>> to introduce fragmentation...)
>>>>>>>
>>>>>>> the answer is no, since it might not be feasible and the
>>>>>>> chances of
>>>>>>> success are way too low.
>>>>>>>
>>>>>>> If you want to spend time on it, I would propose to stop
>>>>>>> the ceph
>>>>>>> cluster as soon as possible, create copies of all involved
>>>>>>> OSDs, start
>>>>>>> the cluster again and attempt the recovery on the copies.
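>>>>>>>
>>>>>>> As a minimal sketch of such a copy (here /dev/sdb1 and /backup
>>>>>>> stand in for your actual OSD data partition and backup
>>>>>>> location):
>>>>>>>
>>>>>>> # dd if=/dev/sdb1 of=/backup/osd-0.img bs=4M conv=noerror,sync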
>>>>>>>
>>>>>>> Regards,
>>>>>>> Burkhard
>>>>>>
>>>>>>
>>>>>> Hi! Thanks for the info...I understand that this is a very
>>>>>> difficult and probably not feasible task but in case I need to
>>>>>> try a recovery, what other info would I need? Can I somehow
>>>>>> find out on which OSDs the specific data were stored and
>>>>>> minimize my search there?
>>>>>> Any ideas on how should I proceed?
>>>>>
>>>>> First of all you need to know the exact object names for the
>>>>> RADOS
>>>>> objects. As mentioned before, the name is composed of the RBD
>>>>> name and
>>>>> an offset.
>>>>>
>>>>> In case of OpenStack, there are three different patterns for
>>>>> RBD names:
>>>>>
>>>>> <image uuid>, e.g. 50f2a0bd-15b1-4dbb-8d1f-fc43ce535f13
>>>>> for glance images,
>>>>> <instance uuid>_disk, e.g. 9aec1f45-9053-461e-b176-c65c25a48794_disk
>>>>> for nova images,
>>>>> volume-<volume uuid>, e.g. volume-0ca52f58-7e75-4b21-8b0f-39cbcd431c42
>>>>> for cinder volumes
>>>>>
>>>>> (not considering snapshots etc, which might use different
>>>>> patterns)
>>>>>
>>>>> The RBD chunks are created using a certain prefix (using
>>>>> examples
>>>>> from our openstack setup):
>>>>>
>>>>> # rbd -p os-images info 8fa3d9eb-91ed-4c60-9550-a62f34aed014
>>>>> rbd image 8fa3d9eb-91ed-4c60-9550-a62f34aed014:
>>>>>     size 446 MB in 56 objects
>>>>>     order 23 (8192 kB objects)
>>>>>     block_name_prefix: rbd_data.30e57d54dea573
>>>>>     format: 2
>>>>>     features: layering, striping
>>>>>     flags:
>>>>>     stripe unit: 8192 kB
>>>>>     stripe count: 1
>>>>>
>>>>> # rados -p os-images ls | grep rbd_data.30e57d54dea573
>>>>> rbd_data.30e57d54dea573.0000000000000015
>>>>> rbd_data.30e57d54dea573.0000000000000008
>>>>> rbd_data.30e57d54dea573.000000000000000a
>>>>> rbd_data.30e57d54dea573.000000000000002d
>>>>> rbd_data.30e57d54dea573.0000000000000032
>>>>>
>>>>> I don't know whether the prefix is derived from some other
>>>>> information, but to recover the RBD you definitely need it.
>>>>>
>>>>> _If_ you are able to recover the prefix, you can use ceph osd
>>>>> map
>>>>> to find the OSDs for each chunk:
>>>>>
>>>>> # ceph osd map os-images
>>>>> rbd_data.30e57d54dea573.000000000000001a
>>>>> osdmap e418590 pool os-images (38) object
>>>>> rbd_data.30e57d54dea573.000000000000001a -> pg 38.d5d81d65
>>>>> (38.65)
>>>>> -> up ([45,17,108], p45) acting ([45,17,108], p45)
>>>>>
>>>>> With 20 OSDs in your case you will likely have to process all
>>>>> of them
>>>>> if the RBD has a size of several GBs.
>>>>>
>>>>> Regards,
>>>>> Burkhard
>>>>
>>>>
>>>> Is it possible to get the prefix if the RBD has been deleted
>>>> already? Is this info stored somewhere? Can I retrieve it in
>>>> another way besides "rbd info"? Because when I try to get it
>>>> using the
>>>> "rbd info" command unfortunately I am getting the following
>>>> error:
>>>>
>>>> "librbd::ImageCtx: error finding header: (2) No such file or
>>>> directory"
>>>>
>>>> Any ideas?
>>>>
>>>> Best regards,
>>>>
>>>> G.
>>>
>>>
>>> Here are some more info from the cluster:
>>>
>>> $ ceph df
>>> GLOBAL:
>>>     SIZE       AVAIL      RAW USED     %RAW USED
>>>     74373G     72011G        2362G          3.18
>>> POOLS:
>>>     NAME                   ID     USED      %USED     MAX AVAIL     OBJECTS
>>>     data                   3          0         0        35849G           0
>>>     metadata               4       1884         0        35849G          20
>>>     rbd                    5          0         0        35849G           0
>>>     .rgw                   6       1374         0        35849G           8
>>>     .rgw.control           7          0         0        35849G           8
>>>     .rgw.gc                8          0         0        35849G          32
>>>     .log                   9          0         0        35849G           0
>>>     .intent-log            10         0         0        35849G           0
>>>     .usage                 11         0         0        35849G           3
>>>     .users                 12        33         0        35849G           3
>>>     .users.email           13        22         0        35849G           2
>>>     .users.swift           14        22         0        35849G           2
>>>     .users.uid             15       985         0        35849G           4
>>>     .rgw.root              16       840         0        35849G           3
>>>     .rgw.buckets.index     17         0         0        35849G           4
>>>     .rgw.buckets           18      170G      0.23        35849G      810128
>>>     .rgw.buckets.extra     19         0         0        35849G           1
>>>     volumes                20     1004G      1.35        35849G      262613
>>>
>>> Obviously the RBD volumes provided to OpenStack are stored on the
>>> "volumes" pool , so trying to
>>> figure out the prefix for the volume in question
>>> "volume-a490aa0c-6957-4ea2-bb5b-e4054d3765ad" produces the
>>> following:
>>>
>>> $ rbd -p volumes info volume-a490aa0c-6957-4ea2-bb5b-e4054d3765ad
>>> rbd: error opening image
>>> volume-a490aa0c-6957-4ea2-bb5b-e4054d3765ad: (2) No such file or
>>> directory
>>> 2016-08-09 03:04:56.250977 7fa9ba1ca760 -1 librbd::ImageCtx: error
>>> finding header: (2) No such file or directory
>>>
>>> On the other hand, for a volume that already exists and is working
>>> normally, I get the following:
>>>
>>> $ rbd -p volumes info volume-2383fc3a-2b6f-49b4-a3f5-f840569edb73
>>> rbd image volume-2383fc3a-2b6f-49b4-a3f5-f840569edb73:
>>>         size 500 GB in 128000 objects
>>>         order 22 (4096 kB objects)
>>>         block_name_prefix: rbd_data.fb1bb3136c3ec
>>>         format: 2
>>>         features: layering
>>>
>>> and can also get the OSD mapping etc.
>>>
>>> Does that mean that there is no way to find out on which OSDs the
>>> deleted volume was placed?
>>> If that's the case then it's not possible to recover the data... Am I
>>> right???
>>>
>>> Any other ideas people???
>>>
>>> Looking forward for your comments...please...
>>>
>>> Best regards,
>>>
>>> G.



-- 
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


