Re: CephFS file to rados object mapping

On Wed, Oct 14, 2015 at 7:20 PM, Francois Lafont <flafdivers@xxxxxxx> wrote:
> Hi,
>
> On 14/10/2015 06:45, Gregory Farnum wrote:
>
>>> Ok, however during my tests I had been careful to replace the correct
>>> file with a bad file of *exactly* the same size (the content of the
>>> file was just a little string and I changed it to a string of
>>> exactly the same size). I had been careful to undo the mtime update
>>> too (I restored the mtime the file had before the change). Despite
>>> this, the "repair" command worked well. Tested twice: 1. with the change
>>> on the primary OSD and 2. on the secondary OSD. And I was surprised
>>> because I thought test 1 (on the primary OSD) would fail.
>>
>> Hm. I'm a little confused by that, actually. Exactly what was the path
>> to the files you changed, and do you have before-and-after comparisons
>> on the content and metadata?
>
> I didn't remember exactly the process I had followed, so I have just retried
> it today. Here is my process. I have a healthy cluster with 3 nodes (Ubuntu
> Trusty) running ceph Hammer (version 0.94.3). I have mounted cephfs on
> /mnt on one of the nodes.
>
> ~# cat /mnt/file.txt # yes it's a little file. ;)
> 123456
>
> ~# ls -i /mnt/file.txt
> 1099511627776 /mnt/file.txt
>
> ~# printf "%x\n" 1099511627776
> 10000000000
>
> ~# rados -p data ls - | grep 10000000000
> 10000000000.00000000
>
> I have the name of the object mapped to my "file.txt".
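>
> (Just for reference, the same mapping as a little script sketch -- it assumes
> the CephFS data pool is named "data", as in my cluster, and it only looks at
> the first object of the file, i.e. the <inode in hex>.00000000 chunk:)
>
> -----------------------------
> #!/bin/sh
> # Usage: ./file_to_object.sh /mnt/file.txt
> # Print the name of the first rados object backing a cephfs file.
> f="$1"
> ino=$(stat -c %i "$f")                 # decimal inode number
> obj=$(printf '%x.00000000\n' "$ino")   # object name: <inode in hex>.00000000
> echo "$obj"
> rados -p data ls - | grep -Fx "$obj"   # check that the object really exists
> -----------------------------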
>
> ~# ceph osd map data 10000000000.00000000
> osdmap e76 pool 'data' (3) object '10000000000.00000000' -> pg 3.f0b56f30 (3.30) -> up ([1,2], p1) acting ([1,2], p1)
>
> So my object is on the primary OSD (OSD-1) and on the secondary OSD (OSD-2).
> I open a terminal on the node which hosts the primary OSD (OSD-1) and
> then:
>
> ~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
> 123456
>
> ~# ll /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
> -rw-r--r-- 1 root root 7 Oct 15 03:46 /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
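>
> (Rather than building that long filestore path by hand, a simple find in the
> PG directory gives the same path; just a convenience, nothing more:)
>
> ~# find /var/lib/ceph/osd/ceph-1/current/3.30_head -name '10000000000.00000000*'
> /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3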
>
> Now, I change the content with this script called "change_content.sh" to
> preserve the mtime after the change:
>
> -----------------------------
> #!/bin/sh
>
> f="$1"
> f_tmp="${f}.tmp"
> content="$2"
> cp --preserve=all "$f" "$f_tmp"
> echo "$content" >"$f"
> touch -r "$f_tmp" "$f" # to restore the mtime after the change
> rm "$f_tmp"
> -----------------------------
>
> So, let's go: I replace the content with new content of exactly
> the same size (i.e. "ABCDEF" in this example):
>
> ~# ./change_content.sh /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3 ABCDEF
>
> ~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
> ABCDEF
>
> ~# ll /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
> -rw-r--r-- 1 root root 7 Oct 15 03:46 /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
>
> Now the secondary OSD contains the good version of the object and
> the primary a bad version. Then I launch a "ceph pg repair":
>
> ~# ceph pg repair 3.30
> instructing pg 3.30 on osd.1 to repair
>
> # I'm on the primary OSD node and the file below has been repaired correctly.
> ~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
> 123456
>
> As you can see, the repair command has worked well.
> Maybe my little test is too trivial?
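>
> (To double-check, I suppose I could also trigger a deep scrub on the PG
> afterwards and look at the health output, e.g.:
>
> ~# ceph pg deep-scrub 3.30
> ~# ceph health detail
>
> but the cat above already shows the repaired content.)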

Hmm, maybe David has some idea.

>
>>> Greg, if I understand you correctly, I shouldn't have too much confidence
>>> in the "ceph pg repair" command, is that right?
>>>
>>> But, if yes, what is the good way to repair a PG?
>>
>> Usually what we recommend is for those with 3 copies to find the
>> differing copy, delete it, and run a repair — then you know it'll
>> repair from a good version. But yeah, it's not as reliable as we'd
>> like it to be on its own.
>
> I would like to be sure I understand correctly. The process could be (in
> the case where size == 3; a rough sketch of it as a script follows the steps):
>
> 1. On each of the 3 OSDs where my object is stored:
>
>     md5sum /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}*
>
> 2. Normally, I will get the same result on 2 OSDs, and on the other
> OSD, let's call it OSD-X, the result will be different. So, on OSD-X,
> I run:
>
>     rm /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}*
>
> 3. And now I can run the "ceph pg repair" command without risk:
>
>     ceph pg repair $pg_id
>
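> The rough sketch mentioned above (the pg id and object name are just the
> ones of my example, and $id would have to be the local OSD id on each node):
>
> -----------------------------
> #!/bin/sh
> pg_id="3.30"                          # the damaged PG
> object_name="10000000000.00000000"    # the object to check
> id="1"                                # set to the local OSD id on each node
>
> # Step 1: run this on each of the 3 OSD nodes and compare the checksums.
> md5sum /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}*
>
> # Step 2: only on the OSD with the differing checksum (OSD-X):
> # rm /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}*
>
> # Step 3: then, from any node:
> # ceph pg repair $pg_id
> -----------------------------
>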
> Is that the correct process?

Yes, I would expect this to work.
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



