Hi,

On 14/10/2015 06:45, Gregory Farnum wrote:
>> Ok, however during my tests I had been careful to replace the correct
>> file by a bad file with *exactly* the same size (the content of the
>> file was just a little string and I changed it to a string of exactly
>> the same size). I had also been careful to undo the mtime update (I
>> had restored the mtime of the file before the change). Despite this,
>> the "repair" command worked well. Tested twice: 1. with the change on
>> the primary OSD and 2. on the secondary OSD. And I was surprised
>> because I thought that test 1 (on the primary OSD) would fail.
>
> Hm. I'm a little confused by that, actually. Exactly what was the path
> to the files you changed, and do you have before-and-after comparisons
> on the content and metadata?

I didn't remember exactly the process I had followed, so I have just
retried it today. Here is my process.

I have a healthy cluster with 3 nodes (Ubuntu Trusty) running Ceph
Hammer (version 0.94.3). I have mounted CephFS on /mnt on one of the
nodes.

~# cat /mnt/file.txt    # yes, it's a little file. ;)
123456

~# ls -i /mnt/file.txt
1099511627776 /mnt/file.txt

~# printf "%x\n" 1099511627776
10000000000

~# rados -p data ls - | grep 10000000000
10000000000.00000000

I now have the name of the object mapped to my "file.txt".

~# ceph osd map data 10000000000.00000000
osdmap e76 pool 'data' (3) object '10000000000.00000000' -> pg 3.f0b56f30 (3.30) -> up ([1,2], p1) acting ([1,2], p1)

So my object is on the primary OSD (osd.1) and on the secondary OSD
(osd.2). So I open a terminal on the node which hosts the primary OSD
osd.1, and then:

~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
123456

~# ll /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
-rw-r--r-- 1 root root 7 Oct 15 03:46 /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3

Now I change the content with this script, called "change_content.sh",
which preserves the mtime across the change:

-----------------------------
#!/bin/sh

f="$1"
f_tmp="${f}.tmp"
content="$2"

cp --preserve=all "$f" "$f_tmp"
echo "$content" >"$f"
touch -r "$f_tmp" "$f"    # restore the mtime after the change
rm "$f_tmp"
-----------------------------

So, let's go: I replace the content with new content of exactly the
same size ("ABCDEF" in this example):

~# ./change_content.sh /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3 ABCDEF

~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
ABCDEF

~# ll /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
-rw-r--r-- 1 root root 7 Oct 15 03:46 /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3

Now the secondary OSD contains the good version of the object and the
primary OSD a bad version. So I launch a "ceph pg repair":

~# ceph pg repair 3.30
instructing pg 3.30 on osd.1 to repair

# On the primary OSD, the file below has been repaired correctly.
~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
123456

As you can see, the repair command has worked well. Maybe my little
test is too trivial?
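As an aside, the lookup steps above (inode -> hex -> object name -> PG
and OSDs) can be wrapped in a small helper. The sketch below is only
illustrative and is not part of the thread: the script name, the default
pool "data", and the assumption that the file is small enough to live in
a single ".00000000" object (as in this example) are assumptions of mine.

-----------------------------
#!/bin/sh
# locate_object.sh -- sketch: map a (small) CephFS file to its RADOS
# object and show the PG / acting OSDs, as done by hand above.

file="$1"
pool="${2:-data}"    # pool name assumed; "data" in the example above

inode=$(ls -i "$file" | awk '{print $1}')   # inode number of the file
obj=$(printf "%x.00000000" "$inode")        # hex(inode) + first-stripe suffix

echo "object: $obj"
rados -p "$pool" ls - | grep "^${obj%%.*}"  # confirm the object exists
ceph osd map "$pool" "$obj"                 # show PG and acting OSDs
-----------------------------

For example, "./locate_object.sh /mnt/file.txt data" should print the
same "ceph osd map" line as shown above.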
>> Greg, if I understand you well, I shouldn't have too much confidence
>> in the "ceph pg repair" command, is that correct?
>>
>> But, if yes, what is the good way to repair a PG?
>
> Usually what we recommend is for those with 3 copies to find the
> differing copy, delete it, and run a repair -- then you know it'll
> repair from a good version. But yeah, it's not as reliable as we'd
> like it to be on its own.

I would like to be sure that I understand correctly. The process would
be (in the case where size == 3):

1. On each of the 3 OSDs which hold my object, I run:

     md5sum /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}*

2. Normally, I will get the same result on 2 of the OSDs, and on the
   remaining OSD, let's call it OSD-X, the result will be different.
   So, on OSD-X, I run:

     rm /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}*

3. And now I can run the "ceph pg repair" command without risk:

     ceph pg repair $pg_id

Is this the correct process? (A rough scripted sketch of step 1 follows
below the signature.)

--
François Lafont
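For illustration, step 1 above could be scripted roughly as below. This
is only a sketch and not part of the thread: the script name, the
host / OSD-id list, and the assumptions of passwordless SSH to the OSD
nodes and of FileStore data under /var/lib/ceph/osd/ceph-$id are mine.
The differing copy is still removed by hand (step 2) before running
"ceph pg repair" (step 3).

-----------------------------
#!/bin/sh
# compare_replicas.sh -- sketch of step 1: checksum every copy of one
# object so the differing replica can be spotted.

pg_id="$1"          # e.g. 3.30
object_name="$2"    # e.g. 10000000000.00000000

# host:osd-id pairs for the PG's acting set, filled in by hand
replicas="node1:1 node2:2 node3:0"

for r in $replicas; do
    host=${r%%:*}
    id=${r##*:}
    echo "== osd.$id on $host =="
    # find also handles FileStore's hashed DIR_* subdirectories, not
    # only the flat ${pg_id}_head layout shown in the post
    ssh "$host" "find /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head -name '${object_name}*' -exec md5sum {} +"
done
-----------------------------

The copy whose md5sum differs from the other two is the one to delete
before running the repair.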