Zoltan,

It's good to hear that you were able to get the PGs stuck in 'remapped' back into a 'clean' state. Based on your response, I'm guessing that your failure domains (node, rack, or maybe row) are too close (or equal) to your replica size. For example, if your cluster looks like this:

3 replicas
3 racks (CRUSH set to use racks as the failure domain)
rack 1: 3 nodes
rack 2: 5 nodes
rack 3: 4 nodes

then CRUSH will sometimes have problems making sure each rack gets one of the copies (especially if you are doing reweights on OSDs in the first rack). Does that come close to describing your cluster? (I've put a few commands you could use to check this at the bottom of this mail, below the quoted text.)

I believe you're right about how 'ceph pg repair' works. I've run into this before, and one way I went about fixing it was to run md5sum on all the objects in the PG on each OSD and compare the results. My thinking was that I could track down the inconsistent objects by finding the ones where only 2 of the 3 md5s match:

ceph-01:
cd /var/lib/ceph/osd/ceph-14/current/3.1b0_head
find . -type f -exec md5sum '{}' \; | sort -k2 > /tmp/pg_3.1b0-osd.14-md5s.txt

ceph-02:
cd /var/lib/ceph/osd/ceph-47/current/3.1b0_head
find . -type f -exec md5sum '{}' \; | sort -k2 > /tmp/pg_3.1b0-osd.47-md5s.txt

ceph-04:
cd /var/lib/ceph/osd/ceph-29/current/3.1b0_head
find . -type f -exec md5sum '{}' \; | sort -k2 > /tmp/pg_3.1b0-osd.29-md5s.txt

Then, using vimdiff to do a 3-way diff of those three files, I was able to find the objects that differed between the OSDs, and based on that I could tell whether running the repair would cause a problem. (There's also a rough sketch for automating that comparison at the bottom of this mail.)

I believe that if you use btrfs instead of xfs for your filestore backend you get proper checksumming, but I don't know whether Ceph makes use of that information yet. I've also heard that btrfs slows down quite a bit over time when used for an OSD.

As for Jewel, I think the new bluestore backend includes checksums, but someone who's actually using it would have to confirm. Switching to bluestore will involve a lot of rebuilding too.

Bryan

On 2/15/16, 8:36 AM, "Zoltan Arnold Nagy" <zoltan@xxxxxxxxxxxxxxxxxx> wrote:

>Hi Bryan,
>
>You were right: we've modified our PG weights a little (from 1 to around
>0.85 on some OSDs) and once I've changed them back to 1, the remapped PGs
>and misplaced objects were gone.
>So thank you for the tip.
>
>For the inconsistent ones and scrub errors, I'm a little wary to use pg
>repair as that - if I understand correctly - only copies the primary PG's
>data to the other PGs and thus can easily corrupt the whole object if the
>primary is corrupted.
>
>I haven't seen an update on this since last May, when this was brought up
>as a concern by several people and there were mentions of adding
>checksumming to the metadata and doing a checksum comparison on repair.
>
>Can anybody give an update on how exactly pg repair works in Hammer or
>will work in Jewel?
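
P.S. In case it's useful, here's roughly how I'd double-check the failure-domain vs. replica-size situation I described above. This is just a sketch from memory against the Hammer-era CLI, and 'rbd' is only a placeholder for whichever pool you're looking at:

# Replica count for the pool
ceph osd pool get rbd size

# Which CRUSH ruleset the pool uses
ceph osd pool get rbd crush_ruleset

# Dump the rules and look at the 'chooseleaf ... type <bucket>' step;
# that bucket type (host, rack, row, ...) is your failure domain
ceph osd crush rule dump

# See how many buckets of that type you actually have and how the
# OSDs are spread across them
ceph osd tree

If the number of rack (or host) buckets is equal to, or barely above, the pool's size, CRUSH doesn't have much room to maneuver and you can end up with the kind of stuck 'remapped' PGs you described.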
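
P.P.S. If more than a handful of objects differ, vimdiff gets tedious. Here's a rough, untested sketch of how you could automate the comparison once the three md5 lists are copied to one host. The file names are just the ones from my example above, and it assumes the object file names contain no whitespace:

awk '{ md5[$2] = md5[$2] " " $1; count[$2]++ }
     END {
       for (obj in md5) {
         # flag objects that are missing on an OSD or whose three
         # checksums are not all identical
         split(md5[obj], s, " ")
         if (count[obj] < 3 || s[1] != s[2] || s[1] != s[3])
           print obj md5[obj]
       }
     }' /tmp/pg_3.1b0-osd.14-md5s.txt \
        /tmp/pg_3.1b0-osd.47-md5s.txt \
        /tmp/pg_3.1b0-osd.29-md5s.txt

Anything it prints is either missing from one of the OSDs or has checksums that don't all agree, which is exactly the "only 2 of the 3 match" case; the odd one out is the copy to be suspicious of before you run the repair.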