Zoltan,

It's good to hear that you were able to get the PGs stuck in 'remapped' back into a 'clean' state. Based on your response, I'm guessing that your failure domains (node, rack, or maybe row) are too close (or equal) to your replica size. For example, if your cluster looks like this:

3 replicas
3 racks (CRUSH set to use racks as the failure domain)
rack 1: 3 nodes
rack 2: 5 nodes
rack 3: 4 nodes

then CRUSH will sometimes have problems making sure each rack gets one of the copies (especially if you are doing reweights on OSDs in the first rack). Does that come close to describing your cluster? (I've put a few commands you could use to check this at the bottom of this mail, below the quoted text.)

I believe you're right about how 'ceph pg repair' works. I've run into this before, and one way I went about fixing it was to run md5sum on all the objects in the PG on each OSD and compare the results. My thinking was that I could track down the inconsistent objects by finding the ones where only 2 of the 3 md5s match:

ceph-01:
cd /var/lib/ceph/osd/ceph-14/current/3.1b0_head
find . -type f -exec md5sum '{}' \; | sort -k2 > /tmp/pg_3.1b0-osd.14-md5s.txt

ceph-02:
cd /var/lib/ceph/osd/ceph-47/current/3.1b0_head
find . -type f -exec md5sum '{}' \; | sort -k2 > /tmp/pg_3.1b0-osd.47-md5s.txt

ceph-04:
cd /var/lib/ceph/osd/ceph-29/current/3.1b0_head
find . -type f -exec md5sum '{}' \; | sort -k2 > /tmp/pg_3.1b0-osd.29-md5s.txt

Then, using vimdiff to do a 3-way diff of those three files, I was able to find the objects that differed between the OSDs, and based on that I could tell whether running the repair would cause a problem. (There's also a rough sketch for automating that comparison at the bottom of this mail.)

I believe that if you use btrfs instead of xfs for your filestore backend you get proper checksumming, but I don't know whether Ceph makes use of that information yet. I've also heard that btrfs slows down quite a bit over time when used for an OSD.

As for Jewel, I think the new bluestore backend includes checksums, but someone who's actually using it would have to confirm. Switching to bluestore will involve a lot of rebuilding too.

Bryan

On 2/15/16, 8:36 AM, "Zoltan Arnold Nagy" <zoltan@xxxxxxxxxxxxxxxxxx> wrote:

>Hi Bryan,
>
>You were right: we've modified our PG weights a little (from 1 to around
>0.85 on some OSDs) and once I've changed them back to 1, the remapped PGs
>and misplaced objects were gone.
>So thank you for the tip.
>
>For the inconsistent ones and scrub errors, I'm a little wary to use pg
>repair as that - if I understand correctly - only copies the primary PG's
>data to the other PGs and thus can easily corrupt the whole object if the
>primary is corrupted.
>
>I haven't seen an update on this since last May, when this was brought up
>as a concern by several people and there were mentions of adding
>checksumming to the metadata and doing a checksum comparison on repair.
>
>Can anybody give an update on how exactly pg repair works in Hammer or
>will work in Jewel?
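
P.S. In case it's useful, here's roughly how I'd double-check the failure-domain vs. replica-size situation I described above. This is just a sketch from memory against the Hammer-era CLI, and 'rbd' is only a placeholder for whichever pool you're looking at:

# Replica count for the pool
ceph osd pool get rbd size

# Which CRUSH ruleset the pool uses
ceph osd pool get rbd crush_ruleset

# Dump the rules and look at the 'chooseleaf ... type <bucket>' step;
# that bucket type (host, rack, row, ...) is your failure domain
ceph osd crush rule dump

# See how many buckets of that type you actually have and how the
# OSDs are spread across them
ceph osd tree

If the number of rack (or host) buckets is equal to, or barely above, the pool's size, CRUSH doesn't have much room to maneuver and you can end up with the kind of stuck 'remapped' PGs you described.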
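
P.P.S. If more than a handful of objects differ, vimdiff gets tedious. Here's a rough, untested sketch of how you could automate the comparison once the three md5 lists are copied to one host. The file names are just the ones from my example above, and it assumes the object file names contain no whitespace:

awk '{ md5[$2] = md5[$2] " " $1; count[$2]++ }
     END {
       for (obj in md5) {
         # flag objects that are missing on an OSD or whose three
         # checksums are not all identical
         split(md5[obj], s, " ")
         if (count[obj] < 3 || s[1] != s[2] || s[1] != s[3])
           print obj md5[obj]
       }
     }' /tmp/pg_3.1b0-osd.14-md5s.txt \
        /tmp/pg_3.1b0-osd.47-md5s.txt \
        /tmp/pg_3.1b0-osd.29-md5s.txt

Anything it prints is either missing from one of the OSDs or has checksums that don't all agree, which is exactly the "only 2 of the 3 match" case; the odd one out is the copy to be suspicious of before you run the repair.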