On 21.11.2014 at 20:12, Gregory Farnum wrote:
But in my case Ceph used a non-primary's copy to repair the data, while the two other OSDs had the same data (and one of them was the primary). That should not happen.

> On Fri, Nov 21, 2014 at 2:35 AM, Paweł Sadowski <ceph@xxxxxxxxx> wrote:
>> Hi,
>>
>> During deep-scrub Ceph discovered some inconsistency between OSDs on my
>> cluster (size 3, min size 2). I found the broken object and calculated
>> its md5sum on each OSD (osd.195 is acting_primary):
>>
>>   osd.195 - md5sum_aaaa
>>   osd.40  - md5sum_aaaa
>>   osd.314 - md5sum_bbbb
>>
>> I ran ceph pg repair and Ceph reported that everything went OK. I then
>> checked the md5sums of the objects again:
>>
>>   osd.195 - md5sum_bbbb
>>   osd.40  - md5sum_bbbb
>>   osd.314 - md5sum_bbbb
>>
>> This is a bit odd. How does Ceph decide which copy is the correct one?
>> Based on a last modification time/sequence number (or similar)? If yes,
>> why can the correct object be stored on one node only? If not, why did
>> Ceph select osd.314's copy as the correct one? And what would happen if
>> osd.314 went down? Would Ceph return wrong (old?) data, even with three
>> copies and no failure in the cluster?
>
> Right now, Ceph recovers replicated PGs by pushing the primary's copy to
> everybody. There are tickets to improve this, but for now it's best if
> you handle this yourself by moving the right things into place, or
> removing the primary's copy if it's incorrect before running the repair
> command. This is why it doesn't do repair automatically.
> -Greg

Besides that, there should be a big red warning in the documentation[1]
regarding ceph pg repair.

1: http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#pgs-inconsistent

Cheers,
PS
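[Editor's note] Greg's workaround (checksum the replicas yourself and move or remove the bad copy before running `ceph pg repair`) can be sketched as below. The `osd.*.copy` files are local stand-ins created purely for illustration; on a real cluster you would find the PG and acting set with `ceph osd map <pool> <object>`, locate each replica file under the corresponding OSD's data directory, and checksum those instead:

```shell
# Stand-ins for the same object as stored on each of the three OSDs;
# on a real cluster these would be the replica files on the OSD hosts.
mkdir -p /tmp/repair-demo && cd /tmp/repair-demo
printf 'good data' > osd.195.copy   # acting primary
printf 'good data' > osd.40.copy
printf 'bad data'  > osd.314.copy

# Checksum every copy; the odd one out is the candidate to move aside
# before asking Ceph to repair the PG.
md5sum osd.195.copy osd.40.copy osd.314.copy

# Count distinct checksums: anything greater than 1 means the replicas
# disagree and manual intervention is needed before "ceph pg repair".
md5sum osd.*.copy | awk '{print $1}' | sort -u | wc -l
```

Note that, as Greg says, repair pushes the primary's copy to everybody, so the step that matters is removing (or replacing) the primary's copy when it is the wrong one, not just identifying the mismatch.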
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com