scrub error on firefly

trhoden@xxxxxxxxx (Travis Rhoden) · Thu, 10 Jul 2014 10:19:11 -0400

I can also say that after a recent upgrade to Firefly, I have experienced a
massive uptick in scrub errors.  The cluster was on Cuttlefish for about a
year and had maybe one or two scrub errors.  After upgrading to Firefly,
we've probably seen 3 to 4 dozen in the last month or so (we were getting 2-3
a day for a few weeks until, it seemed, the whole cluster had been rescrubbed).

What I cannot determine, however, is how to tell which copy of an object is
busted.  For example, just today I ran into a scrub error.  The object has two
copies and is an 8MB piece of an RBD image.  Both copies have identical
timestamps and identical xattr names and values, but they definitely have
different MD5 sums.  How do I know which one is correct?
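
For anyone wanting to repeat the comparison, here is a minimal Python sketch
of what I mean by checking the two copies. It assumes both replica files have
already been copied off their OSDs to local paths (the paths you pass in are
up to you; nothing here talks to Ceph). It only shows whether and where the
copies differ; it cannot say which copy is the good one.

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the offset of the first differing byte, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Locate the exact byte inside this chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # Common prefix matches, so one file is simply shorter.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files ended with no difference
            offset += len(a)
```

Knowing the offset at least lets you look at the bytes around the mismatch
(for example, a run of zeros in one copy) when deciding which copy to trust.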

I've just been kicking off a pg repair each time, which seems to simply use
the primary copy to overwrite the others.  I haven't run into any issues with
that so far, but it does make me nervous.
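
Before running a repair it helps to know which PGs are affected and which
OSDs hold the copies. As a rough sketch, the following Python pulls that out
of `ceph health detail` text in the format quoted below. The regular
expression and the function name are my own and only assume that line format;
the first OSD in the acting set is normally the primary, but treat that as an
assumption to verify on your cluster.

```python
import re

# Matches lines like: pg 3.c6 is active+clean+inconsistent, acting [2,5]
PG_LINE = re.compile(
    r"^pg (?P<pgid>\S+) is (?P<state>\S+), acting \[(?P<acting>[\d,]*)\]"
)

def inconsistent_pgs(health_detail):
    """Return (pgid, acting_osds) for every PG flagged inconsistent."""
    result = []
    for line in health_detail.splitlines():
        m = PG_LINE.match(line.strip())
        if m and "inconsistent" in m.group("state").split("+"):
            acting = [int(x) for x in m.group("acting").split(",") if x]
            result.append((m.group("pgid"), acting))
    return result

sample = """HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 3.c6 is active+clean+inconsistent, acting [2,5]
1 scrub errors"""

print(inconsistent_pgs(sample))  # [('3.c6', [2, 5])]
```

With the PG id and acting set in hand, you know which two OSDs to pull the
copies from before deciding whether the primary's copy is safe to repair from.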

 - Travis

On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum <greg at inktank.com> wrote:

> It's not very intuitive or easy to look at right now (there are plans
> from the recent developer summit to improve things), but the central
> log should have output about exactly what objects are busted. You'll
> then want to compare the copies manually to determine which ones are
> good or bad, get the good copy on the primary (make sure you preserve
> xattrs), and run repair.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith <rbsmith at adams.edu> wrote:
> > Greetings,
> >
> > I upgraded to firefly last week and I suddenly received this error:
> >
> > health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> >
> > ceph health detail shows the following:
> >
> > HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> > pg 3.c6 is active+clean+inconsistent, acting [2,5]
> > 1 scrub errors
> >
> > The docs say that I can run `ceph pg repair 3.c6` to fix this. What I want
> > to know is what are the risks of data loss if I run that command in this
> > state and how can I mitigate them?
> >
> > --
> > Randall Smith
> > Computing Services
> > Adams State University
> > http://www.adams.edu/
> > 719-587-7741
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users at lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >