scrub error on firefly

Can you attach your ceph.conf for your osds?
-Sam

On Thu, Jul 10, 2014 at 8:01 AM, Christian Eichelmann
<christian.eichelmann at 1und1.de> wrote:
> I can also confirm that after upgrading to Firefly, both of our clusters
> (test and live) went from zero scrub errors over roughly six months to
> about 9-12 per week...
> This also makes me kind of nervous, since as far as I know all "ceph
> pg repair" does is copy the primary's object to all replicas, regardless
> of which copy is actually correct.
> Of course the described method of manual checking works (for pools with more
> than 2 replicas), but doing this in a large cluster nearly every week is
> horribly time-consuming and error-prone.
> It would be great to get an explanation for the increased number of scrub
> errors since Firefly. Were they simply not detected in previous versions?
> Or is there something wrong with the new code?
>
> Actually, our company is currently preventing our projects from moving to
> Ceph because of this problem.
>
> Regards,
> Christian
> ________________________________
> From: ceph-users [ceph-users-bounces at lists.ceph.com] on behalf of Travis
> Rhoden [trhoden at gmail.com]
> Sent: Thursday, 10 July 2014 16:24
> To: Gregory Farnum
> Cc: ceph-users at lists.ceph.com
> Subject: Re: [ceph-users] scrub error on firefly
>
> And actually just to follow-up, it does seem like there are some additional
> smarts beyond just using the primary to overwrite the secondaries...  Since
> I captured md5 sums before and after the repair, I can say that in this
> particular instance, the secondary copy was used to overwrite the primary.
> So, I'm just trusting Ceph to do the right thing, and so far it seems to, but
> the comments here about needing to determine the correct object and place it
> on the primary PG make me wonder if I've been missing something.
>
>  - Travis
>
>
> On Thu, Jul 10, 2014 at 10:19 AM, Travis Rhoden <trhoden at gmail.com> wrote:
>>
>> I can also say that after a recent upgrade to Firefly, I have experienced a
>> massive uptick in scrub errors.  The cluster was on cuttlefish for about a
>> year and had maybe one or two scrub errors.  After upgrading to Firefly,
>> we've probably seen three to four dozen in the last month or so (we were
>> getting 2-3 a day for a few weeks until the whole cluster was rescrubbed,
>> it seemed).
>>
>> What I cannot determine, however, is which object is busted.
>> For example, just today I ran into a scrub error.  The object has two copies
>> and is an 8MB piece of an RBD image, with identical timestamps and identical
>> xattr names and values.  But the two copies definitely have different MD5
>> sums.  How do I know which one is correct?
>>
>> I've been just kicking off pg repair each time, which seems to just use
>> the primary copy to overwrite the others.  Haven't run into any issues with
>> that so far, but it does make me nervous.
>>
>>  - Travis
>>
>>
>> On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum <greg at inktank.com> wrote:
>>>
>>> It's not very intuitive or easy to look at right now (there are plans
>>> from the recent developer summit to improve things), but the central
>>> log should have output about exactly what objects are busted. You'll
>>> then want to compare the copies manually to determine which ones are
>>> good or bad, get the good copy on the primary (make sure you preserve
>>> xattrs), and run repair.
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
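The central-log lines Greg refers to can be grepped out of ceph.log on a
monitor host. A sketch of pulling the PG and object name out of such a line;
note the sample line below is an assumption modeled on firefly-era deep-scrub
errors, not a verbatim log, and the exact wording varies between releases:

```python
# Extract the PG id, shard, and object (soid) from a scrub-error log line.
import re

# Illustrative sample only; real firefly log lines differ in detail.
sample = ("osd.2 [ERR] 3.c6 shard 5: soid "
          "8a7f3dc6/rb.0.1234.abcdef.000000000567/head//3 "
          "digest 0x2d4a1c9f != known digest 0x9f8e7d6c")

def parse_scrub_error(line):
    """Return (pg, shard, object) from an [ERR] scrub line, or None."""
    m = re.search(r'\[ERR\]\s+(\S+)\s+shard\s+(\d+):\s+soid\s+(\S+)', line)
    if m is None:
        return None
    pg, shard, soid = m.groups()
    return pg, int(shard), soid

print(parse_scrub_error(sample))
# -> ('3.c6', 5, '8a7f3dc6/rb.0.1234.abcdef.000000000567/head//3')
```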
>>>
>>>
>>> On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith <rbsmith at adams.edu> wrote:
>>> > Greetings,
>>> >
>>> > I upgraded to firefly last week and I suddenly received this error:
>>> >
>>> > health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>>> >
>>> > ceph health detail shows the following:
>>> >
>>> > HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>>> > pg 3.c6 is active+clean+inconsistent, acting [2,5]
>>> > 1 scrub errors
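The acting set in that "ceph health detail" line also tells you which OSD is
primary, which matters here because repair propagates the primary's copy. A
small sketch of parsing it (format taken from the output above):

```python
# Parse a 'ceph health detail' pg line such as
# "pg 3.c6 is active+clean+inconsistent, acting [2,5]".
import re

def parse_pg_line(line):
    """Return (pgid, state, acting) from a pg detail line, or None."""
    m = re.match(r'pg (\S+) is (\S+), acting \[([\d,]+)\]', line)
    if m is None:
        return None
    pgid, state, acting = m.groups()
    return pgid, state, [int(x) for x in acting.split(',')]

pgid, state, acting = parse_pg_line(
    "pg 3.c6 is active+clean+inconsistent, acting [2,5]")
# acting[0] is the primary OSD: the copy that "ceph pg repair" starts from.
print(pgid, acting[0])
# -> 3.c6 2
```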
>>> >
>>> > The docs say that I can run `ceph pg repair 3.c6` to fix this. What I
>>> > want
>>> > to know is what are the risks of data loss if I run that command in
>>> > this
>>> > state and how can I mitigate them?
>>> >
>>> > --
>>> > Randall Smith
>>> > Computing Services
>>> > Adams State University
>>> > http://www.adams.edu/
>>> > 719-587-7741
>>> >
>>> > _______________________________________________
>>> > ceph-users mailing list
>>> > ceph-users at lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>
>>
>
>

