Can you attach your ceph.conf for your osds?
-Sam

On Thu, Jul 10, 2014 at 8:01 AM, Christian Eichelmann
<christian.eichelmann at 1und1.de> wrote:
> I can also confirm that after upgrading to Firefly, both of our clusters
> (test and live) went from 0 scrub errors each for about six months to
> about 9-12 per week...
> This also makes me kind of nervous, since as far as I know, all "ceph
> pg repair" does is copy the primary object to all replicas, no matter
> which object is the correct one.
> Of course the described method of manual checking works (for pools with
> more than 2 replicas), but doing this in a large cluster nearly every
> week is horribly time-consuming and error-prone.
> It would be great to get an explanation for the increased number of
> scrub errors since Firefly. Were they just not detected correctly in
> previous versions? Or is there maybe something wrong with the new code?
>
> Actually, our company is currently keeping our projects from moving to
> Ceph because of this problem.
>
> Regards,
> Christian
> ________________________________
> From: ceph-users [ceph-users-bounces at lists.ceph.com] on behalf of
> Travis Rhoden [trhoden at gmail.com]
> Sent: Thursday, July 10, 2014 16:24
> To: Gregory Farnum
> Cc: ceph-users at lists.ceph.com
> Subject: Re: [ceph-users] scrub error on firefly
>
> And actually, just to follow up: it does seem like there are some
> additional smarts beyond just using the primary to overwrite the
> secondaries. Since I captured md5 sums before and after the repair, I
> can say that in this particular instance, the secondary copy was used to
> overwrite the primary. So I'm just trusting Ceph to do the right thing,
> and so far it seems to, but the comments here about needing to determine
> the correct object and place it on the primary PG make me wonder if I've
> been missing something.
>
> - Travis
>
>
> On Thu, Jul 10, 2014 at 10:19 AM, Travis Rhoden <trhoden at gmail.com> wrote:
>>
>> I can also say that after a recent upgrade to Firefly, I have
>> experienced a massive uptick in scrub errors. The cluster was on
>> Cuttlefish for about a year and had maybe one or two scrub errors.
>> After upgrading to Firefly, we've probably seen 3 to 4 dozen in the
>> last month or so (we were getting 2-3 a day for a few weeks until the
>> whole cluster was rescrubbed, it seemed).
>>
>> What I cannot determine, however, is how to know which object is
>> busted. For example, just today I ran into a scrub error. The object
>> has two copies and is an 8MB piece of an RBD, and has identical
>> timestamps and identical xattr names and values. But it definitely has
>> a different MD5 sum. How do I know which one is correct?
>>
>> I've been just kicking off "pg repair" each time, which seems to just
>> use the primary copy to overwrite the others. I haven't run into any
>> issues with that so far, but it does make me nervous.
>>
>> - Travis
>>
>>
>> On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum <greg at inktank.com> wrote:
>>>
>>> It's not very intuitive or easy to look at right now (there are plans
>>> from the recent developer summit to improve things), but the central
>>> log should have output about exactly which objects are busted. You'll
>>> then want to compare the copies manually to determine which ones are
>>> good or bad, get the good copy onto the primary (make sure you
>>> preserve xattrs), and run repair.
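>>>
>>> Very roughly, the manual comparison can look like this (untested, and
>>> the paths assume a default filestore layout, so treat it as a sketch;
>>> make sure nothing is writing to the pg before copying files around):
>>>
>>>     # On each OSD in the pg's acting set, find the object's file.
>>>     # ID, PGID and OBJ are placeholders for your osd id, pg id and
>>>     # object name:
>>>     find /var/lib/ceph/osd/ceph-$ID/current/${PGID}_head -name "*${OBJ}*"
>>>     # Compare content and xattrs of the copies:
>>>     md5sum <path-to-object-file>
>>>     getfattr -d -m- <path-to-object-file>
>>>     # If the primary has the bad copy, replace it with the good one,
>>>     # preserving xattrs:
>>>     rsync -aX <good-copy> <path-on-primary>
>>>     # and then run the repair:
>>>     ceph pg repair $PGID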
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith <rbsmith at adams.edu> wrote:
>>> > Greetings,
>>> >
>>> > I upgraded to Firefly last week and I suddenly received this error:
>>> >
>>> > health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>>> >
>>> > ceph health detail shows the following:
>>> >
>>> > HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>>> > pg 3.c6 is active+clean+inconsistent, acting [2,5]
>>> > 1 scrub errors
>>> >
>>> > The docs say that I can run `ceph pg repair 3.c6` to fix this. What
>>> > I want to know is: what are the risks of data loss if I run that
>>> > command in this state, and how can I mitigate them?
>>> >
>>> > --
>>> > Randall Smith
>>> > Computing Services
>>> > Adams State University
>>> > http://www.adams.edu/
>>> > 719-587-7741
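>>>
>>> For a two-copy pg like your 3.c6 there is no majority to vote with, so
>>> before repairing it's worth identifying the exact object. A rough
>>> sketch (untested; the log path is the usual default, and the pg id is
>>> just taken from your report):
>>>
>>>     # Which OSDs hold the pg (the primary is listed first):
>>>     ceph pg map 3.c6
>>>     # The central log should name the object in the scrub error lines:
>>>     grep 3.c6 /var/log/ceph/ceph.log | grep ERR
>>>     # Then compare that object's copies on osd.2 and osd.5 by hand,
>>>     # and only run "ceph pg repair 3.c6" once the primary (osd.2)
>>>     # holds the good copy.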