I have a basic related question about Firefly operation and would appreciate any insights.

With three replicas, if checksum inconsistencies across replicas are found during deep-scrub:
a. Does the majority win, or is the primary always the winner and used to overwrite the secondaries?
b. Is this reconciliation done automatically during deep-scrub, or does each reconciliation have to be executed manually by the administrator?

With two replicas, how are things different (if at all)?
a. The primary is declared the winner - correct?
b. Is this reconciliation done automatically during deep-scrub, or does it have to be done "manually" because there is no majority?

Thanks,
-Sudip

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Samuel Just
Sent: Thursday, July 10, 2014 10:16 AM
To: Christian Eichelmann
Cc: ceph-users at lists.ceph.com
Subject: Re: scrub error on firefly

Can you attach your ceph.conf for your osds?
-Sam

On Thu, Jul 10, 2014 at 8:01 AM, Christian Eichelmann <christian.eichelmann at 1und1.de> wrote:
> I can also confirm that after upgrading to firefly both of our
> clusters (test and live) went from 0 scrub errors each for about
> 6 months to about 9-12 per week...
> This also makes me kind of nervous, since as far as I know all that
> "ceph pg repair" does is copy the primary object to all replicas,
> no matter which object is the correct one.
> Of course the described method of manual checking works (for pools
> with more than 2 replicas), but doing this in a large cluster nearly
> every week is horribly time-consuming and error-prone.
> It would be great to get an explanation for the increased number of
> scrub errors since firefly. Were they just not detected correctly in
> previous versions? Or is there maybe something wrong with the new code?
>
> Actually, our company is currently preventing our projects from moving
> to ceph because of this problem.
>
> Regards,
> Christian
> ________________________________
> From: ceph-users [ceph-users-bounces at lists.ceph.com] on behalf of
> Travis Rhoden [trhoden at gmail.com]
> Sent: Thursday, July 10, 2014 16:24
> To: Gregory Farnum
> Cc: ceph-users at lists.ceph.com
> Subject: Re: [ceph-users] scrub error on firefly
>
> And actually, just to follow up, it does seem like there are some
> additional smarts beyond just using the primary to overwrite the
> secondaries... Since I captured md5 sums before and after the repair,
> I can say that in this particular instance the secondary copy was used
> to overwrite the primary.
> So I'm just trusting Ceph to do the right thing, and so far it seems to,
> but the comments here about needing to determine the correct object
> and place it on the primary make me wonder if I've been missing something.
>
> - Travis
>
>
> On Thu, Jul 10, 2014 at 10:19 AM, Travis Rhoden <trhoden at gmail.com> wrote:
>>
>> I can also say that after a recent upgrade to Firefly, I have
>> experienced a massive uptick in scrub errors. The cluster was on
>> Cuttlefish for about a year and had maybe one or two scrub errors.
>> After upgrading to Firefly, we've probably seen 3 to 4 dozen in the
>> last month or so (we were getting 2-3 a day for a few weeks, until
>> the whole cluster had been rescrubbed, it seemed).
>>
>> What I cannot determine, however, is how to tell which copy is busted.
>> For example, just today I ran into a scrub error. The object has two
>> copies and is an 8MB piece of an RBD; both copies have identical
>> timestamps and identical xattr names and values.
>> But they definitely have different MD5 sums. How do I know which one
>> is correct?
>>
>> I've just been kicking off pg repair each time, which seems to just
>> use the primary copy to overwrite the others. I haven't run into any
>> issues with that so far, but it does make me nervous.
>>
>> - Travis
>>
>>
>> On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum <greg at inktank.com> wrote:
>>>
>>> It's not very intuitive or easy to look at right now (there are
>>> plans from the recent developer summit to improve things), but the
>>> central log should have output about exactly what objects are
>>> busted. You'll then want to compare the copies manually to determine
>>> which ones are good or bad, get the good copy on the primary (make
>>> sure you preserve xattrs), and run repair.
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith <rbsmith at adams.edu> wrote:
>>> > Greetings,
>>> >
>>> > I upgraded to firefly last week and I suddenly received this error:
>>> >
>>> >   health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>>> >
>>> > ceph health detail shows the following:
>>> >
>>> >   HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>>> >   pg 3.c6 is active+clean+inconsistent, acting [2,5]
>>> >   1 scrub errors
>>> >
>>> > The docs say that I can run `ceph pg repair 3.c6` to fix this.
>>> > What I want to know is: what are the risks of data loss if I run
>>> > that command in this state, and how can I mitigate them?
>>> >
>>> > --
>>> > Randall Smith
>>> > Computing Services
>>> > Adams State University
>>> > http://www.adams.edu/
>>> > 719-587-7741
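
For anyone attempting the manual check Greg describes above, here is a rough sketch of the kind of commands involved, using Randy's pg 3.c6 with acting set [2,5] as the example. This is only a sketch: the paths assume a default Filestore layout under /var/lib/ceph and default log locations, "rb.0.1234.abc.000000000123" is a made-up placeholder for the object name you find in the logs, and moving the bad copy aside is just one commonly used approach, not an official procedure. Verify which copy is actually bad before touching anything.

  # 1. Find which object the deep-scrub flagged: check the cluster log on a
  #    monitor and/or the primary OSD's log.
  grep ERR /var/log/ceph/ceph.log
  grep ERR /var/log/ceph/ceph-osd.2.log

  # 2. Locate the on-disk copy on each OSD in the acting set and compare
  #    checksums plus xattr names/values (run on each OSD host; osd.2 shown).
  #    The object name below is a placeholder; use the name from the log.
  OBJ=$(find /var/lib/ceph/osd/ceph-2/current/3.c6_head/ -name '*rb.0.1234.abc.000000000123*')
  md5sum "$OBJ"
  getfattr -d -m '.*' "$OBJ"

  # 3. Once you are sure which copy is bad (assume osd.2's here), one common
  #    route is: stop that OSD, flush its journal, move the bad file aside as
  #    a backup, restart the OSD, and let repair recover the now-missing copy
  #    from the good one.
  service ceph stop osd.2            # or "stop ceph-osd id=2" on Ubuntu/upstart
  ceph-osd -i 2 --flush-journal
  mkdir -p /root/bad-object-backup
  mv "$OBJ" /root/bad-object-backup/
  service ceph start osd.2
  ceph pg repair 3.c6

The "move it aside" route sidesteps having to copy the good file over by hand and preserve its xattrs, but it is exactly the kind of step to double-check before running. On a 2-replica pool there is no majority to fall back on, so this manual comparison is what tells you whether repair is about to propagate the right data.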