$ lsb_release -a
LSB Version:    core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 12.04.4 LTS
Release:        12.04
Codename:       precise

$ uname -a
Linux droopy 3.2.0-64-generic #97-Ubuntu SMP Wed Jun 4 22:04:21 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

On Sat, Jul 12, 2014 at 3:21 PM, Samuel Just <sam.just at inktank.com> wrote:

Also, what distribution and kernel version are you using?
-Sam

On Jul 12, 2014 10:46 AM, "Samuel Just" <sam.just at inktank.com> wrote:

When you see another one, can you include the xattrs on the files as well
(you can use the attr(1) utility)?
-Sam

On Sat, Jul 12, 2014 at 9:51 AM, Randy Smith <rbsmith at adams.edu> wrote:

That image is the root file system for a Linux LDAP server.

--
Randall Smith
Adams State University
www.adams.edu
719-587-7741

On Jul 12, 2014 10:34 AM, "Samuel Just" <sam.just at inktank.com> wrote:

Here's a diff of the two files. One of the two files appears to contain
ceph leveldb keys? Randy, do you have an idea of what this rbd image is
being used for (rb.0.b0ce3.238e1f29, that is)?
-Sam

On Fri, Jul 11, 2014 at 7:25 PM, Randy Smith <rbsmith at adams.edu> wrote:

Greetings,

Well, it happened again, with two pgs this time, still in the same rbd
image. They are at http://people.adams.edu/~rbsmith/osd.tar. I think I
grabbed the files correctly. If not, let me know and I'll try again on the
next failure. It certainly is happening often enough.

On Fri, Jul 11, 2014 at 3:39 PM, Samuel Just <sam.just at inktank.com> wrote:

And grab the xattrs as well.
-Sam

On Fri, Jul 11, 2014 at 2:39 PM, Samuel Just <sam.just at inktank.com> wrote:

Right.
-Sam

On Fri, Jul 11, 2014 at 2:05 PM, Randy Smith <rbsmith at adams.edu> wrote:

Greetings,

I'm using xfs.

Also, when, in a previous email, you asked if I could send the object, do
you mean the files from each server named something like this:

./3.c6_head/DIR_6/DIR_C/DIR_5/rb.0.b0ce3.238e1f29.00000000000b__head_34DC35C6__3

?

On Fri, Jul 11, 2014 at 2:00 PM, Samuel Just <sam.just at inktank.com> wrote:

Also, what filesystem are you using?
-Sam

On Fri, Jul 11, 2014 at 10:37 AM, Sage Weil <sweil at redhat.com> wrote:

One other thing we might also try is catching this earlier (on first read
of corrupt data) instead of waiting for scrub. If you are not super
performance sensitive, you can add

    filestore sloppy crc = true
    filestore sloppy crc block size = 524288

That will track and verify CRCs on any large (>512k) writes. Smaller block
sizes will give more precision and more checks, but will generate larger
xattrs and have a bigger impact on performance...

sage
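For reference, a minimal sketch of what Sage's suggestion looks like in
ceph.conf. Putting the options in the [osd] section and restarting the OSDs
afterwards are assumptions here, not steps Sage spells out:

    [osd]
        filestore sloppy crc = true
        filestore sloppy crc block size = 524288

The OSDs would presumably need a restart (or a runtime injection of the
options) before the CRC tracking starts, and per Sage's note it only covers
writes of at least the configured block size.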
On Fri, 11 Jul 2014, Samuel Just wrote:

When you get the next inconsistency, can you copy the actual objects from
the osd store trees and get them to us? That might provide a clue.
-Sam

On Fri, Jul 11, 2014 at 6:52 AM, Randy Smith <rbsmith at adams.edu> wrote:

On Thu, Jul 10, 2014 at 4:40 PM, Samuel Just <sam.just at inktank.com> wrote:

> It could be an indication of a problem on osd 5, but the timing is
> worrying. Can you attach your ceph.conf?

Attached.

> Have there been any osds going down, new osds added, anything to cause
> recovery?

I upgraded to firefly last week. As part of the upgrade I, obviously, had
to restart every osd. Also, I attempted to switch to the optimal tunables,
but doing so degraded 27% of my cluster and made most of my VMs
unresponsive. I switched back to the legacy tunables and everything was
happy again. Both of those operations, of course, caused recoveries. I
have made no changes since then.

> Anything in dmesg to indicate an fs problem?

Nothing. The system went inconsistent again this morning, again on the
same rbd but different osds this time.

2014-07-11 05:48:12.857657 osd.1 192.168.253.77:6801/12608 904 : [ERR] 3.76 shard 1: soid 1280076/rb.0.b0ce3.238e1f29.00000000025c/head//3 digest 2198242284 != known digest 3879754377
2014-07-11 05:49:29.020024 osd.1 192.168.253.77:6801/12608 905 : [ERR] 3.76 deep-scrub 0 missing, 1 inconsistent objects
2014-07-11 05:49:29.020029 osd.1 192.168.253.77:6801/12608 906 : [ERR] 3.76 deep-scrub 1 errors

$ ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 3.76 is active+clean+inconsistent, acting [1,2]
1 scrub errors

> Have you recently changed any settings?

I upgraded from bobtail to dumpling to firefly.
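A rough sketch of how the flagged object and its xattrs could be pulled
from the osd store trees for pg 3.76, as Sam asks. The data path below is
the usual default and an assumption, the exact file name would come from
find, and attr(1) is in Ubuntu's attr package:

    # run on each host in the acting set, here osd.1 and osd.2
    cd /var/lib/ceph/osd/ceph-1/current
    find 3.76_head -name 'rb.0.b0ce3.238e1f29.*'          # locate the object file
    attr -l <file-found-above>                            # list its xattr names
    attr -q -g ceph._ <file-found-above> | od -c | head   # dump one xattr's value
    cp -a <file-found-above> /tmp/                        # cp -a keeps data and xattrs together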
On Thu, Jul 10, 2014 at 2:58 PM, Randy Smith <rbsmith at adams.edu> wrote:

Greetings,

Just a follow-up on my original issue. `ceph pg repair ...` fixed the
problem. However, today I got another inconsistent pg. It's interesting to
me that this second error is in the same rbd image and appears to be
"close" to the previously inconsistent pg. (Even more fun, osd.5 was the
secondary in the first error and is the primary here, though the other osd
is different.)

Is this indicative of a problem on osd.5, or perhaps a clue into what's
causing firefly to be so inconsistent?

The relevant log entries are below.

2014-07-07 18:50:48.646407 osd.2 192.168.253.70:6801/56987 163 : [ERR] 3.c6 shard 2: soid 34dc35c6/rb.0.b0ce3.238e1f29.00000000000b/head//3 digest 2256074002 != known digest 3998068918
2014-07-07 18:51:36.936076 osd.2 192.168.253.70:6801/56987 164 : [ERR] 3.c6 deep-scrub 0 missing, 1 inconsistent objects
2014-07-07 18:51:36.936082 osd.2 192.168.253.70:6801/56987 165 : [ERR] 3.c6 deep-scrub 1 errors

2014-07-10 15:38:53.990328 osd.5 192.168.253.81:6800/10013 257 : [ERR] 3.41 shard 1: soid e183cc41/rb.0.b0ce3.238e1f29.00000000024c/head//3 digest 3224286363 != known digest 3409342281
2014-07-10 15:39:11.701276 osd.5 192.168.253.81:6800/10013 258 : [ERR] 3.41 deep-scrub 0 missing, 1 inconsistent objects
2014-07-10 15:39:11.701281 osd.5 192.168.253.81:6800/10013 259 : [ERR] 3.41 deep-scrub 1 errors
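Since osd.5 shows up in both incidents, one hedged way to look for a disk
or xfs problem on its host (device names are placeholders; smartctl comes
from the smartmontools package):

    dmesg | egrep -i 'xfs|i/o error|sector'     # block-layer or filesystem complaints
    smartctl -a /dev/sdX | egrep -i 'reallocated|pending|uncorrectable'
    xfs_repair -n /dev/sdX1                     # read-only check; only valid with the fs unmounted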
On Thu, Jul 10, 2014 at 12:05 PM, Chahal, Sudip <sudip.chahal at intel.com> wrote:

Thanks - so it appears that the advantage of the 3rd replica (relative to
2 replicas) has to do much more with recovering from two concurrent OSD
failures than with inconsistencies found during deep scrub - would you
agree?

Re: repair - do you mean the "repair" process during deep scrub - if yes,
this is automatic - correct? Or are you referring to the explicit,
manually initiated repair commands?

Thanks,

-Sudip

-----Original Message-----
From: Samuel Just [mailto:sam.just at inktank.com]
Sent: Thursday, July 10, 2014 10:50 AM
To: Chahal, Sudip
Cc: Christian Eichelmann; ceph-users at lists.ceph.com
Subject: Re: [ceph-users] scrub error on firefly

Repair I think will tend to choose the copy with the lowest osd number
which is not obviously corrupted. Even with three replicas, it does not do
any kind of voting at this time.
-Sam

On Thu, Jul 10, 2014 at 10:39 AM, Chahal, Sudip <sudip.chahal at intel.com> wrote:

I've a basic related question re: Firefly operation - would appreciate any
insights:

With three replicas, if checksum inconsistencies across replicas are found
during deep-scrub, then:
  a. does the majority win, or is the primary always the winner and used
     to overwrite the secondaries?
  b. is this reconciliation done automatically during deep-scrub, or does
     each reconciliation have to be executed manually by the administrator?

With 2 replicas - how are things different (if at all):
  a. The primary is declared the winner - correct?
  b. is this reconciliation done automatically during deep-scrub, or does
     it have to be done "manually" because there is no majority?

Thanks,

-Sudip
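For what it's worth, Sam's answer above and the way everyone in this thread
runs repair by hand fit the following minimal sketch of the firefly-era
flow (pg id taken from earlier in the thread): deep-scrub only reports the
mismatch, and nothing is overwritten until the administrator explicitly
asks:

    ceph pg deep-scrub 3.c6     # re-scrub one pg; mismatches are logged and flagged
    ceph health detail          # the pg shows up as active+clean+inconsistent
    ceph pg repair 3.c6         # the overwrite only happens on this separate, manual step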
-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf Of Samuel Just
Sent: Thursday, July 10, 2014 10:16 AM
To: Christian Eichelmann
Cc: ceph-users at lists.ceph.com
Subject: Re: [ceph-users] scrub error on firefly

Can you attach your ceph.conf for your osds?
-Sam

On Thu, Jul 10, 2014 at 8:01 AM, Christian Eichelmann <christian.eichelmann at 1und1.de> wrote:

I can also confirm that after upgrading to firefly, both of our clusters
(test and live) went from 0 scrub errors each for about 6 months to about
9-12 per week...

This also makes me kind of nervous, since as far as I know everything
"ceph pg repair" does is copy the primary object to all replicas, no
matter which object is the correct one. Of course the described method of
manual checking works (for pools with more than 2 replicas), but doing
this in a large cluster nearly every week is horribly time-consuming and
error prone.

It would be great to get an explanation for the increased number of scrub
errors since firefly. Were they just not detected correctly in previous
versions? Or is there maybe something wrong with the new code?

Actually, our company is currently preventing our projects from moving to
ceph because of this problem.

Regards,
Christian
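The manual check Christian refers to can be sketched like this for a pool
with three replicas. The osd ids, the third host, and the elided DIR_*
path components are placeholders rather than values from this thread:

    # checksum the same object file on each OSD holding the pg
    md5sum /var/lib/ceph/osd/ceph-2/current/3.c6_head/.../rb.0.b0ce3.238e1f29.00000000000b__head_34DC35C6__3
    md5sum /var/lib/ceph/osd/ceph-5/current/3.c6_head/.../rb.0.b0ce3.238e1f29.00000000000b__head_34DC35C6__3
    md5sum /var/lib/ceph/osd/ceph-7/current/3.c6_head/.../rb.0.b0ce3.238e1f29.00000000000b__head_34DC35C6__3

With three copies, the two matching sums identify the good data and the odd
one out is the bad replica; with only two copies there is no such majority,
which is exactly the complaint here.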
From: ceph-users [ceph-users-bounces at lists.ceph.com] on behalf of Travis Rhoden [trhoden at gmail.com]
Sent: Thursday, 10 July 2014 16:24
To: Gregory Farnum
Cc: ceph-users at lists.ceph.com
Subject: Re: [ceph-users] scrub error on firefly

And actually, just to follow up, it does seem like there are some
additional smarts beyond just using the primary to overwrite the
secondaries... Since I captured md5 sums before and after the repair, I
can say that in this particular instance the secondary copy was used to
overwrite the primary.

So I'm just trusting Ceph to do the right thing, and so far it seems to,
but the comments here about needing to determine the correct object and
place it on the primary PG make me wonder if I've been missing something.

- Travis

On Thu, Jul 10, 2014 at 10:19 AM, Travis Rhoden <trhoden at gmail.com> wrote:

I can also say that after a recent upgrade to Firefly, I have experienced
a massive uptick in scrub errors. The cluster was on cuttlefish for about
a year and had maybe one or two scrub errors. After upgrading to Firefly,
we've probably seen 3 to 4 dozen in the last month or so (was getting 2-3
a day for a few weeks until the whole cluster was rescrubbed, it seemed).

What I cannot determine, however, is how to know which object is busted.
For example, just today I ran into a scrub error. The object has two
copies and is an 8MB piece of an RBD, with identical timestamps and
identical xattr names and values, but the copies definitely have different
MD5 sums. How do I know which one is correct?

I've been just kicking off pg repair each time, which seems to just use
the primary copy to overwrite the others. I haven't run into any issues
with that so far, but it does make me nervous.

- Travis
On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum <greg at inktank.com> wrote:

It's not very intuitive or easy to look at right now (there are plans from
the recent developer summit to improve things), but the central log should
have output about exactly what objects are busted. You'll then want to
compare the copies manually to determine which ones are good or bad, get
the good copy on the primary (make sure you preserve xattrs), and run
repair.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
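A rough sketch of the procedure Greg describes, for the case where the bad
copy turns out to be on the primary. The hostname, the default OSD data
path, the use of noout and rsync, and the sysvinit-style service commands
are illustrative assumptions, not steps given in this thread:

    ceph osd set noout              # avoid rebalancing while the OSDs are down
    service ceph stop osd.2         # on the primary's host
    service ceph stop osd.5         # on the replica's host

    # copy the known-good replica over the primary's copy;
    # rsync -a keeps ownership and times, -X keeps the xattrs
    rsync -aX osd5-host:/var/lib/ceph/osd/ceph-5/current/3.c6_head/DIR_6/DIR_C/DIR_5/rb.0.b0ce3.238e1f29.00000000000b__head_34DC35C6__3 \
          /var/lib/ceph/osd/ceph-2/current/3.c6_head/DIR_6/DIR_C/DIR_5/

    service ceph start osd.2
    service ceph start osd.5
    ceph osd unset noout
    ceph pg repair 3.c6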
On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith <rbsmith at adams.edu> wrote:

Greetings,

I upgraded to firefly last week and I suddenly received this error:

    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors

ceph health detail shows the following:

    HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
    pg 3.c6 is active+clean+inconsistent, acting [2,5]
    1 scrub errors

The docs say that I can run `ceph pg repair 3.c6` to fix this. What I want
to know is what are the risks of data loss if I run that command in this
state, and how can I mitigate them?

--
Randall Smith
Computing Services
Adams State University
http://www.adams.edu/
719-587-7741
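One way to hedge the data-loss risk Randy asks about is to save both
existing copies, data and xattrs, before letting repair overwrite anything.
A sketch, with the backup directory as a placeholder and the default OSD
data path assumed:

    # on each host in the acting set [2,5], before running `ceph pg repair 3.c6`
    mkdir -p /root/pg3.c6-backup
    cd /var/lib/ceph/osd/ceph-2/current/3.c6_head/DIR_6/DIR_C/DIR_5    # ceph-5 on the other host
    md5sum rb.0.b0ce3.238e1f29.00000000000b__head_34DC35C6__3          # record which copy is which
    cp -a  rb.0.b0ce3.238e1f29.00000000000b__head_34DC35C6__3 /root/pg3.c6-backup/   # cp -a keeps xattrs

If repair then picks the wrong copy, the saved file can still be put back
by hand.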