scrub error on firefly

rbsmith@xxxxxxxxx (Randy Smith) · Mon, 14 Jul 2014 07:50:35 -0600

$ lsb_release -a
LSB Version:    core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 12.04.4 LTS
Release:        12.04
Codename:       precise

$ uname -a
Linux droopy 3.2.0-64-generic #97-Ubuntu SMP Wed Jun 4 22:04:21 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

On Sat, Jul 12, 2014 at 3:21 PM, Samuel Just <sam.just at inktank.com> wrote:

> Also, what distribution and kernel version are you using?
> -Sam
> On Jul 12, 2014 10:46 AM, "Samuel Just" <sam.just at inktank.com> wrote:
>
>> When you see another one, can you include the xattrs on the files as
>> well (you can use the attr(1) utility)?
>> -Sam
>>
>> On Sat, Jul 12, 2014 at 9:51 AM, Randy Smith <rbsmith at adams.edu> wrote:
>> > That image is the root file system for a linux ldap server.
>> >
>> > --
>> > Randall Smith
>> > Adams State University
>> > www.adams.edu
>> > 719-587-7741
>> >
>> > On Jul 12, 2014 10:34 AM, "Samuel Just" <sam.just at inktank.com> wrote:
>> >>
>> >> Here's a diff of the two files.  One of the two files appears to
>> >> contain ceph leveldb keys?  Randy, do you have an idea of what this
>> >> rbd image is being used for (rb.0.b0ce3.238e1f29, that is).
>> >> -Sam
>> >>
>> >> On Fri, Jul 11, 2014 at 7:25 PM, Randy Smith <rbsmith at adams.edu> wrote:
>> >> > Greetings,
>> >> >
>> >> > Well it happened again with two pgs this time, still in the same rbd
>> >> > image. They are at http://people.adams.edu/~rbsmith/osd.tar. I think I
>> >> > grabbed the files correctly. If not, let me know and I'll try again on
>> >> > the next failure. It certainly is happening often enough.
>> >> >
>> >> >
>> >> > On Fri, Jul 11, 2014 at 3:39 PM, Samuel Just <sam.just at inktank.com>
>> >> > wrote:
>> >> >>
>> >> >> And grab the xattrs as well.
>> >> >> -Sam
>> >> >>
>> >> >> On Fri, Jul 11, 2014 at 2:39 PM, Samuel Just <sam.just at inktank.com>
>> >> >> wrote:
>> >> >> > Right.
>> >> >> > -Sam
>> >> >> >
>> >> >> > On Fri, Jul 11, 2014 at 2:05 PM, Randy Smith <rbsmith at adams.edu>
>> >> >> > wrote:
>> >> >> >> Greetings,
>> >> >> >>
>> >> >> >> I'm using xfs.
>> >> >> >>
>> >> >> >> Also, when, in a previous email, you asked if I could send the
>> >> >> >> object, do you mean the files from each server named something
>> >> >> >> like this:
>> >> >> >>
>> >> >> >> ./3.c6_head/DIR_6/DIR_C/DIR_5/rb.0.b0ce3.238e1f29.00000000000b__head_34DC35C6__3
>> >> >> >> ?
>> >> >> >>
>> >> >> >>
>> >> >> >> On Fri, Jul 11, 2014 at 2:00 PM, Samuel Just <sam.just at inktank.com>
>> >> >> >> wrote:
>> >> >> >>>
>> >> >> >>> Also, what filesystem are you using?
>> >> >> >>> -Sam
>> >> >> >>>
>> >> >> >>> On Fri, Jul 11, 2014 at 10:37 AM, Sage Weil <sweil at redhat.com>
>> >> >> >>> wrote:
>> >> >> >>> > One other thing we might also try is catching this earlier (on
>> >> >> >>> > first read of corrupt data) instead of waiting for scrub.  If you
>> >> >> >>> > are not super performance sensitive, you can add
>> >> >> >>> >
>> >> >> >>> >  filestore sloppy crc = true
>> >> >> >>> >  filestore sloppy crc block size = 524288
>> >> >> >>> >
>> >> >> >>> > That will track and verify CRCs on any large (>512k) writes.
>> >> >> >>> > Smaller block sizes will give more precision and more checks, but
>> >> >> >>> > will generate larger xattrs and have a bigger impact on
>> >> >> >>> > performance...
>> >> >> >>> >
>> >> >> >>> > sage
>> >> >> >>> >
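As a rough illustration of the idea behind the sloppy CRC settings quoted above, checksums can be recorded per block at write time and re-checked on a later read. This is only a sketch and not Ceph's implementation: the function names are hypothetical, and zlib's CRC-32 is used simply because it ships with Python, so the checksum Ceph itself stores may well differ.

```python
import zlib

BLOCK_SIZE = 524288  # bytes; same value as the block size setting quoted above


def block_crcs(data, block_size=BLOCK_SIZE):
    """Return one CRC-32 per block_size-sized block of data."""
    return [
        zlib.crc32(data[offset:offset + block_size])
        for offset in range(0, len(data), block_size)
    ]


def first_bad_block(data, stored, block_size=BLOCK_SIZE):
    """Return the index of the first block that no longer matches, or None."""
    current = block_crcs(data, block_size)
    for index, (now, then) in enumerate(zip(current, stored)):
        if now != then:
            return index
    if len(current) != len(stored):
        # The data grew or shrank since the checksums were recorded.
        return min(len(current), len(stored))
    return None
```

A smaller block size narrows down where the damage is, at the cost of more checksums to store and verify, which is the trade-off described in the message.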
>> >> >> >>> >
>> >> >> >>> > On Fri, 11 Jul 2014, Samuel Just wrote:
>> >> >> >>> >
>> >> >> >>> >> When you get the next inconsistency, can you copy the actual
>> >> >> >>> >> objects from the osd store trees and get them to us?  That might
>> >> >> >>> >> provide a clue.
>> >> >> >>> >> -Sam
>> >> >> >>> >>
>> >> >> >>> >> On Fri, Jul 11, 2014 at 6:52 AM, Randy Smith <rbsmith at adams.edu>
>> >> >> >>> >> wrote:
>> >> >> >>> >> >
>> >> >> >>> >> >
>> >> >> >>> >> >
>> >> >> >>> >> > On Thu, Jul 10, 2014 at 4:40 PM, Samuel Just
>> >> >> >>> >> > <sam.just at inktank.com> wrote:
>> >> >> >>> >> >>
>> >> >> >>> >> >> It could be an indication of a problem on osd 5, but the
>> >> >> >>> >> >> timing is worrying.  Can you attach your ceph.conf?
>> >> >> >>> >> >
>> >> >> >>> >> >
>> >> >> >>> >> > Attached.
>> >> >> >>> >> >
>> >> >> >>> >> >>
>> >> >> >>> >> >> Have there been any osds
>> >> >> >>> >> >> going down, new osds added, anything to cause recovery?
>> >> >> >>> >> >
>> >> >> >>> >> >
>> >> >> >>> >> > I upgraded to firefly last week. As part of the upgrade I,
>> >> >> >>> >> > obviously, had to restart every osd. Also, I attempted to
>> >> >> >>> >> > switch to the optimal tunables but doing so degraded 27% of
>> >> >> >>> >> > my cluster and made most of my VMs unresponsive. I switched
>> >> >> >>> >> > back to the legacy tunables and everything was happy again.
>> >> >> >>> >> > Both of those operations, of course, caused recoveries. I
>> >> >> >>> >> > have made no changes since then.
>> >> >> >>> >> >
>> >> >> >>> >> >>
>> >> >> >>> >> >>  Anything in
>> >> >> >>> >> >> dmesg to indicate an fs problem?
>> >> >> >>> >> >
>> >> >> >>> >> >
>> >> >> >>> >> > Nothing. The system went inconsistent again this morning,
>> >> >> >>> >> > again on the same rbd but different osds this time.
>> >> >> >>> >> >
>> >> >> >>> >> > 2014-07-11 05:48:12.857657 osd.1 192.168.253.77:6801/12608 904 : [ERR] 3.76 shard 1: soid 1280076/rb.0.b0ce3.238e1f29.00000000025c/head//3 digest 2198242284 != known digest 3879754377
>> >> >> >>> >> > 2014-07-11 05:49:29.020024 osd.1 192.168.253.77:6801/12608 905 : [ERR] 3.76 deep-scrub 0 missing, 1 inconsistent objects
>> >> >> >>> >> > 2014-07-11 05:49:29.020029 osd.1 192.168.253.77:6801/12608 906 : [ERR] 3.76 deep-scrub 1 errors
>> >> >> >>> >> >
>> >> >> >>> >> > $ ceph health detail
>> >> >> >>> >> > HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>> >> >> >>> >> > pg 3.76 is active+clean+inconsistent, acting [1,2]
>> >> >> >>> >> > 1 scrub errors
>> >> >> >>> >> >
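For anyone collecting these reports across several failures, the digest-mismatch lines can be pulled apart mechanically. The sketch below assumes the log layout seen in this thread (it is not an official format description), and the function name and field names are made up for illustration.

```python
import re

# Pattern inferred from the "[ERR] ... digest X != known digest Y" lines quoted
# in this thread; other Ceph releases may format these lines differently.
SCRUB_DIGEST_RE = re.compile(
    r"(?P<osd>osd\.\d+) \S+ \d+ : \[ERR\] (?P<pg>\S+) shard (?P<shard>\d+): "
    r"soid (?P<soid>\S+) digest (?P<digest>\d+) != known digest (?P<known>\d+)"
)


def parse_digest_mismatch(line):
    """Return the fields of a digest-mismatch line as a dict, or None."""
    # Collapse runs of whitespace so lines re-wrapped by a mail client still match.
    match = SCRUB_DIGEST_RE.search(" ".join(line.split()))
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("shard", "digest", "known"):
        fields[key] = int(fields[key])
    return fields
```

Grouping the results by the rb.0.b0ce3.238e1f29 prefix in the soid would show at a glance whether new errors keep landing in the same rbd image.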
>> >> >> >>> >> >
>> >> >> >>> >> >>
>> >> >> >>> >> >>  Have you recently changed any
>> >> >> >>> >> >> settings?
>> >> >> >>> >> >
>> >> >> >>> >> >
>> >> >> >>> >> > I upgraded from bobtail to dumpling to firefly.
>> >> >> >>> >> >
>> >> >> >>> >> >>
>> >> >> >>> >> >> -Sam
>> >> >> >>> >> >>
>> >> >> >>> >> >> On Thu, Jul 10, 2014 at 2:58 PM, Randy Smith
>> >> >> >>> >> >> <rbsmith at adams.edu>
>> >> >> >>> >> >> wrote:
>> >> >> >>> >> >> > Greetings,
>> >> >> >>> >> >> >
>> >> >> >>> >> >> > Just a follow up on my original issue. =ceph pg repair ...=
>> >> >> >>> >> >> > fixed the problem. However, today I got another inconsistent
>> >> >> >>> >> >> > pg. It's interesting to me that this second error is in the
>> >> >> >>> >> >> > same rbd image and appears to be "close" to the previously
>> >> >> >>> >> >> > inconsistent pg. (Even more fun, osd.5 was the secondary in
>> >> >> >>> >> >> > the first error and is the primary here though the other
>> >> >> >>> >> >> > osd is different.)
>> >> >> >>> >> >> >
>> >> >> >>> >> >> > Is this indicative of a problem on osd.5 or perhaps a clue
>> >> >> >>> >> >> > into what's causing firefly to be so inconsistent?
>> >> >> >>> >> >> >
>> >> >> >>> >> >> > The relevant log entries are below.
>> >> >> >>> >> >> >
>> >> >> >>> >> >> > 2014-07-07 18:50:48.646407 osd.2 192.168.253.70:6801/56987 163 : [ERR] 3.c6 shard 2: soid 34dc35c6/rb.0.b0ce3.238e1f29.00000000000b/head//3 digest 2256074002 != known digest 3998068918
>> >> >> >>> >> >> > 2014-07-07 18:51:36.936076 osd.2 192.168.253.70:6801/56987 164 : [ERR] 3.c6 deep-scrub 0 missing, 1 inconsistent objects
>> >> >> >>> >> >> > 2014-07-07 18:51:36.936082 osd.2 192.168.253.70:6801/56987 165 : [ERR] 3.c6 deep-scrub 1 errors
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >
>> >> >> >>> >> >> > 2014-07-10 15:38:53.990328 osd.5 192.168.253.81:6800/10013 257 : [ERR] 3.41 shard 1: soid e183cc41/rb.0.b0ce3.238e1f29.00000000024c/head//3 digest 3224286363 != known digest 3409342281
>> >> >> >>> >> >> > 2014-07-10 15:39:11.701276 osd.5 192.168.253.81:6800/10013 258 : [ERR] 3.41 deep-scrub 0 missing, 1 inconsistent objects
>> >> >> >>> >> >> > 2014-07-10 15:39:11.701281 osd.5 192.168.253.81:6800/10013 259 : [ERR] 3.41 deep-scrub 1 errors
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >
>> >> >> >>> >> >> > On Thu, Jul 10, 2014 at 12:05 PM, Chahal, Sudip
>> >> >> >>> >> >> > <sudip.chahal at intel.com>
>> >> >> >>> >> >> > wrote:
>> >> >> >>> >> >> >>
>> >> >> >>> >> >> >> Thanks - so it appears that the advantage of the 3rd replica
>> >> >> >>> >> >> >> (relative to 2 replicas) has to do much more with recovering
>> >> >> >>> >> >> >> from two concurrent OSD failures than with inconsistencies
>> >> >> >>> >> >> >> found during deep scrub - would you agree?
>> >> >> >>> >> >> >>
>> >> >> >>> >> >> >> Re: repair - do you mean the "repair" process during deep
>> >> >> >>> >> >> >> scrub - if yes, this is automatic - correct?
>> >> >> >>> >> >> >>     Or
>> >> >> >>> >> >> >> Are you referring to the explicit manually initiated repair
>> >> >> >>> >> >> >> commands?
>> >> >> >>> >> >> >>
>> >> >> >>> >> >> >> Thanks,
>> >> >> >>> >> >> >>
>> >> >> >>> >> >> >> -Sudip
>> >> >> >>> >> >> >>
>> >> >> >>> >> >> >> -----Original Message-----
>> >> >> >>> >> >> >> From: Samuel Just [mailto:sam.just at inktank.com]
>> >> >> >>> >> >> >> Sent: Thursday, July 10, 2014 10:50 AM
>> >> >> >>> >> >> >> To: Chahal, Sudip
>> >> >> >>> >> >> >> Cc: Christian Eichelmann; ceph-users at lists.ceph.com
>> >> >> >>> >> >> >> Subject: Re: [ceph-users] scrub error on firefly
>> >> >> >>> >> >> >>
>> >> >> >>> >> >> >> Repair I think will tend to choose the copy with the lowest
>> >> >> >>> >> >> >> osd number which is not obviously corrupted.  Even with
>> >> >> >>> >> >> >> three replicas, it does not do any kind of voting at this
>> >> >> >>> >> >> >> time.
>> >> >> >>> >> >> >> -Sam
>> >> >> >>> >> >> >>
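Since repair is described above as not voting, an administrator who wants a majority check has to do it by hand. The sketch below shows one way that manual comparison could look; it is a hypothetical helper, not something Ceph provides, and the digests are assumed to have been gathered separately (for example by checksumming each replica's copy of the object). A majority is only a hint: two copies can agree and still both be wrong.

```python
from collections import Counter


def majority_digest(digests):
    """Pick the digest held by a strict majority of replicas.

    digests maps a replica name (e.g. "osd.2") to that replica's digest.
    Returns (digest, outliers); digest is None when there is no strict
    majority, in which case every replica is listed as an outlier.
    """
    if not digests:
        raise ValueError("no replica digests supplied")
    digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(digests):
        return None, sorted(digests)
    outliers = sorted(name for name, value in digests.items() if value != digest)
    return digest, outliers
```

With two replicas that disagree there is never a strict majority, which is why the two-replica case in this thread cannot be settled by counting alone.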
>> >> >> >>> >> >> >> On Thu, Jul 10, 2014 at 10:39 AM, Chahal, Sudip
>> >> >> >>> >> >> >> <sudip.chahal at intel.com>
>> >> >> >>> >> >> >> wrote:
>> >> >> >>> >> >> >> > I've a basic related question re: Firefly operation - would
>> >> >> >>> >> >> >> > appreciate any insights:
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> > With three replicas, if checksum inconsistencies across
>> >> >> >>> >> >> >> > replicas are found during deep-scrub then:
>> >> >> >>> >> >> >> >         a. does the majority win or is the primary always
>> >> >> >>> >> >> >> >            the winner and used to overwrite the secondaries
>> >> >> >>> >> >> >> >         b. is this reconciliation done automatically during
>> >> >> >>> >> >> >> >            deep-scrub or does each reconciliation have to be
>> >> >> >>> >> >> >> >            executed manually by the administrator?
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> > With 2 replicas - how are things different (if at all):
>> >> >> >>> >> >> >> >         a. The primary is declared the winner - correct?
>> >> >> >>> >> >> >> >         b. is this reconciliation done automatically during
>> >> >> >>> >> >> >> >            deep-scrub or does it have to be done "manually"
>> >> >> >>> >> >> >> >            because there is no majority?
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> > Thanks,
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> > -Sudip
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> > -----Original Message-----
>> >> >> >>> >> >> >> > From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com]
>> >> >> >>> >> >> >> > On Behalf Of Samuel Just
>> >> >> >>> >> >> >> > Sent: Thursday, July 10, 2014 10:16 AM
>> >> >> >>> >> >> >> > To: Christian Eichelmann
>> >> >> >>> >> >> >> > Cc: ceph-users at lists.ceph.com
>> >> >> >>> >> >> >> > Subject: Re: [ceph-users] scrub error on firefly
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> > Can you attach your ceph.conf for your osds?
>> >> >> >>> >> >> >> > -Sam
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> > On Thu, Jul 10, 2014 at 8:01 AM, Christian Eichelmann
>> >> >> >>> >> >> >> > <christian.eichelmann at 1und1.de> wrote:
>> >> >> >>> >> >> >> >> I can also confirm that after upgrading to firefly both of
>> >> >> >>> >> >> >> >> our clusters (test and live) were going from 0 scrub errors
>> >> >> >>> >> >> >> >> each for about 6 months to about 9-12 per week...
>> >> >> >>> >> >> >> >> This also makes me kind of nervous, since as far as I know
>> >> >> >>> >> >> >> >> everything "ceph pg repair" does, is to copy the primary
>> >> >> >>> >> >> >> >> object to all replicas, no matter which object is the
>> >> >> >>> >> >> >> >> correct one.
>> >> >> >>> >> >> >> >> Of course the described method of manual checking works
>> >> >> >>> >> >> >> >> (for pools with more than 2 replicas), but doing this in a
>> >> >> >>> >> >> >> >> large cluster nearly every week is horribly time-consuming
>> >> >> >>> >> >> >> >> and error prone.
>> >> >> >>> >> >> >> >> It would be great to get an explanation for the increased
>> >> >> >>> >> >> >> >> numbers of scrub errors since firefly. Were they just not
>> >> >> >>> >> >> >> >> detected correctly in previous versions? Or is there maybe
>> >> >> >>> >> >> >> >> something wrong with the new code?
>> >> >> >>> >> >> >> >>
>> >> >> >>> >> >> >> >> Actually, our company is currently preventing our projects
>> >> >> >>> >> >> >> >> from moving to ceph because of this problem.
>> >> >> >>> >> >> >> >>
>> >> >> >>> >> >> >> >> Regards,
>> >> >> >>> >> >> >> >> Christian
>> >> >> >>> >> >> >> >> ________________________________
>> >> >> >>> >> >> >> >> From: ceph-users [ceph-users-bounces at lists.ceph.com]
>> >> >> >>> >> >> >> >> on behalf of Travis Rhoden [trhoden at gmail.com]
>> >> >> >>> >> >> >> >> Sent: Thursday, July 10, 2014 16:24
>> >> >> >>> >> >> >> >> To: Gregory Farnum
>> >> >> >>> >> >> >> >> Cc: ceph-users at lists.ceph.com
>> >> >> >>> >> >> >> >> Subject: Re: [ceph-users] scrub error on firefly
>> >> >> >>> >> >> >> >>
>> >> >> >>> >> >> >> >> And actually just to follow-up, it does seem like there are
>> >> >> >>> >> >> >> >> some additional smarts beyond just using the primary to
>> >> >> >>> >> >> >> >> overwrite the secondaries...  Since I captured md5 sums
>> >> >> >>> >> >> >> >> before and after the repair, I can say that in this
>> >> >> >>> >> >> >> >> particular instance, the secondary copy was used to
>> >> >> >>> >> >> >> >> overwrite the primary.
>> >> >> >>> >> >> >> >> So, I'm just trusting Ceph to do the right thing, and so
>> >> >> >>> >> >> >> >> far it seems to, but the comments here about needing to
>> >> >> >>> >> >> >> >> determine the correct object and place it on the primary PG
>> >> >> >>> >> >> >> >> make me wonder if I've been missing something.
>> >> >> >>> >> >> >> >>
>> >> >> >>> >> >> >> >>  - Travis
>> >> >> >>> >> >> >> >>
>> >> >> >>> >> >> >> >>
>> >> >> >>> >> >> >> >> On Thu, Jul 10, 2014 at 10:19 AM, Travis Rhoden
>> >> >> >>> >> >> >> >> <trhoden at gmail.com>
>> >> >> >>> >> >> >> >> wrote:
>> >> >> >>> >> >> >> >>>
>> >> >> >>> >> >> >> >>> I can also say that after a recent upgrade to Firefly, I
>> >> >> >>> >> >> >> >>> have experienced a massive uptick in scrub errors.  The
>> >> >> >>> >> >> >> >>> cluster was on cuttlefish for about a year, and had maybe
>> >> >> >>> >> >> >> >>> one or two scrub errors.
>> >> >> >>> >> >> >> >>> After upgrading to Firefly, we've probably seen 3 to 4
>> >> >> >>> >> >> >> >>> dozen in the last month or so (was getting 2-3 a day for a
>> >> >> >>> >> >> >> >>> few weeks until the whole cluster was rescrubbed, it
>> >> >> >>> >> >> >> >>> seemed).
>> >> >> >>> >> >> >> >>>
>> >> >> >>> >> >> >> >>> What I cannot determine, however, is how to know which
>> >> >> >>> >> >> >> >>> object is busted?
>> >> >> >>> >> >> >> >>> For example, just today I ran into a scrub error.  The
>> >> >> >>> >> >> >> >>> object has two copies and is an 8MB piece of an RBD, and
>> >> >> >>> >> >> >> >>> has identical timestamps, identical xattrs names and
>> >> >> >>> >> >> >> >>> values.  But it definitely has a different MD5 sum. How to
>> >> >> >>> >> >> >> >>> know which one is correct?
>> >> >> >>> >> >> >> >>>
>> >> >> >>> >> >> >> >>> I've been just kicking off pg repair each time, which seems
>> >> >> >>> >> >> >> >>> to just use the primary copy to overwrite the others.
>> >> >> >>> >> >> >> >>> Haven't run into any issues with that so far, but it does
>> >> >> >>> >> >> >> >>> make me nervous.
>> >> >> >>> >> >> >> >>>
>> >> >> >>> >> >> >> >>>  - Travis
>> >> >> >>> >> >> >> >>>
>> >> >> >>> >> >> >> >>>
>> >> >> >>> >> >> >> >>> On Tue, Jul 8, 2014 at 1:06 AM, Gregory Farnum
>> >> >> >>> >> >> >> >>> <greg at inktank.com>
>> >> >> >>> >> >> >> >>> wrote:
>> >> >> >>> >> >> >> >>>>
>> >> >> >>> >> >> >> >>>> It's not very intuitive or easy to look at right now (there
>> >> >> >>> >> >> >> >>>> are plans from the recent developer summit to improve
>> >> >> >>> >> >> >> >>>> things), but the central log should have output about
>> >> >> >>> >> >> >> >>>> exactly what objects are busted. You'll then want to
>> >> >> >>> >> >> >> >>>> compare the copies manually to determine which ones are
>> >> >> >>> >> >> >> >>>> good or bad, get the good copy on the primary (make sure
>> >> >> >>> >> >> >> >>>> you preserve xattrs), and run repair.
>> >> >> >>> >> >> >> >>>> -Greg
>> >> >> >>> >> >> >> >>>> Software Engineer #42 @ http://inktank.com |
>> >> >> >>> >> >> >> >>>> http://ceph.com
>> >> >> >>> >> >> >> >>>>
>> >> >> >>> >> >> >> >>>>
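The manual comparison described above could be sketched roughly as follows: checksum each copy of the object file and read its xattrs so the copies can be set side by side. This is an illustrative helper with made-up function names, not a Ceph tool. It assumes the copies have already been fetched from each OSD's store to local paths, and that the xattrs are readable by the current user (os.listxattr and os.getxattr are Linux-only).

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_xattrs(path):
    """Return {name: value} for the file's xattrs, or {} if they cannot be read."""
    try:
        return {name: os.getxattr(path, name) for name in os.listxattr(path)}
    except (AttributeError, OSError):
        # AttributeError: platform without xattr support in the os module.
        # OSError: filesystem does not support xattrs, or access was denied.
        return {}


def compare_copies(paths):
    """Summarise each copy as (md5, xattrs) and report whether all copies agree."""
    summary = {path: (file_md5(path), read_xattrs(path)) for path in paths}
    values = list(summary.values())
    identical = all(value == values[0] for value in values)
    return summary, identical
```

Note that this only shows whether the copies differ; as discussed elsewhere in the thread, deciding which copy is the good one still takes knowledge of what the data should contain.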
>> >> >> >>> >> >> >> >>>> On Mon, Jul 7, 2014 at 6:48 PM, Randy Smith
>> >> >> >>> >> >> >> >>>> <rbsmith at adams.edu>
>> >> >> >>> >> >> >> >>>> wrote:
>> >> >> >>> >> >> >> >>>> > Greetings,
>> >> >> >>> >> >> >> >>>> >
>> >> >> >>> >> >> >> >>>> > I upgraded to firefly last week and I suddenly received
>> >> >> >>> >> >> >> >>>> > this error:
>> >> >> >>> >> >> >> >>>> >
>> >> >> >>> >> >> >> >>>> > health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>> >> >> >>> >> >> >> >>>> >
>> >> >> >>> >> >> >> >>>> > ceph health detail shows the following:
>> >> >> >>> >> >> >> >>>> >
>> >> >> >>> >> >> >> >>>> > HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>> >> >> >>> >> >> >> >>>> > pg 3.c6 is active+clean+inconsistent, acting [2,5]
>> >> >> >>> >> >> >> >>>> > 1 scrub errors
>> >> >> >>> >> >> >> >>>> >
>> >> >> >>> >> >> >> >>>> > The docs say that I can run `ceph pg repair 3.c6` to fix
>> >> >> >>> >> >> >> >>>> > this. What I want to know is what are the risks of data
>> >> >> >>> >> >> >> >>>> > loss if I run that command in this state and how can I
>> >> >> >>> >> >> >> >>>> > mitigate them?
>> >> >> >>> >> >> >> >>>> >
>> >> >> >>> >> >> >> >>>> > --
>> >> >> >>> >> >> >> >>>> > Randall Smith
>> >> >> >>> >> >> >> >>>> > Computing Services
>> >> >> >>> >> >> >> >>>> > Adams State University
>> >> >> >>> >> >> >> >>>> > http://www.adams.edu/
>> >> >> >>> >> >> >> >>>> > 719-587-7741
>> >> >> >>> >> >> >> >>>> >
>> >> >> >>> >> >> >> >>>> > _______________________________________________
>> >> >> >>> >> >> >> >>>> > ceph-users mailing list
>> >> >> >>> >> >> >> >>>> > ceph-users at lists.ceph.com
>> >> >> >>> >> >> >> >>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >> >>> >> >> >> >>>> >

-- 
Randall Smith
Computing Services
Adams State University
http://www.adams.edu/
719-587-7741