I just updated the bug with several questions.
-Sam

On Thu, Aug 11, 2016 at 6:56 AM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
> Sam,
>
> I very much appreciate the assistance. I have opened
> http://tracker.ceph.com/issues/16997 to track this (potential) issue.
>
> Brian
>
> On Wed, Aug 10, 2016 at 1:53 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>
>> Ok, can you
>> 1) Open a bug
>> 2) Identify all osds involved in the 5 problem pgs
>> 3) enable debug osd = 20, debug filestore = 20, debug ms = 1 on all of them
>> 4) mark the primary for each pg down (should cause peering and backfill to restart)
>> 5) link all logs to the bug
>>
>> Thanks!
>> -Sam
>>
>> On Tue, Jul 26, 2016 at 9:11 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> > Hmm, nvm, it's not an lfn object anyway.
>> > -Sam
>> >
>> > On Tue, Jul 26, 2016 at 7:07 AM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
>> >> If I search on osd.580, I find
>> >>
>> >> default.421929.15\uTEPP\s84316222-6ddd-4ac9-8283-6fa1cdcf9b88\sbackups\s20160630091353\sp1\s\sShares\sWarehouse\sLondonWarehouse\sLondon\sRon picture's\sMISCELLANEOUS\s2014\sOct., 2014\sOct. 1\sDSC04329.JPG__head_981926C1__21_ffffffffffffffff_5,
>> >> which has a non-zero size and a hash (981926C1) that matches that of the same file found on the other OSDs in the pg.
>> >>
>> >> If I'm misunderstanding what you're asking about a dangling link, please point me in the right direction.
>> >>
>> >> Brian
>> >>
>> >> On Tue, Jul 26, 2016 at 8:59 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> >>>
>> >>> Did you also confirm that the backfill target does not have any of those dangling links? I'd be looking for a dangling link for
>> >>> 981926c1/default.421929.15_TEPP/84316222-6ddd-4ac9-8283-6fa1cdcf9b88/backups/20160630091353/p1//Shares/Warehouse/LondonWarehouse/London/Ron picture's/MISCELLANEOUS/2014/Oct., 2014/Oct. 1/DSC04329.JPG/head//33
>> >>> on osd.580.
>> >>> -Sam
>> >>>
>> >>> On Mon, Jul 25, 2016 at 9:04 PM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
>> >>> > Sam,
>> >>> >
>> >>> > I cranked up the logging on the backfill target (osd 580 on node 07) and the acting primary for the pg (453 on node 08, for what it's worth). The logs from the primary are very large, so pardon the tarballs.
>> >>> >
>> >>> > PG Primary Logs:
>> >>> > https://www.dropbox.com/s/ipjobn2i5ban9km/backfill-primary-log.tgz?dl=0
>> >>> > PG Backfill Target Logs:
>> >>> > https://www.dropbox.com/s/9qpiqsnahx0qc5k/backfill-target-log.tgz?dl=0
>> >>> >
>> >>> > I'll be reviewing them with my team tomorrow morning to see if we can find anything. Thanks for your assistance.
>> >>> >
>> >>> > Brian
>> >>> >
>> >>> > On Mon, Jul 25, 2016 at 3:33 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> >>> >>
>> >>> >> The next thing I'd want is for you to reproduce with
>> >>> >>
>> >>> >> debug osd = 20
>> >>> >> debug filestore = 20
>> >>> >> debug ms = 1
>> >>> >>
>> >>> >> and post the file somewhere.
>> >>> >> -Sam
>> >>> >>
>> >>> >> On Mon, Jul 25, 2016 at 1:33 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> >>> >> > If you don't have the orphaned file link, it's not the same bug.
>> >>> >> > -Sam
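(For anyone following along: a quick way to look for the kind of duplicate or orphaned entry being discussed here is to search the pg's filestore directory on the backfill target for the object's name and hash fragment. A rough sketch, assuming the default /var/lib/ceph layout and that osd.580 holds shard 5 of pg 33.6c1:)

    # on the host carrying osd.580
    find /var/lib/ceph/osd/ceph-580/current/33.6c1s5_head -name '*DSC04329.JPG*' -ls
    find /var/lib/ceph/osd/ceph-580/current/33.6c1s5_head -name '*981926C1*' -ls
    # more than one entry for the same object (or a zero-length stray) would
    # point at the orphaned-link situation described in issue 14766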
>> >>> >> >
>> >>> >> > On Mon, Jul 25, 2016 at 12:55 PM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
>> >>> >> >> Sam,
>> >>> >> >>
>> >>> >> >> I'm reviewing that thread now, but I'm not seeing a lot of overlap with my cluster's situation. For one, I am unable to start either a repair or a deep scrub on any of the affected pgs. I've instructed all six of the pgs to scrub, deep-scrub, and repair, and the cluster has been gleefully ignoring these requests (it has been several hours since I first tried, and the logs indicate none of the pgs ever scrubbed). Second, none of my OSDs is crashing. Third, none of my pgs or objects has ever been marked inconsistent (or unfound, for that matter) -- I'm only seeing the standard mix of degraded/misplaced objects that are common during a recovery. What I'm not seeing is any further progress on the number of misplaced objects -- the number has remained effectively unchanged for the past several days.
>> >>> >> >>
>> >>> >> >> To be sure, though, I tracked down the file that the backfill operation seems to be hung on, and I can find it in both the backfill target osd (580) and a few other osds in the pg. In all cases, I was able to find the file with an identical hash value on all nodes, and I didn't find any duplicates or potential orphans. Also, none of the objects involved have long names, so they're not using the special ceph long filename handling.
>> >>> >> >>
>> >>> >> >> Also, we are not using XFS on our OSDs; we are using ZFS instead.
>> >>> >> >>
>> >>> >> >> If I'm misunderstanding the issue linked above and the corresponding thread, please let me know.
>> >>> >> >>
>> >>> >> >> Brian
>> >>> >> >>
>> >>> >> >> On Mon, Jul 25, 2016 at 1:32 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> >>> >> >>>
>> >>> >> >>> You may have hit http://tracker.ceph.com/issues/14766. There was a thread on the list a while back about diagnosing and fixing it.
>> >>> >> >>> -Sam
>> >>> >> >>>
>> >>> >> >>> On Mon, Jul 25, 2016 at 10:45 AM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
>> >>> >> >>> > Greetings,
>> >>> >> >>> >
>> >>> >> >>> > Problem: After removing (out + crush remove + auth del + osd rm) three osds on a single host, I have six pgs that, after 10 days of recovery, are stuck in a state of active+undersized+degraded+remapped+backfilling.
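(For reference, the removal sequence mentioned above -- out + crush remove + auth del + osd rm -- corresponds roughly to the commands below; osd id 123 is illustrative, standing in for each of the three osds that were pulled:)

    ceph osd out 123                 # stop placing data on the osd; triggers remapping
    ceph osd crush remove osd.123    # drop it from the CRUSH map
    ceph auth del osd.123            # remove its cephx key
    ceph osd rm 123                  # remove it from the osdmap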
>> >>> >> >>> >
>> >>> >> >>> > Cluster details:
>> >>> >> >>> > - 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04, 72 6TB SAS2 drives per host, collocated journals) -- one host now has 69 drives
>> >>> >> >>> > - Hammer 0.94.6
>> >>> >> >>> > - object storage use only
>> >>> >> >>> > - erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs)
>> >>> >> >>> > - failure domain of host
>> >>> >> >>> > - cluster is currently storing 178TB over 260 MObjects (5-6% utilization per OSD)
>> >>> >> >>> > - all 6 stuck pgs belong to .rgw.buckets
>> >>> >> >>> >
>> >>> >> >>> > The relevant section of our crushmap:
>> >>> >> >>> >
>> >>> >> >>> > rule .rgw.buckets {
>> >>> >> >>> >     ruleset 1
>> >>> >> >>> >     type erasure
>> >>> >> >>> >     min_size 7
>> >>> >> >>> >     max_size 9
>> >>> >> >>> >     step set_chooseleaf_tries 5
>> >>> >> >>> >     step set_choose_tries 250
>> >>> >> >>> >     step take default
>> >>> >> >>> >     step chooseleaf indep 0 type host
>> >>> >> >>> >     step emit
>> >>> >> >>> > }
>> >>> >> >>> >
>> >>> >> >>> > This isn't the first time we've lost a disk (not even the first time we've lost multiple disks on a host in a single event), so we're used to the extended recovery times and understand this is going to be A Thing until we can introduce SSD journals. This is, however, the first time we've had pgs not return to an active+clean state after a couple days. As far as I can tell, our cluster is no longer making progress on the backfill operations, and I'm looking for advice on how to get things moving again.
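(For reference: the remedy Sam suggests in the replies quoted above -- raise debug logging on the osds involved, then mark the acting primary down so the pg re-peers and backfill restarts -- translates roughly into the commands below. Osd ids 453 and 580 are the ones from pg 33.6c1; the injectargs flags can also be set in ceph.conf and applied with a restart, so treat this as a sketch rather than the exact procedure used.)

    ceph tell osd.453 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
    ceph tell osd.580 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
    ceph osd down 453    # the osd rejoins on its own; peering and backfill restart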
>> >>> >> >>> >
>> >>> >> >>> > Here's a dump of the stuck pgs:
>> >>> >> >>> >
>> >>> >> >>> > ceph pg dump_stuck
>> >>> >> >>> > ok
>> >>> >> >>> > pg_stat state up up_primary acting acting_primary
>> >>> >> >>> > 33.151d active+undersized+degraded+remapped+backfilling [424,546,273,167,471,631,155,38,47] 424 [424,546,273,167,471,631,155,38,2147483647] 424
>> >>> >> >>> > 33.6c1 active+undersized+degraded+remapped+backfilling [453,86,565,266,338,580,297,577,404] 453 [453,86,565,266,338,2147483647,297,577,404] 453
>> >>> >> >>> > 33.17b7 active+undersized+degraded+remapped+backfilling [399,432,437,541,547,219,229,104,47] 399 [399,432,437,541,547,219,229,104,2147483647] 399
>> >>> >> >>> > 33.150d active+undersized+degraded+remapped+backfilling [555,452,511,550,643,431,141,329,486] 555 [555,2147483647,511,550,643,431,141,329,486] 555
>> >>> >> >>> > 33.13a8 active+undersized+degraded+remapped+backfilling [507,317,276,617,565,28,471,200,382] 507 [507,2147483647,276,617,565,28,471,200,382] 507
>> >>> >> >>> > 33.4c1 active+undersized+degraded+remapped+backfilling [413,440,464,129,641,416,295,266,431] 413 [413,440,2147483647,129,641,416,295,266,431] 413
>> >>> >> >>> >
>> >>> >> >>> > Based on a review of previous postings about this issue, I initially suspected that crush couldn't map the pg to an OSD (based on MAX_INT in the acting list), so I increased set_choose_tries from 50 to 200, and then again to 250 just to see if it would do anything. These changes had no effect that I could discern.
>> >>> >> >>> >
>> >>> >> >>> > I next reviewed the output of ceph pg <pgid> query, and I see something similar to the following for each of my stuck pgs:
>> >>> >> >>> >
>> >>> >> >>> > {
>> >>> >> >>> >     "state": "active+undersized+degraded+remapped+backfilling",
>> >>> >> >>> >     "snap_trimq": "[]",
>> >>> >> >>> >     "epoch": 25211,
>> >>> >> >>> >     "up": [ 453, 86, 565, 266, 338, 580, 297, 577, 404 ],
>> >>> >> >>> >     "acting": [ 453, 86, 565, 266, 338, 2147483647, 297, 577, 404 ],
>> >>> >> >>> >     "backfill_targets": [ "580(5)" ],
>> >>> >> >>> >     "actingbackfill": [ "86(1)", "266(3)", "297(6)", "338(4)", "404(8)", "453(0)", "565(2)", "577(7)", "580(5)" ]
>> >>> >> >>> >
>> >>> >> >>> > In this case, 580 is a valid OSD on the node that lost the 3 OSDs (node 7). For the other five pgs, the situation is the same -- the backfill target is a valid OSD on node 7.
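(One way to double-check that the 2147483647 entries reflect the shard still awaiting backfill rather than a CRUSH mapping failure -- i.e. that the rule shown earlier can still map a full set of 9 hosts -- is to test the live map offline with crushtool. A sketch, assuming ruleset 1 as in the rule above:)

    ceph osd getcrushmap -o crushmap.bin          # grab the in-use compiled crush map
    crushtool -d crushmap.bin -o crushmap.txt     # decompile it for inspection
    crushtool -i crushmap.bin --test --rule 1 --num-rep 9 --show-bad-mappings
    # no output from --show-bad-mappings means every input mapped to a full set of 9 osds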
>> >>> >> >>> >
>> >>> >> >>> > If I dig further into the 'query' output, I encounter the following:
>> >>> >> >>> >
>> >>> >> >>> > "recovery_state": [
>> >>> >> >>> >     {
>> >>> >> >>> >         "name": "Started\/Primary\/Active",
>> >>> >> >>> >         "enter_time": "2016-07-24 18:52:51.653375",
>> >>> >> >>> >         "might_have_unfound": [],
>> >>> >> >>> >         "recovery_progress": {
>> >>> >> >>> >             "backfill_targets": [ "580(5)" ],
>> >>> >> >>> >             "waiting_on_backfill": [ "580(5)" ],
>> >>> >> >>> >             "last_backfill_started": "981926c1\/default.421929.15_MY_OBJECT",
>> >>> >> >>> >             "backfill_info": {
>> >>> >> >>> >                 "begin": "391926c1\/default.9468.416_0080a34a\/head\/\/33",
>> >>> >> >>> >                 "end": "464b26c1\/default.284327.111_MBS-b965c481-4320-439b-ad56-9e4212c2fe8f\/CBB_WWTXPVDHCP\/C:\/Windows\/WinSxS\/amd64_wialx00a.inf_31bf3856ad364e35_6.3.9600.17415_none_b2e446f1befcebe5\/LXAA2DeviceDescription.xml:\/20150924104532\/LXAA2DeviceDescription.xml\/head\/\/33",
>> >>> >> >>> >                 "objects": [
>> >>> >> >>> >                     {
>> >>> >> >>> >                         "object": "391926c1\/default.9468.416_0080a34a\/head\/\/33",
>> >>> >> >>> >                         "version": "5356'86333"
>> >>> >> >>> >                     },
>> >>> >> >>> >                     ...
>> >>> >> >>> >
>> >>> >> >>> > Trying to understand what was going on, I shut off client traffic to the cluster and set debug_osd 20 debug_ms 1 on osd.580. I see the following section repeated ad infinitum:
>> >>> >> >>> >
>> >>> >> >>> > === BEGIN LOG ===
>> >>> >> >>> >
>> >>> >> >>> > 2016-07-25 15:56:12.682241 7f262e8ed700 1 -- 10.54.10.27:6818/913781 <== osd.453 10.54.10.28:7010/1375782 236358 ==== pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 ==== 309+0+0 (3953350617 0 0) 0x3a11d700 con 0x3656c420
>> >>> >> >>> > 2016-07-25 15:56:12.682273 7f262e8ed700 10 osd.580 25226 handle_replica_op pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 epoch 25226
>> >>> >> >>> > 2016-07-25 15:56:12.682278 7f262e8ed700 20 osd.580 25226 should_share_map osd.453 10.54.10.28:7010/1375782 25226
>> >>> >> >>> > 2016-07-25 15:56:12.682284 7f262e8ed700 15 osd.580 25226 enqueue_op 0x3d503600 prio 127 cost 0 latency 0.000051 pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
>> >>> >> >>> > 2016-07-25 15:56:12.682325 7f26724d1700 10 osd.580 25226 dequeue_op 0x3d503600 prio 127 cost 0 latency 0.000092 pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 pg pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001)
>> >>> >> >>> > [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped]
>> >>> >> >>> > 2016-07-25 15:56:12.682353 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] handle_message: pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
>> >>> >> >>> > 2016-07-25 15:56:12.682366 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] do_scan pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
>> >>> >> >>> > 2016-07-25 15:56:12.682377 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] scan_range from 981926c1/default.421929.15_MY_OBJECT
>> >>> >> >>> > 2016-07-25 15:56:12.694086 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] got 0 items, next 981926c1/default.421929.15_MY_OBJECT
>> >>> >> >>> > 2016-07-25 15:56:12.694113 7f26724d1700 20 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001)
>> >>> >> >>> > [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] []
>> >>> >> >>> > 2016-07-25 15:56:12.694129 7f26724d1700 1 -- 10.54.10.27:6818/913781 --> 10.54.10.28:7010/1375782 -- pg_scan(digest 33.6c1s0 981926c1/default.421929.15_MY_OBJECT-981926c1/default.421929.15_MY_OBJECT e 25226/25226) v2 -- ?+4 0x3a7b7200 con 0x3656c420
>> >>> >> >>> > 2016-07-25 15:56:12.694233 7f26724d1700 10 osd.580 25226 dequeue_op 0x3d503600 finish
>> >>> >> >>> >
>> >>> >> >>> > === END LOG ===
>> >>> >> >>> >
>> >>> >> >>> > I'm in the process of digging through the OSD code to understand what's going on here, but I figured I would reach out to the community in the hopes that someone could point me in the right direction. If anyone has seen this before and can recommend a course of action, I'm all ears. And if there's any other information I can provide, please let me know what else would be helpful.
>> >>> >> >>> >
>> >>> >> >>> > Many thanks to any who can lend a hand or teach a man to fish.
>> >>> >> >>> >
>> >>> >> >>> > Brian Felton
>> >>> >> >>> >
>> >>> >> >>> > _______________________________________________
>> >>> >> >>> > ceph-users mailing list
>> >>> >> >>> > ceph-users@xxxxxxxxxxxxxx
>> >>> >> >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com