The next thing I'd want is for you to reproduce with debug osd = 20, debug filestore = 20, and debug ms = 1, and post the log file somewhere.

-Sam
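
(For anyone reproducing this: a sketch of one way to get those levels onto a running OSD without a restart is injectargs -- osd.580 is assumed here as the daemon of interest, and the output then lands in the usual per-OSD log, typically /var/log/ceph/ceph-osd.580.log:

    ceph tell osd.580 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'

Alternatively, set the same options under [osd] in ceph.conf and restart the daemon.)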

On Mon, Jul 25, 2016 at 1:33 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> If you don't have the orphaned file link, it's not the same bug.
> -Sam
>
> On Mon, Jul 25, 2016 at 12:55 PM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
>> Sam,
>>
>> I'm reviewing that thread now, but I'm not seeing a lot of overlap with my cluster's situation. For one, I am unable to start either a repair or a deep scrub on any of the affected pgs. I've instructed all six of the pgs to scrub, deep-scrub, and repair, and the cluster has been gleefully ignoring these requests (it has been several hours since I first tried, and the logs indicate none of the pgs ever scrubbed). Second, none of my OSDs is crashing. Third, none of my pgs or objects has ever been marked inconsistent (or unfound, for that matter) -- I'm only seeing the standard mix of degraded/misplaced objects that are common during a recovery. What I'm not seeing is any further progress on the number of misplaced objects -- the number has remained effectively unchanged for the past several days.
>>
>> To be sure, though, I tracked down the file that the backfill operation seems to be hung on, and I can find it on the backfill target OSD (580) as well as on a few of the other OSDs in the pg. In all cases, I was able to find the file with an identical hash value on all nodes, and I didn't find any duplicates or potential orphans. Also, none of the objects involved has a long name, so they're not using the special ceph long filename handling.
>>
>> Also, we are not using XFS on our OSDs; we are using ZFS instead.
>>
>> If I'm misunderstanding the issue linked above and the corresponding thread, please let me know.
>>
>> Brian
>>
>> On Mon, Jul 25, 2016 at 1:32 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>
>>> You may have hit http://tracker.ceph.com/issues/14766. There was a thread on the list a while back about diagnosing and fixing it.
>>> -Sam
>>>
>>> On Mon, Jul 25, 2016 at 10:45 AM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
>>> > Greetings,
>>> >
>>> > Problem: After removing (out + crush remove + auth del + osd rm) three osds on a single host, I have six pgs that, after 10 days of recovery, are stuck in a state of active+undersized+degraded+remapped+backfilling.
>>> >
>>> > Cluster details:
>>> > - 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04, 72 6TB SAS2 drives per host, collocated journals) -- one host now has 69 drives
>>> > - Hammer 0.94.6
>>> > - object storage use only
>>> > - erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs)
>>> > - failure domain of host
>>> > - cluster is currently storing 178TB over 260 MObjects (5-6% utilization per OSD)
>>> > - all 6 stuck pgs belong to .rgw.buckets
>>> >
>>> > The relevant section of our crushmap:
>>> >
>>> > rule .rgw.buckets {
>>> >         ruleset 1
>>> >         type erasure
>>> >         min_size 7
>>> >         max_size 9
>>> >         step set_chooseleaf_tries 5
>>> >         step set_choose_tries 250
>>> >         step take default
>>> >         step chooseleaf indep 0 type host
>>> >         step emit
>>> > }
>>> >
>>> > This isn't the first time we've lost a disk (not even the first time we've lost multiple disks on a host in a single event), so we're used to the extended recovery times and understand this is going to be A Thing until we can introduce SSD journals. This is, however, the first time we've had pgs not return to an active+clean state after a couple of days. As far as I can tell, our cluster is no longer making progress on the backfill operations, and I'm looking for advice on how to get things moving again.
>>> >
>>> > Here's a dump of the stuck pgs:
>>> >
>>> > ceph pg dump_stuck
>>> > ok
>>> > pg_stat  state  up  up_primary  acting  acting_primary
>>> > 33.151d  active+undersized+degraded+remapped+backfilling  [424,546,273,167,471,631,155,38,47]  424  [424,546,273,167,471,631,155,38,2147483647]  424
>>> > 33.6c1  active+undersized+degraded+remapped+backfilling  [453,86,565,266,338,580,297,577,404]  453  [453,86,565,266,338,2147483647,297,577,404]  453
>>> > 33.17b7  active+undersized+degraded+remapped+backfilling  [399,432,437,541,547,219,229,104,47]  399  [399,432,437,541,547,219,229,104,2147483647]  399
>>> > 33.150d  active+undersized+degraded+remapped+backfilling  [555,452,511,550,643,431,141,329,486]  555  [555,2147483647,511,550,643,431,141,329,486]  555
>>> > 33.13a8  active+undersized+degraded+remapped+backfilling  [507,317,276,617,565,28,471,200,382]  507  [507,2147483647,276,617,565,28,471,200,382]  507
>>> > 33.4c1  active+undersized+degraded+remapped+backfilling  [413,440,464,129,641,416,295,266,431]  413  [413,440,2147483647,129,641,416,295,266,431]  413
>>> >
>>> > Based on a review of previous postings about this issue, I initially suspected that crush couldn't map the pg to an OSD (based on MAX_INT in the acting list), so I increased set_choose_tries from 50 to 200, and then again to 250 just to see if it would do anything. These changes had no effect that I could discern.
>>> >
>>> > I next reviewed the output of ceph pg <pgid> query, and I see something similar to the following for each of my stuck pgs:
>>> >
>>> > {
>>> >     "state": "active+undersized+degraded+remapped+backfilling",
>>> >     "snap_trimq": "[]",
>>> >     "epoch": 25211,
>>> >     "up": [
>>> >         453,
>>> >         86,
>>> >         565,
>>> >         266,
>>> >         338,
>>> >         580,
>>> >         297,
>>> >         577,
>>> >         404
>>> >     ],
>>> >     "acting": [
>>> >         453,
>>> >         86,
>>> >         565,
>>> >         266,
>>> >         338,
>>> >         2147483647,
>>> >         297,
>>> >         577,
>>> >         404
>>> >     ],
>>> >     "backfill_targets": [
>>> >         "580(5)"
>>> >     ],
>>> >     "actingbackfill": [
>>> >         "86(1)",
>>> >         "266(3)",
>>> >         "297(6)",
>>> >         "338(4)",
>>> >         "404(8)",
>>> >         "453(0)",
>>> >         "565(2)",
>>> >         "577(7)",
>>> >         "580(5)"
>>> >     ]
>>> >
>>> > In this case, 580 is a valid OSD on the node that lost the 3 OSDs (node 7). For the other five pgs, the situation is the same -- the backfill target is a valid OSD on node 7.
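>>> >
>>> > (One offline sanity check of the "CRUSH can't map it" theory -- sketched here with the ruleset number and pool width taken from the rule above, so treat those as assumptions -- is to test the compiled crushmap and see whether CRUSH can actually fill all nine slots:
>>> >
>>> >     ceph osd getcrushmap -o /tmp/crushmap
>>> >     crushtool -i /tmp/crushmap --test --rule 1 --num-rep 9 --show-bad-mappings
>>> >
>>> > Any "bad mapping" lines in the output indicate inputs for which CRUSH gave up before finding nine distinct hosts.)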
>>> >
>>> > If I dig further into the 'query' output, I encounter the following:
>>> >
>>> > "recovery_state": [
>>> >     {
>>> >         "name": "Started\/Primary\/Active",
>>> >         "enter_time": "2016-07-24 18:52:51.653375",
>>> >         "might_have_unfound": [],
>>> >         "recovery_progress": {
>>> >             "backfill_targets": [
>>> >                 "580(5)"
>>> >             ],
>>> >             "waiting_on_backfill": [
>>> >                 "580(5)"
>>> >             ],
>>> >             "last_backfill_started": "981926c1\/default.421929.15_MY_OBJECT",
>>> >             "backfill_info": {
>>> >                 "begin": "391926c1\/default.9468.416_0080a34a\/head\/\/33",
>>> >                 "end": "464b26c1\/default.284327.111_MBS-b965c481-4320-439b-ad56-9e4212c2fe8f\/CBB_WWTXPVDHCP\/C:\/Windows\/WinSxS\/amd64_wialx00a.inf_31bf3856ad364e35_6.3.9600.17415_none_b2e446f1befcebe5\/LXAA2DeviceDescription.xml:\/20150924104532\/LXAA2DeviceDescription.xml\/head\/\/33",
>>> >                 "objects": [
>>> >                     {
>>> >                         "object": "391926c1\/default.9468.416_0080a34a\/head\/\/33",
>>> >                         "version": "5356'86333"
>>> >                     },
>>> >                     ...
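>>> >
>>> > (Side note, in case it saves someone a scroll: the backfill fields above can be pulled straight out of the query output -- a sketch only, assuming jq is available and using 33.6c1, the pg from this excerpt:
>>> >
>>> >     ceph pg 33.6c1 query | jq '.recovery_state[] | select(.name == "Started/Primary/Active") | .recovery_progress'
>>> >
>>> > This just narrows the output to the backfill_targets / waiting_on_backfill / last_backfill_started block shown above.)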
>>> >
>>> > Trying to understand what was going on, I shut off client traffic to the cluster and set debug_osd 20 debug_ms 1 on osd.580. I see the following section repeated ad infinitum:
>>> >
>>> > === BEGIN LOG ===
>>> >
>>> > 2016-07-25 15:56:12.682241 7f262e8ed700  1 -- 10.54.10.27:6818/913781 <== osd.453 10.54.10.28:7010/1375782 236358 ==== pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 ==== 309+0+0 (3953350617 0 0) 0x3a11d700 con 0x3656c420
>>> > 2016-07-25 15:56:12.682273 7f262e8ed700 10 osd.580 25226 handle_replica_op pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 epoch 25226
>>> > 2016-07-25 15:56:12.682278 7f262e8ed700 20 osd.580 25226 should_share_map osd.453 10.54.10.28:7010/1375782 25226
>>> > 2016-07-25 15:56:12.682284 7f262e8ed700 15 osd.580 25226 enqueue_op 0x3d503600 prio 127 cost 0 latency 0.000051 pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
>>> > 2016-07-25 15:56:12.682325 7f26724d1700 10 osd.580 25226 dequeue_op 0x3d503600 prio 127 cost 0 latency 0.000092 pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 pg pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped]
>>> > 2016-07-25 15:56:12.682353 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] handle_message: pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
>>> > 2016-07-25 15:56:12.682366 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] do_scan pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
>>> > 2016-07-25 15:56:12.682377 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] scan_range from 981926c1/default.421929.15_MY_OBJECT
>>> > 2016-07-25 15:56:12.694086 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] got 0 items, next 981926c1/default.421929.15_MY_OBJECT
>>> > 2016-07-25 15:56:12.694113 7f26724d1700 20 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] []
>>> > 2016-07-25 15:56:12.694129 7f26724d1700  1 -- 10.54.10.27:6818/913781 --> 10.54.10.28:7010/1375782 -- pg_scan(digest 33.6c1s0 981926c1/default.421929.15_MY_OBJECT-981926c1/default.421929.15_MY_OBJECT e 25226/25226) v2 -- ?+4 0x3a7b7200 con 0x3656c420
>>> > 2016-07-25 15:56:12.694233 7f26724d1700 10 osd.580 25226 dequeue_op 0x3d503600 finish
>>> >
>>> > === END LOG ===
>>> >
>>> > I'm in the process of digging through the OSD code to understand what's going on here, but I figured I would reach out to the community in the hopes that someone could point me in the right direction. If anyone has seen this before and can recommend a course of action, I'm all ears. And if there's any other information I can provide, please let me know what else would be helpful.
>>> >
>>> > Many thanks to any who can lend a hand or teach a man to fish.
>>> >
>>> > Brian Felton
>>> >
>>> > _______________________________________________
>>> > ceph-users mailing list
>>> > ceph-users@xxxxxxxxxxxxxx
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com