Re: Backfilling pgs not making progress

Sam,

I cranked up the logging on the backfill target (osd 580 on node 07) and the acting primary for the pg (453 on node 08, for what it's worth).  The logs from the primary are very large, so pardon the tarballs.

PG Primary Logs: https://www.dropbox.com/s/ipjobn2i5ban9km/backfill-primary-log.tgz?dl=0B
PG Backfill Target Logs: https://www.dropbox.com/s/9qpiqsnahx0qc5k/backfill-target-log.tgz?dl=0

I'll be reviewing them with my team tomorrow morning to see if we can find anything.  Thanks for your assistance.
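
For reference, a rough sketch of how the levels were raised at runtime (the exact invocations I used may have differed slightly, so treat this as illustrative):

# bump debug on the backfill target and the acting primary without restarting
ceph tell osd.580 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
ceph tell osd.453 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'

# and back toward the defaults once the logs were captured
ceph tell osd.580 injectargs '--debug-osd 0/5 --debug-filestore 1/3 --debug-ms 0/5'
ceph tell osd.453 injectargs '--debug-osd 0/5 --debug-filestore 1/3 --debug-ms 0/5'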

Brian

On Mon, Jul 25, 2016 at 3:33 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
The next thing I'd want is for you to reproduce with

debug osd = 20
debug filestore = 20
debug ms = 1

and post the file somewhere.
-Sam

On Mon, Jul 25, 2016 at 1:33 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> If you don't have the orphaned file link, it's not the same bug.
> -Sam
>
> On Mon, Jul 25, 2016 at 12:55 PM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
>> Sam,
>>
>> I'm reviewing that thread now, but I'm not seeing a lot of overlap with my
>> cluster's situation.  For one, I am unable to start either a repair or a
>> deep scrub on any of the affected pgs.  I've instructed all six of the pgs
>> to scrub, deep-scrub, and repair, and the cluster has been gleefully
>> ignoring these requests (it has been several hours since I first tried, and
>> the logs indicate none of the pgs ever scrubbed).  Second, none of my
>> OSDs is crashing.  Third, none of my pgs or objects has ever been marked
>> inconsistent (or unfound, for that matter) -- I'm only seeing the standard
>> mix of degraded/misplaced objects that are common during a recovery.  What
>> I'm not seeing is any further progress on the number of misplaced objects --
>> the number has remained effectively unchanged for the past several days.
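>>
>> (For what it's worth, the scrub/repair attempts were issued per pg, roughly
>> like this -- a sketch, so the exact form may have differed:
>>
>> ceph pg scrub 33.6c1
>> ceph pg deep-scrub 33.6c1
>> ceph pg repair 33.6c1
>>
>> and the same for the other five pg ids listed in dump_stuck below.)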
>>
>> To be sure, though, I tracked down the file that the backfill operation
>> seems to be hung on, and I can find it in both the backfill target osd (580)
>> and a few other osds in the pg.  In all cases, I was able to find the file
>> with an identical hash value on all nodes, and I didn't find any duplicates
>> or potential orphans.  Also, none of the objects involved has long names,
>> so they're not using the special ceph long filename handling.
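>>
>> In case it helps, this is roughly how I hunted for the object on disk
>> (a sketch assuming the default filestore mount points under
>> /var/lib/ceph/osd -- adjust for your layout):
>>
>> # on the backfill target's host, search the pg shard's directory
>> find /var/lib/ceph/osd/ceph-580/current/33.6c1s5_head \
>>     -name '*default.421929.15*' -ls
>>
>> # then repeat on the hosts holding the other shards of 33.6c1 and compare
>> # the hash portion of the resulting filenames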
>>
>> Also, we are not using XFS on our OSDs; we are using ZFS instead.
>>
>> If I'm misunderstanding the issue linked above and the corresponding thread,
>> please let me know.
>>
>> Brian
>>
>>
>> On Mon, Jul 25, 2016 at 1:32 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>
>>> You may have hit http://tracker.ceph.com/issues/14766.  There was a
>>> thread on the list a while back about diagnosing and fixing it.
>>> -Sam
>>>
>>> On Mon, Jul 25, 2016 at 10:45 AM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
>>> > Greetings,
>>> >
>>> > Problem: After removing (out + crush remove + auth del + osd rm) three
>>> > osds
>>> > on a single host, I have six pgs that, after 10 days of recovery, are
>>> > stuck
>>> > in a state of active+undersized+degraded+remapped+backfilling.
>>> >
>>> > Cluster details:
>>> >  - 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04, 72 6TB SAS2 drives per
>>> > host,
>>> > collocated journals) -- one host now has 69 drives
>>> >  - Hammer 0.94.6
>>> >  - object storage use only
>>> >  - erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs)
>>> >  - failure domain of host
>>> >  - cluster is currently storing 178TB over 260 MObjects (5-6%
>>> > utilization
>>> > per OSD)
>>> >  - all 6 stuck pgs belong to .rgw.buckets
>>> >
>>> > The relevant section of our crushmap:
>>> >
>>> > rule .rgw.buckets {
>>> >         ruleset 1
>>> >         type erasure
>>> >         min_size 7
>>> >         max_size 9
>>> >         step set_chooseleaf_tries 5
>>> >         step set_choose_tries 250
>>> >         step take default
>>> >         step chooseleaf indep 0 type host
>>> >         step emit
>>> > }
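>>> >
>>> > (As a sanity check that this rule can still map nine independent hosts
>>> > with one host down to 69 OSDs, something like the following can be run
>>> > against a compiled copy of the map -- a sketch, with the rule number
>>> > taken from the ruleset above:
>>> >
>>> > ceph osd getcrushmap -o crush.bin
>>> > crushtool -i crush.bin --test --rule 1 --num-rep 9 --show-bad-mappings
>>> >
>>> > Inputs the rule cannot fully place are reported as bad mappings.)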
>>> >
>>> > This isn't the first time we've lost a disk (not even the first time
>>> > we've
>>> > lost multiple disks on a host in a single event), so we're used to the
>>> > extended recovery times and understand this is going to be A Thing until
>>> > we
>>> > can introduce SSD journals.  This is, however, the first time we've had
>>> > pgs
>>> > not return to an active+clean state after a couple days.  As far as I
>>> > can
>>> > tell, our cluster is no longer making progress on the backfill
>>> > operations,
>>> > and I'm looking for advice on how to get things moving again.
>>> >
>>> > Here's a dump of the stuck pgs:
>>> >
>>> > ceph pg dump_stuck
>>> > ok
>>> > pg_stat state   up      up_primary      acting  acting_primary
>>> > 33.151d active+undersized+degraded+remapped+backfilling
>>> > [424,546,273,167,471,631,155,38,47]     424
>>> > [424,546,273,167,471,631,155,38,2147483647]     424
>>> > 33.6c1  active+undersized+degraded+remapped+backfilling
>>> > [453,86,565,266,338,580,297,577,404]    453
>>> > [453,86,565,266,338,2147483647,297,577,404]     453
>>> > 33.17b7 active+undersized+degraded+remapped+backfilling
>>> > [399,432,437,541,547,219,229,104,47]    399
>>> > [399,432,437,541,547,219,229,104,2147483647]    399
>>> > 33.150d active+undersized+degraded+remapped+backfilling
>>> > [555,452,511,550,643,431,141,329,486]   555
>>> > [555,2147483647,511,550,643,431,141,329,486]    555
>>> > 33.13a8 active+undersized+degraded+remapped+backfilling
>>> > [507,317,276,617,565,28,471,200,382]    507
>>> > [507,2147483647,276,617,565,28,471,200,382]     507
>>> > 33.4c1  active+undersized+degraded+remapped+backfilling
>>> > [413,440,464,129,641,416,295,266,431]   413
>>> > [413,440,2147483647,129,641,416,295,266,431]    413
>>> >
>>> > Based on a review of previous postings about this issue, I initially
>>> > suspected that crush couldn't map the pg to an OSD (based on MAX_INT in
>>> > the
>>> > acting list), so I increased set_choose_tries from 50 to 200, and then
>>> > again
>>> > to 250 just to see if it would do anything.  These changes had no effect
>>> > that I could discern.
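>>> >
>>> > For reference, the set_choose_tries change was made the usual way
>>> > (decompile, edit, recompile, inject) -- roughly as follows, though I'm
>>> > reconstructing the exact commands from memory:
>>> >
>>> > ceph osd getcrushmap -o crush.bin
>>> > crushtool -d crush.bin -o crush.txt
>>> > # edit "step set_choose_tries ..." under rule .rgw.buckets, then:
>>> > crushtool -c crush.txt -o crush.new
>>> > ceph osd setcrushmap -i crush.new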
>>> >
>>> > I next reviewed the output of ceph pg <pgid> query, and I see something
>>> > similar to the following for each of my stuck pgs:
>>> >
>>> > {
>>> >     "state": "active+undersized+degraded+remapped+backfilling",
>>> >     "snap_trimq": "[]",
>>> >     "epoch": 25211,
>>> >     "up": [
>>> >         453,
>>> >         86,
>>> >         565,
>>> >         266,
>>> >         338,
>>> >         580,
>>> >         297,
>>> >         577,
>>> >         404
>>> >     ],
>>> >     "acting": [
>>> >         453,
>>> >         86,
>>> >         565,
>>> >         266,
>>> >         338,
>>> >         2147483647,
>>> >         297,
>>> >         577,
>>> >         404
>>> >     ],
>>> >     "backfill_targets": [
>>> >         "580(5)"
>>> >     ],
>>> >     "actingbackfill": [
>>> >         "86(1)",
>>> >         "266(3)",
>>> >         "297(6)",
>>> >         "338(4)",
>>> >         "404(8)",
>>> >         "453(0)",
>>> >         "565(2)",
>>> >         "577(7)",
>>> >         "580(5)"
>>> >     ]
>>> >
>>> > In this case, 580 is a valid OSD on the node that lost the 3 OSDs (node
>>> > 7).
>>> > For the other five pgs, the situation is the same -- the backfill target
>>> > is
>>> > a valid OSD on node 7.
>>> >
>>> > If I dig further into the 'query' output, I encounter the following:
>>> >
>>> >     "recovery_state": [
>>> >         {
>>> >             "name": "Started\/Primary\/Active",
>>> >             "enter_time": "2016-07-24 18:52:51.653375",
>>> >             "might_have_unfound": [],
>>> >             "recovery_progress": {
>>> >                 "backfill_targets": [
>>> >                     "580(5)"
>>> >                 ],
>>> >                 "waiting_on_backfill": [
>>> >                     "580(5)"
>>> >                 ],
>>> >                 "last_backfill_started":
>>> > "981926c1\/default.421929.15_MY_OBJECT",
>>> >                 "backfill_info": {
>>> >                     "begin":
>>> > "391926c1\/default.9468.416_0080a34a\/head\/\/33",
>>> >                     "end":
>>> >
>>> > "464b26c1\/default.284327.111_MBS-b965c481-4320-439b-ad56-9e4212c2fe8f\/CBB_WWTXPVDHCP\/C:\/Windows\/WinSxS\/amd64_wialx00a.inf_31bf3856ad364e35_6.3.9600.17415_none_b2e446f1befcebe5\/LXAA2DeviceDescription.xml:\/20150924104532\/LXAA2DeviceDescription.xml\/head\/\/33",
>>> >                     "objects": [
>>> >                         {
>>> >                             "object":
>>> > "391926c1\/default.9468.416_0080a34a\/head\/\/33",
>>> >                             "version": "5356'86333"
>>> >                         },
>>> > ...
>>> >
>>> > Trying to understand what was going on, I shut off client traffic to the
>>> > cluster and set debug_osd 20 debug_ms 1 on osd.580.  I see the following
>>> > section repeated ad infinitum:
>>> >
>>> > === BEGIN LOG ===
>>> >
>>> > 2016-07-25 15:56:12.682241 7f262e8ed700  1 -- 10.54.10.27:6818/913781
>>> > <==
>>> > osd.453 10.54.10.28:7010/1375782 236358 ==== pg_scan(get_digest 33.6c1s5
>>> > 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 ====
>>> > 309+0+0
>>> > (3953350617 0 0) 0x3a11d700 con 0x3656c420
>>> > 2016-07-25 15:56:12.682273 7f262e8ed700 10 osd.580 25226
>>> > handle_replica_op
>>> > pg_scan(get_digest 33.6c1s5
>>> > 981926c1/default.421929.15_MY_OBJECT-0//0//33 e
>>> > 25226/25226) v2 epoch 25226
>>> > 2016-07-25 15:56:12.682278 7f262e8ed700 20 osd.580 25226
>>> > should_share_map
>>> > osd.453 10.54.10.28:7010/1375782 25226
>>> > 2016-07-25 15:56:12.682284 7f262e8ed700 15 osd.580 25226 enqueue_op
>>> > 0x3d503600 prio 127 cost 0 latency 0.000051 pg_scan(get_digest 33.6c1s5
>>> > 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
>>> > 2016-07-25 15:56:12.682325 7f26724d1700 10 osd.580 25226 dequeue_op
>>> > 0x3d503600 prio 127 cost 0 latency 0.000092 pg_scan(get_digest 33.6c1s5
>>> > 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 pg
>>> > pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb
>>> > 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c
>>> > 5635/25223 25224/25225/25001)
>>> >
>>> > [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]
>>> > r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584
>>> > active+remapped]
>>> > 2016-07-25 15:56:12.682353 7f26724d1700 10 osd.580 pg_epoch: 25226
>>> > pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb
>>> > 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c
>>> > 5635/25223 25224/25225/25001)
>>> >
>>> > [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]
>>> > r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584
>>> > active+remapped]
>>> > handle_message: pg_scan(get_digest 33.6c1s5
>>> > 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
>>> > 2016-07-25 15:56:12.682366 7f26724d1700 10 osd.580 pg_epoch: 25226
>>> > pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb
>>> > 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c
>>> > 5635/25223 25224/25225/25001)
>>> >
>>> > [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]
>>> > r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584
>>> > active+remapped]
>>> > do_scan pg_scan(get_digest 33.6c1s5
>>> > 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
>>> > 2016-07-25 15:56:12.682377 7f26724d1700 10 osd.580 pg_epoch: 25226
>>> > pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb
>>> > 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c
>>> > 5635/25223 25224/25225/25001)
>>> >
>>> > [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]
>>> > r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584
>>> > active+remapped]
>>> > scan_range from 981926c1/default.421929.15_MY_OBJECT
>>> > 2016-07-25 15:56:12.694086 7f26724d1700 10 osd.580 pg_epoch: 25226
>>> > pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb
>>> > 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c
>>> > 5635/25223 25224/25225/25001)
>>> >
>>> > [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]
>>> > r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584
>>> > active+remapped]
>>> > got 0 items, next 981926c1/default.421929.15_MY_OBJECT
>>> > 2016-07-25 15:56:12.694113 7f26724d1700 20 osd.580 pg_epoch: 25226
>>> > pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb
>>> > 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c
>>> > 5635/25223 25224/25225/25001)
>>> >
>>> > [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]
>>> > r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584
>>> > active+remapped]
>>> > []
>>> > 2016-07-25 15:56:12.694129 7f26724d1700  1 -- 10.54.10.27:6818/913781
>>> > -->
>>> > 10.54.10.28:7010/1375782 -- pg_scan(digest 33.6c1s0
>>> >
>>> > 981926c1/default.421929.15_MY_OBJECT-981926c1/default.421929.15_MY_OBJECT e
>>> > 25226/25226) v2 -- ?+4 0x3a7b7200 con 0x3656c420
>>> > 2016-07-25 15:56:12.694233 7f26724d1700 10 osd.580 25226 dequeue_op
>>> > 0x3d503600 finish
>>> >
>>> > === END LOG ===
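>>> >
>>> > (The full logs are very noisy; I pulled the repeating section above out
>>> > with something along these lines -- adjust the log path for your layout:
>>> >
>>> > grep -E 'pg_scan|scan_range|got 0 items' /var/log/ceph/ceph-osd.580.log | less
>>> >
>>> > Every pass shows the same "got 0 items, next ..." for the same object.)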
>>> >
>>> > I'm in the process of digging through the OSD code to understand what's
>>> > going on here, but I figured I would reach out to the community in the
>>> > hopes
>>> > that someone could point me in the right direction.  If anyone has seen
>>> > this
>>> > before and can recommend a course of action, I'm all ears.  And if
>>> > there's
>>> > any other information I can provide, please let me know what else would
>>> > be
>>> > helpful.
>>> >
>>> > Many thanks to any who can lend a hand or teach a man to fish.
>>> >
>>> > Brian Felton
>>> >
>>> > _______________________________________________
>>> > ceph-users mailing list
>>> > ceph-users@xxxxxxxxxxxxxx
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>
>>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
