Re: Backfilling pgs not making progress

Sam,

I'm reviewing that thread now, but I'm not seeing a lot of overlap with my cluster's situation.  For one, I am unable to get a scrub, deep scrub, or repair started on any of the affected pgs.  I've instructed all six of the pgs to scrub, deep-scrub, and repair, and the cluster has been gleefully ignoring these requests (it has been several hours since I first tried, and the logs indicate none of the pgs ever scrubbed).  Second, none of my OSDs is crashing.  Third, none of my pgs or objects has ever been marked inconsistent (or unfound, for that matter) -- I'm only seeing the standard mix of degraded/misplaced objects that is common during a recovery.  What I'm not seeing is any further progress on the number of misplaced objects -- the count has remained effectively unchanged for the past several days.
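
For reference, I issued something like the following for each of the six pgs (33.6c1 shown here as an example):

  ceph pg scrub 33.6c1
  ceph pg deep-scrub 33.6c1
  ceph pg repair 33.6c1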

To be sure, though, I tracked down the file that the backfill operation seems to be hung on, and I can find it on both the backfill target osd (580) and a few of the other osds in the pg.  On every node the file has an identical hash value, and I didn't find any duplicates or potential orphans.  Also, none of the objects involved has a long name, so they're not using the special ceph long filename handling.
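
In case it helps, this is roughly how I located and compared the file on each node (paths assume the standard /var/lib/ceph/osd mount layout, and the name pattern is deliberately loose because filestore escapes some characters, e.g. '_', in its on-disk file names):

  # on the host carrying shard 5 of pg 33.6c1 (osd.580 here)
  find /var/lib/ceph/osd/ceph-580/current/33.6c1s5_head -name '*421929.15*'
  # then compare checksums of the matching file across the osds in the pg
  md5sum <path reported by find>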

Also, we are not using XFS on our OSDs; we are using ZFS instead.

If I'm misunderstanding the issue linked above and the corresponding thread, please let me know.

Brian


On Mon, Jul 25, 2016 at 1:32 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
You may have hit http://tracker.ceph.com/issues/14766.  There was a
thread on the list a while back about diagnosing and fixing it.
-Sam

On Mon, Jul 25, 2016 at 10:45 AM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
> Greetings,
>
> Problem: After removing (out + crush remove + auth del + osd rm) three osds
> on a single host, I have six pgs that, after 10 days of recovery, are stuck
> in a state of active+undersized+degraded+remapped+backfilling.
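>
> For each of the three osds, the removal was essentially the standard
> sequence, roughly (osd id shown as a placeholder):
>
>   ceph osd out <id>
>   ceph osd crush remove osd.<id>
>   ceph auth del osd.<id>
>   ceph osd rm <id>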
>
> Cluster details:
>  - 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04, 72 6TB SAS2 drives per host,
> collocated journals) -- one host now has 69 drives
>  - Hammer 0.94.6
>  - object storage use only
>  - erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs)
>  - failure domain of host
>  - cluster is currently storing 178TB over 260 MObjects (5-6% utilization
> per OSD)
>  - all 6 stuck pgs belong to .rgw.buckets
>
> The relevant section of our crushmap:
>
> rule .rgw.buckets {
>         ruleset 1
>         type erasure
>         min_size 7
>         max_size 9
>         step set_chooseleaf_tries 5
>         step set_choose_tries 250
>         step take default
>         step chooseleaf indep 0 type host
>         step emit
> }
>
> This isn't the first time we've lost a disk (not even the first time we've
> lost multiple disks on a host in a single event), so we're used to the
> extended recovery times and understand this is going to be A Thing until we
> can introduce SSD journals.  This is, however, the first time we've had pgs
> not return to an active+clean state after a couple days.  As far as I can
> tell, our cluster is no longer making progress on the backfill operations,
> and I'm looking for advice on how to get things moving again.
>
> Here's a dump of the stuck pgs:
>
> ceph pg dump_stuck
> ok
> pg_stat state   up      up_primary      acting  acting_primary
> 33.151d active+undersized+degraded+remapped+backfilling
> [424,546,273,167,471,631,155,38,47]     424
> [424,546,273,167,471,631,155,38,2147483647]     424
> 33.6c1  active+undersized+degraded+remapped+backfilling
> [453,86,565,266,338,580,297,577,404]    453
> [453,86,565,266,338,2147483647,297,577,404]     453
> 33.17b7 active+undersized+degraded+remapped+backfilling
> [399,432,437,541,547,219,229,104,47]    399
> [399,432,437,541,547,219,229,104,2147483647]    399
> 33.150d active+undersized+degraded+remapped+backfilling
> [555,452,511,550,643,431,141,329,486]   555
> [555,2147483647,511,550,643,431,141,329,486]    555
> 33.13a8 active+undersized+degraded+remapped+backfilling
> [507,317,276,617,565,28,471,200,382]    507
> [507,2147483647,276,617,565,28,471,200,382]     507
> 33.4c1  active+undersized+degraded+remapped+backfilling
> [413,440,464,129,641,416,295,266,431]   413
> [413,440,2147483647,129,641,416,295,266,431]    413
>
> Based on a review of previous postings about this issue, I initially
> suspected that CRUSH couldn't map the pg to an OSD (hence the MAX_INT
> placeholder, 2147483647, in the acting list), so I increased
> set_choose_tries from 50 to 200, and then again to 250 just to see if it
> would do anything.  These changes had no effect that I could discern.
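>
> For completeness, the tunable was changed by round-tripping the crushmap,
> roughly as follows (filenames are just illustrative):
>
>   ceph osd getcrushmap -o crushmap.bin
>   crushtool -d crushmap.bin -o crushmap.txt
>   # edit "step set_choose_tries" in the .rgw.buckets rule, then recompile
>   crushtool -c crushmap.txt -o crushmap.new
>   ceph osd setcrushmap -i crushmap.new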
>
> I next reviewed the output of ceph pg <pgid> query, and I see something
> similar to the following for each of my stuck pgs:
>
> {
>     "state": "active+undersized+degraded+remapped+backfilling",
>     "snap_trimq": "[]",
>     "epoch": 25211,
>     "up": [
>         453,
>         86,
>         565,
>         266,
>         338,
>         580,
>         297,
>         577,
>         404
>     ],
>     "acting": [
>         453,
>         86,
>         565,
>         266,
>         338,
>         2147483647,
>         297,
>         577,
>         404
>     ],
>     "backfill_targets": [
>         "580(5)"
>     ],
>     "actingbackfill": [
>         "86(1)",
>         "266(3)",
>         "297(6)",
>         "338(4)",
>         "404(8)",
>         "453(0)",
>         "565(2)",
>         "577(7)",
>         "580(5)"
>     ]
>
> In this case, 580 is a valid OSD on the node that lost the 3 OSDs (node 7).
> For the other five pgs, the situation is the same -- the backfill target is
> a valid OSD on node 7.
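>
> To double-check that, I looked at the crush location directly, e.g.:
>
>   ceph osd find 580    # reports the crush location, including the host
>   ceph osd tree        # osd.580 shows up under the node-7 host bucket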
>
> If I dig further into the 'query' output, I encounter the following:
>
>     "recovery_state": [
>         {
>             "name": "Started\/Primary\/Active",
>             "enter_time": "2016-07-24 18:52:51.653375",
>             "might_have_unfound": [],
>             "recovery_progress": {
>                 "backfill_targets": [
>                     "580(5)"
>                 ],
>                 "waiting_on_backfill": [
>                     "580(5)"
>                 ],
>                 "last_backfill_started":
> "981926c1\/default.421929.15_MY_OBJECT",
>                 "backfill_info": {
>                     "begin":
> "391926c1\/default.9468.416_0080a34a\/head\/\/33",
>                     "end":
> "464b26c1\/default.284327.111_MBS-b965c481-4320-439b-ad56-9e4212c2fe8f\/CBB_WWTXPVDHCP\/C:\/Windows\/WinSxS\/amd64_wialx00a.inf_31bf3856ad364e35_6.3.9600.17415_none_b2e446f1befcebe5\/LXAA2DeviceDescription.xml:\/20150924104532\/LXAA2DeviceDescription.xml\/head\/\/33",
>                     "objects": [
>                         {
>                             "object":
> "391926c1\/default.9468.416_0080a34a\/head\/\/33",
>                             "version": "5356'86333"
>                         },
> ...
>
> Trying to understand what was going on, I shut off client traffic to the
> cluster and raised debug_osd to 20 and debug_ms to 1 on osd.580.
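>
> The debug levels were bumped with injectargs, something along these lines
> (from memory, so the exact invocation may differ slightly):
>
>   ceph tell osd.580 injectargs '--debug_osd 20 --debug_ms 1'
>
> With those levels in place, I see the following section repeated ad
> infinitum in the osd.580 log: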
>
> === BEGIN LOG ===
>
> 2016-07-25 15:56:12.682241 7f262e8ed700  1 -- 10.54.10.27:6818/913781 <==
> osd.453 10.54.10.28:7010/1375782 236358 ==== pg_scan(get_digest 33.6c1s5
> 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 ==== 309+0+0
> (3953350617 0 0) 0x3a11d700 con 0x3656c420
> 2016-07-25 15:56:12.682273 7f262e8ed700 10 osd.580 25226 handle_replica_op
> pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e
> 25226/25226) v2 epoch 25226
> 2016-07-25 15:56:12.682278 7f262e8ed700 20 osd.580 25226 should_share_map
> osd.453 10.54.10.28:7010/1375782 25226
> 2016-07-25 15:56:12.682284 7f262e8ed700 15 osd.580 25226 enqueue_op
> 0x3d503600 prio 127 cost 0 latency 0.000051 pg_scan(get_digest 33.6c1s5
> 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
> 2016-07-25 15:56:12.682325 7f26724d1700 10 osd.580 25226 dequeue_op
> 0x3d503600 prio 127 cost 0 latency 0.000092 pg_scan(get_digest 33.6c1s5
> 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 pg
> pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb
> 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c
> 5635/25223 25224/25225/25001)
> [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]
> r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped]
> 2016-07-25 15:56:12.682353 7f26724d1700 10 osd.580 pg_epoch: 25226
> pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb
> 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c
> 5635/25223 25224/25225/25001)
> [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]
> r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped]
> handle_message: pg_scan(get_digest 33.6c1s5
> 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
> 2016-07-25 15:56:12.682366 7f26724d1700 10 osd.580 pg_epoch: 25226
> pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb
> 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c
> 5635/25223 25224/25225/25001)
> [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]
> r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped]
> do_scan pg_scan(get_digest 33.6c1s5
> 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
> 2016-07-25 15:56:12.682377 7f26724d1700 10 osd.580 pg_epoch: 25226
> pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb
> 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c
> 5635/25223 25224/25225/25001)
> [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]
> r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped]
> scan_range from 981926c1/default.421929.15_MY_OBJECT
> 2016-07-25 15:56:12.694086 7f26724d1700 10 osd.580 pg_epoch: 25226
> pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb
> 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c
> 5635/25223 25224/25225/25001)
> [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]
> r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped]
> got 0 items, next 981926c1/default.421929.15_MY_OBJECT
> 2016-07-25 15:56:12.694113 7f26724d1700 20 osd.580 pg_epoch: 25226
> pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb
> 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c
> 5635/25223 25224/25225/25001)
> [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]
> r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped]
> []
> 2016-07-25 15:56:12.694129 7f26724d1700  1 -- 10.54.10.27:6818/913781 -->
> 10.54.10.28:7010/1375782 -- pg_scan(digest 33.6c1s0
> 981926c1/default.421929.15_MY_OBJECT-981926c1/default.421929.15_MY_OBJECT e
> 25226/25226) v2 -- ?+4 0x3a7b7200 con 0x3656c420
> 2016-07-25 15:56:12.694233 7f26724d1700 10 osd.580 25226 dequeue_op
> 0x3d503600 finish
>
> === END LOG ===
>
> I'm in the process of digging through the OSD code to understand what's
> going on here, but I figured I would reach out to the community in the hopes
> that someone could point me in the right direction.  If anyone has seen this
> before and can recommend a course of action, I'm all ears.  And if there's
> any other information I can provide, please let me know what else would be
> helpful.
>
> Many thanks to any who can lend a hand or teach a man to fish.
>
> Brian Felton
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
