Re: Backfilling pgs not making progress

Brian Felton <bjfelton@xxxxxxxxx> · Tue, 26 Jul 2016 09:07:19 -0500

If I search on osd.580, I find default.421929.15\uTEPP\s84316222-6ddd-4ac9-8283-6fa1cdcf9b88\sbackups\s20160630091353\sp1\s\sShares\sWarehouse\sLondonWarehouse\sLondon\sRon picture's\sMISCELLANEOUS\s2014\sOct., 2014\sOct. 1\sDSC04329.JPG__head_981926C1__21_ffffffffffffffff_5, which has a non-zero size and a hash (981926C1) that matches that of the same file found on the other OSDs in the pg.

If I'm misunderstanding what you're asking about a dangling link, please point me in the right direction.

Brian

On Tue, Jul 26, 2016 at 8:59 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
Did you also confirm that the backfill target does not have any of

those dangling links?  I'd be looking for a dangling link for

981926c1/default.421929.15_TEPP/84316222-6ddd-4ac9-8283-6fa1cdcf9b88/backups/20160630091353/p1//Shares/Warehouse/LondonWarehouse/London/Ron

picture's/MISCELLANEOUS/2014/Oct., 2014/Oct. 1/DSC04329.JPG/head//33

on osd.580.

-Sam

On Mon, Jul 25, 2016 at 9:04 PM, Brian Felton <bjfelton@xxxxxxxxx> wrote:

> Sam,

>

> I cranked up the logging on the backfill target (osd 580 on node 07) and the

> acting primary for the pg (453 on node 08, for what it's worth).  The logs

> from the primary are very large, so pardon the tarballs.

>

> PG Primary Logs:

> https://www.dropbox.com/s/ipjobn2i5ban9km/backfill-primary-log.tgz?dl=0B

> PG Backfill Target Logs:

> https://www.dropbox.com/s/9qpiqsnahx0qc5k/backfill-target-log.tgz?dl=0

>

> I'll be reviewing them with my team tomorrow morning to see if we can find

> anything.  Thanks for your assistance.

>

> Brian

>

> On Mon, Jul 25, 2016 at 3:33 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:

>>

>> The next thing I'd want is for you to reproduce with

>>

>> debug osd = 20

>> debug filestore = 20

>> debug ms = 1

>>

>> and post the file somewhere.

>> -Sam

>>

>> On Mon, Jul 25, 2016 at 1:33 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:

>> > If you don't have the orphaned file link, it's not the same bug.

>> > -Sam

>> >

>> > On Mon, Jul 25, 2016 at 12:55 PM, Brian Felton <bjfelton@xxxxxxxxx>

>> > wrote:

>> >> Sam,

>> >>

>> >> I'm reviewing that thread now, but I'm not seeing a lot of overlap with

>> >> my

>> >> cluster's situation.  For one, I am unable to start either a repair or

>> >> a

>> >> deep scrub on any of the affected pgs.  I've instructed all six of the

>> >> pgs

>> >> to scrub, deep-scrub, and repair, and the cluster has been gleefully

>> >> ignoring these requests (it has been several hours since I first tried,

>> >> and

>> >> the logs indicate none of the pgs ever scrubbed).  Second, none of the

>> >> my

>> >> OSDs is crashing.  Third, none of my pgs or objects has ever been

>> >> marked

>> >> inconsistent (or unfound, for that matter) -- I'm only seeing the

>> >> standard

>> >> mix of degraded/misplaced objects that are common during a recovery.

>> >> What

>> >> I'm not seeing is any further progress on the number of misplaced

>> >> objects --

>> >> the number has remained effectively unchanged for the past several

>> >> days.

>> >>

>> >> To be sure, though, I tracked down the file that the backfill operation

>> >> seems to be hung on, and I can find it in both the backfill target osd

>> >> (580)

>> >> and a few other osds in the pg.  In all cases, I was able to find the

>> >> file

>> >> with an identical hash value on all nodes, and I didn't find any

>> >> duplicates

>> >> or potential orphans.  Also, none of the objects involves have long

>> >> names,

>> >> so they're not using the special ceph long filename handling.

>> >>

>> >> Also, we are not using XFS on our OSDs; we are using ZFS instead.

>> >>

>> >> If I'm misunderstanding the issue linked above and the corresponding

>> >> thread,

>> >> please let me know.

>> >>

>> >> Brian

>> >>

>> >>

>> >> On Mon, Jul 25, 2016 at 1:32 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:

>> >>>

>> >>> You may have hit http://tracker.ceph.com/issues/14766.  There was a

>> >>> thread on the list a while back about diagnosing and fixing it.

>> >>> -Sam

>> >>>

>> >>> On Mon, Jul 25, 2016 at 10:45 AM, Brian Felton <bjfelton@xxxxxxxxx>

>> >>> wrote:

>> >>> > Greetings,

>> >>> >

>> >>> > Problem: After removing (out + crush remove + auth del + osd rm)

>> >>> > three

>> >>> > osds

>> >>> > on a single host, I have six pgs that, after 10 days of recovery,

>> >>> > are

>> >>> > stuck

>> >>> > in a state of active+undersized+degraded+remapped+backfilling.

>> >>> >

>> >>> > Cluster details:

>> >>> >  - 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04, 72 6TB SAS2 drives

>> >>> > per

>> >>> > host,

>> >>> > collocated journals) -- one host now has 69 drives

>> >>> >  - Hammer 0.94.6

>> >>> >  - object storage use only

>> >>> >  - erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs)

>> >>> >  - failure domain of host

>> >>> >  - cluster is currently storing 178TB over 260 MObjects (5-6%

>> >>> > utilization

>> >>> > per OSD)

>> >>> >  - all 6 stuck pgs belong to .rgw.buckets

>> >>> >

>> >>> > The relevant section of our crushmap:

>> >>> >

>> >>> > rule .rgw.buckets {

>> >>> >         ruleset 1

>> >>> >         type erasure

>> >>> >         min_size 7

>> >>> >         max_size 9

>> >>> >         step set_chooseleaf_tries 5

>> >>> >         step set_choose_tries 250

>> >>> >         step take default

>> >>> >         step chooseleaf indep 0 type host

>> >>> >         step emit

>> >>> > }

>> >>> >

>> >>> > This isn't the first time we've lost a disk (not even the first time

>> >>> > we've

>> >>> > lost multiple disks on a host in a single event), so we're used to

>> >>> > the

>> >>> > extended recovery times and understand this is going to be A Thing

>> >>> > until

>> >>> > we

>> >>> > can introduce SSD journals.  This is, however, the first time we've

>> >>> > had

>> >>> > pgs

>> >>> > not return to an active+clean state after a couple days.  As far as

>> >>> > I

>> >>> > can

>> >>> > tell, our cluster is no longer making progress on the backfill

>> >>> > operations,

>> >>> > and I'm looking for advice on how to get things moving again.

>> >>> >

>> >>> > Here's a dump of the stuck pgs:

>> >>> >

>> >>> > ceph pg dump_stuck

>> >>> > ok

>> >>> > pg_stat state   up      up_primary      acting  acting_primary

>> >>> > 33.151d active+undersized+degraded+remapped+backfilling

>> >>> > [424,546,273,167,471,631,155,38,47]     424

>> >>> > [424,546,273,167,471,631,155,38,2147483647]     424

>> >>> > 33.6c1  active+undersized+degraded+remapped+backfilling

>> >>> > [453,86,565,266,338,580,297,577,404]    453

>> >>> > [453,86,565,266,338,2147483647,297,577,404]     453

>> >>> > 33.17b7 active+undersized+degraded+remapped+backfilling

>> >>> > [399,432,437,541,547,219,229,104,47]    399

>> >>> > [399,432,437,541,547,219,229,104,2147483647]    399

>> >>> > 33.150d active+undersized+degraded+remapped+backfilling

>> >>> > [555,452,511,550,643,431,141,329,486]   555

>> >>> > [555,2147483647,511,550,643,431,141,329,486]    555

>> >>> > 33.13a8 active+undersized+degraded+remapped+backfilling

>> >>> > [507,317,276,617,565,28,471,200,382]    507

>> >>> > [507,2147483647,276,617,565,28,471,200,382]     507

>> >>> > 33.4c1  active+undersized+degraded+remapped+backfilling

>> >>> > [413,440,464,129,641,416,295,266,431]   413

>> >>> > [413,440,2147483647,129,641,416,295,266,431]    413

>> >>> >

>> >>> > Based on a review of previous postings about this issue, I initially

>> >>> > suspected that crush couldn't map the pg to an OSD (based on MAX_INT

>> >>> > in

>> >>> > the

>> >>> > acting list), so I increased set_choose_tries from 50 to 200, and

>> >>> > then

>> >>> > again

>> >>> > to 250 just to see if it would do anything.  These changes had no

>> >>> > effect

>> >>> > that I could discern.

>> >>> >

>> >>> > I next reviewed the output of ceph pg <pgid> query, and I see

>> >>> > something

>> >>> > similar to the following for each of my stuck pgs:

>> >>> >

>> >>> > {

>> >>> >     "state": "active+undersized+degraded+remapped+backfilling",

>> >>> >     "snap_trimq": "[]",

>> >>> >     "epoch": 25211,

>> >>> >     "up": [

>> >>> >         453,

>> >>> >         86,

>> >>> >         565,

>> >>> >         266,

>> >>> >         338,

>> >>> >         580,

>> >>> >         297,

>> >>> >         577,

>> >>> >         404

>> >>> >     ],

>> >>> >     "acting": [

>> >>> >         453,

>> >>> >         86,

>> >>> >         565,

>> >>> >         266,

>> >>> >         338,

>> >>> >         2147483647,

>> >>> >         297,

>> >>> >         577,

>> >>> >         404

>> >>> >     ],

>> >>> >     "backfill_targets": [

>> >>> >         "580(5)"

>> >>> >     ],

>> >>> >     "actingbackfill": [

>> >>> >         "86(1)",

>> >>> >         "266(3)",

>> >>> >         "297(6)",

>> >>> >         "338(4)",

>> >>> >         "404(8)",

>> >>> >         "453(0)",

>> >>> >         "565(2)",

>> >>> >         "577(7)",

>> >>> >         "580(5)"

>> >>> >     ]

>> >>> >

>> >>> > In this case, 580 is a valid OSD on the node that lost the 3 OSDs

>> >>> > (node

>> >>> > 7).

>> >>> > For the other five pgs, the situation is the same -- the backfill

>> >>> > target

>> >>> > is

>> >>> > a valid OSD on node 7.

>> >>> >

>> >>> > If I dig further into the 'query' output, I encounter the following:

>> >>> >

>> >>> >     "recovery_state": [

>> >>> >         {

>> >>> >             "name": "Started\/Primary\/Active",

>> >>> >             "enter_time": "2016-07-24 18:52:51.653375",

>> >>> >             "might_have_unfound": [],

>> >>> >             "recovery_progress": {

>> >>> >                 "backfill_targets": [

>> >>> >                     "580(5)"

>> >>> >                 ],

>> >>> >                 "waiting_on_backfill": [

>> >>> >                     "580(5)"

>> >>> >                 ],

>> >>> >                 "last_backfill_started":

>> >>> > "981926c1\/default.421929.15_MY_OBJECT",

>> >>> >                 "backfill_info": {

>> >>> >                     "begin":

>> >>> > "391926c1\/default.9468.416_0080a34a\/head\/\/33",

>> >>> >                     "end":

>> >>> >

>> >>> >

>> >>> > "464b26c1\/default.284327.111_MBS-b965c481-4320-439b-ad56-9e4212c2fe8f\/CBB_WWTXPVDHCP\/C:\/Windows\/WinSxS\/amd64_wialx00a.inf_31bf3856ad364e35_6.3.9600.17415_none_b2e446f1befcebe5\/LXAA2DeviceDescription.xml:\/20150924104532\/LXAA2DeviceDescription.xml\/head\/\/33",

>> >>> >                     "objects": [

>> >>> >                         {

>> >>> >                             "object":

>> >>> > "391926c1\/default.9468.416_0080a34a\/head\/\/33",

>> >>> >                             "version": "5356'86333"

>> >>> >                         },

>> >>> > ...

>> >>> >

>> >>> > Trying to understand what was going on, I shut off client traffic to

>> >>> > the

>> >>> > cluster and set debug_osd 20 debug_ms 1 on osd.580.  I see the

>> >>> > following

>> >>> > section repeated ad infinitum:

>> >>> >

>> >>> > === BEGIN LOG ===

>> >>> >

>> >>> > 2016-07-25 15:56:12.682241 7f262e8ed700  1 --

>> >>> > 10.54.10.27:6818/913781

>> >>> > <==

>> >>> > osd.453 10.54.10.28:7010/1375782 236358 ==== pg_scan(get_digest

>> >>> > 33.6c1s5

>> >>> > 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 ====

>> >>> > 309+0+0

>> >>> > (3953350617 0 0) 0x3a11d700 con 0x3656c420

>> >>> > 2016-07-25 15:56:12.682273 7f262e8ed700 10 osd.580 25226

>> >>> > handle_replica_op

>> >>> > pg_scan(get_digest 33.6c1s5

>> >>> > 981926c1/default.421929.15_MY_OBJECT-0//0//33 e

>> >>> > 25226/25226) v2 epoch 25226

>> >>> > 2016-07-25 15:56:12.682278 7f262e8ed700 20 osd.580 25226

>> >>> > should_share_map

>> >>> > osd.453 10.54.10.28:7010/1375782 25226

>> >>> > 2016-07-25 15:56:12.682284 7f262e8ed700 15 osd.580 25226 enqueue_op

>> >>> > 0x3d503600 prio 127 cost 0 latency 0.000051 pg_scan(get_digest

>> >>> > 33.6c1s5

>> >>> > 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2

>> >>> > 2016-07-25 15:56:12.682325 7f26724d1700 10 osd.580 25226 dequeue_op

>> >>> > 0x3d503600 prio 127 cost 0 latency 0.000092 pg_scan(get_digest

>> >>> > 33.6c1s5

>> >>> > 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 pg

>> >>> > pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb

>> >>> > 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390

>> >>> > les/c

>> >>> > 5635/25223 25224/25225/25001)

>> >>> >

>> >>> >

>> >>> > [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]

>> >>> > r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584

>> >>> > active+remapped]

>> >>> > 2016-07-25 15:56:12.682353 7f26724d1700 10 osd.580 pg_epoch: 25226

>> >>> > pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb

>> >>> > 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390

>> >>> > les/c

>> >>> > 5635/25223 25224/25225/25001)

>> >>> >

>> >>> >

>> >>> > [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]

>> >>> > r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584

>> >>> > active+remapped]

>> >>> > handle_message: pg_scan(get_digest 33.6c1s5

>> >>> > 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2

>> >>> > 2016-07-25 15:56:12.682366 7f26724d1700 10 osd.580 pg_epoch: 25226

>> >>> > pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb

>> >>> > 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390

>> >>> > les/c

>> >>> > 5635/25223 25224/25225/25001)

>> >>> >

>> >>> >

>> >>> > [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]

>> >>> > r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584

>> >>> > active+remapped]

>> >>> > do_scan pg_scan(get_digest 33.6c1s5

>> >>> > 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2

>> >>> > 2016-07-25 15:56:12.682377 7f26724d1700 10 osd.580 pg_epoch: 25226

>> >>> > pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb

>> >>> > 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390

>> >>> > les/c

>> >>> > 5635/25223 25224/25225/25001)

>> >>> >

>> >>> >

>> >>> > [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]

>> >>> > r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584

>> >>> > active+remapped]

>> >>> > scan_range from 981926c1/default.421929.15_MY_OBJECT

>> >>> > 2016-07-25 15:56:12.694086 7f26724d1700 10 osd.580 pg_epoch: 25226

>> >>> > pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb

>> >>> > 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390

>> >>> > les/c

>> >>> > 5635/25223 25224/25225/25001)

>> >>> >

>> >>> >

>> >>> > [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]

>> >>> > r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584

>> >>> > active+remapped]

>> >>> > got 0 items, next 981926c1/default.421929.15_MY_OBJECT

>> >>> > 2016-07-25 15:56:12.694113 7f26724d1700 20 osd.580 pg_epoch: 25226

>> >>> > pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb

>> >>> > 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390

>> >>> > les/c

>> >>> > 5635/25223 25224/25225/25001)

>> >>> >

>> >>> >

>> >>> > [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404]

>> >>> > r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584

>> >>> > active+remapped]

>> >>> > []

>> >>> > 2016-07-25 15:56:12.694129 7f26724d1700  1 --

>> >>> > 10.54.10.27:6818/913781

>> >>> > -->

>> >>> > 10.54.10.28:7010/1375782 -- pg_scan(digest 33.6c1s0

>> >>> >

>> >>> >

>> >>> > 981926c1/default.421929.15_MY_OBJECT-981926c1/default.421929.15_MY_OBJECT e

>> >>> > 25226/25226) v2 -- ?+4 0x3a7b7200 con 0x3656c420

>> >>> > 2016-07-25 15:56:12.694233 7f26724d1700 10 osd.580 25226 dequeue_op

>> >>> > 0x3d503600 finish

>> >>> >

>> >>> > === END LOG ===

>> >>> >

>> >>> > I'm in the process of digging through the OSD code to understand

>> >>> > what's

>> >>> > going on here, but I figured I would reach out to the community in

>> >>> > the

>> >>> > hopes

>> >>> > that someone could point me in the right direction.  If anyone has

>> >>> > seen

>> >>> > this

>> >>> > before and can recommend a course of action, I'm all ears.  And if

>> >>> > there's

>> >>> > any other information I can provide, please let me know what else

>> >>> > would

>> >>> > be

>> >>> > helpful.

>> >>> >

>> >>> > Many thanks to any who can lend a hand or teach a man to fish.

>> >>> >

>> >>> > Brian Felton

>> >>> >

>> >>> > _______________________________________________

>> >>> > ceph-users mailing list

>> >>> > ceph-users@xxxxxxxxxxxxxx

>> >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>> >>> >

>> >>

>> >>

>

>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com