If you don't have the orphaned file link, it's not the same bug.
-Sam

On Mon, Jul 25, 2016 at 12:55 PM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
> Sam,
>
> I'm reviewing that thread now, but I'm not seeing a lot of overlap with my cluster's situation. For one, I am unable to start either a repair or a deep scrub on any of the affected pgs. I've instructed all six of the pgs to scrub, deep-scrub, and repair, and the cluster has been gleefully ignoring these requests (it has been several hours since I first tried, and the logs indicate none of the pgs ever scrubbed). Second, none of my OSDs is crashing. Third, none of my pgs or objects has ever been marked inconsistent (or unfound, for that matter) -- I'm only seeing the standard mix of degraded/misplaced objects that are common during a recovery. What I'm not seeing is any further progress on the number of misplaced objects -- the number has remained effectively unchanged for the past several days.
>
> To be sure, though, I tracked down the file that the backfill operation seems to be hung on, and I can find it on both the backfill target osd (580) and a few other osds in the pg. In all cases, I was able to find the file with an identical hash value on all nodes, and I didn't find any duplicates or potential orphans. Also, none of the objects involved has a long name, so they're not using the special ceph long filename handling.
>
> Also, we are not using XFS on our OSDs; we are using ZFS instead.
>
> If I'm misunderstanding the issue linked above and the corresponding thread, please let me know.
>
> Brian
>
>
> On Mon, Jul 25, 2016 at 1:32 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>
>> You may have hit http://tracker.ceph.com/issues/14766. There was a thread on the list a while back about diagnosing and fixing it.
>> -Sam
>>
>> On Mon, Jul 25, 2016 at 10:45 AM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
>> > Greetings,
>> >
>> > Problem: After removing (out + crush remove + auth del + osd rm) three osds on a single host, I have six pgs that, after 10 days of recovery, are stuck in a state of active+undersized+degraded+remapped+backfilling.
>> >
>> > Cluster details:
>> > - 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04, 72 6TB SAS2 drives per host, collocated journals) -- one host now has 69 drives
>> > - Hammer 0.94.6
>> > - object storage use only
>> > - erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs)
>> > - failure domain of host
>> > - cluster is currently storing 178TB over 260 MObjects (5-6% utilization per OSD)
>> > - all 6 stuck pgs belong to .rgw.buckets
>> >
>> > The relevant section of our crushmap:
>> >
>> > rule .rgw.buckets {
>> >         ruleset 1
>> >         type erasure
>> >         min_size 7
>> >         max_size 9
>> >         step set_chooseleaf_tries 5
>> >         step set_choose_tries 250
>> >         step take default
>> >         step chooseleaf indep 0 type host
>> >         step emit
>> > }
>> >
>> > This isn't the first time we've lost a disk (not even the first time we've lost multiple disks on a host in a single event), so we're used to the extended recovery times and understand this is going to be A Thing until we can introduce SSD journals. This is, however, the first time we've had pgs not return to an active+clean state after a couple of days. As far as I can tell, our cluster is no longer making progress on the backfill operations, and I'm looking for advice on how to get things moving again.
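>> >
>> > For completeness, by "out + crush remove + auth del + osd rm" above I mean roughly the standard removal sequence, repeated once per dead disk (the osd id 123 below is just a placeholder):
>> >
>> >     # sketch of the removal steps; 123 stands in for each of the three removed osds
>> >     ceph osd out 123
>> >     ceph osd crush remove osd.123
>> >     ceph auth del osd.123
>> >     ceph osd rm 123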
>> >
>> > Here's a dump of the stuck pgs:
>> >
>> > ceph pg dump_stuck
>> > ok
>> > pg_stat   state                                             up  up_primary  acting  acting_primary
>> > 33.151d   active+undersized+degraded+remapped+backfilling
>> >           [424,546,273,167,471,631,155,38,47]  424
>> >           [424,546,273,167,471,631,155,38,2147483647]  424
>> > 33.6c1    active+undersized+degraded+remapped+backfilling
>> >           [453,86,565,266,338,580,297,577,404]  453
>> >           [453,86,565,266,338,2147483647,297,577,404]  453
>> > 33.17b7   active+undersized+degraded+remapped+backfilling
>> >           [399,432,437,541,547,219,229,104,47]  399
>> >           [399,432,437,541,547,219,229,104,2147483647]  399
>> > 33.150d   active+undersized+degraded+remapped+backfilling
>> >           [555,452,511,550,643,431,141,329,486]  555
>> >           [555,2147483647,511,550,643,431,141,329,486]  555
>> > 33.13a8   active+undersized+degraded+remapped+backfilling
>> >           [507,317,276,617,565,28,471,200,382]  507
>> >           [507,2147483647,276,617,565,28,471,200,382]  507
>> > 33.4c1    active+undersized+degraded+remapped+backfilling
>> >           [413,440,464,129,641,416,295,266,431]  413
>> >           [413,440,2147483647,129,641,416,295,266,431]  413
>> >
>> > Based on a review of previous postings about this issue, I initially suspected that crush couldn't map the pg to an OSD (based on the MAX_INT entry in the acting list), so I increased set_choose_tries from 50 to 200, and then again to 250 just to see if it would do anything. These changes had no effect that I could discern.
>> >
>> > I next reviewed the output of ceph pg <pgid> query, and I see something similar to the following for each of my stuck pgs:
>> >
>> > {
>> >     "state": "active+undersized+degraded+remapped+backfilling",
>> >     "snap_trimq": "[]",
>> >     "epoch": 25211,
>> >     "up": [
>> >         453,
>> >         86,
>> >         565,
>> >         266,
>> >         338,
>> >         580,
>> >         297,
>> >         577,
>> >         404
>> >     ],
>> >     "acting": [
>> >         453,
>> >         86,
>> >         565,
>> >         266,
>> >         338,
>> >         2147483647,
>> >         297,
>> >         577,
>> >         404
>> >     ],
>> >     "backfill_targets": [
>> >         "580(5)"
>> >     ],
>> >     "actingbackfill": [
>> >         "86(1)",
>> >         "266(3)",
>> >         "297(6)",
>> >         "338(4)",
>> >         "404(8)",
>> >         "453(0)",
>> >         "565(2)",
>> >         "577(7)",
>> >         "580(5)"
>> >     ]
>> >
>> > In this case, 580 is a valid OSD on the node that lost the 3 OSDs (node 7). For the other five pgs, the situation is the same -- the backfill target is a valid OSD on node 7.
>> >
>> > If I dig further into the 'query' output, I encounter the following:
>> >
>> >     "recovery_state": [
>> >         {
>> >             "name": "Started\/Primary\/Active",
>> >             "enter_time": "2016-07-24 18:52:51.653375",
>> >             "might_have_unfound": [],
>> >             "recovery_progress": {
>> >                 "backfill_targets": [
>> >                     "580(5)"
>> >                 ],
>> >                 "waiting_on_backfill": [
>> >                     "580(5)"
>> >                 ],
>> >                 "last_backfill_started": "981926c1\/default.421929.15_MY_OBJECT",
>> >                 "backfill_info": {
>> >                     "begin": "391926c1\/default.9468.416_0080a34a\/head\/\/33",
>> >                     "end": "464b26c1\/default.284327.111_MBS-b965c481-4320-439b-ad56-9e4212c2fe8f\/CBB_WWTXPVDHCP\/C:\/Windows\/WinSxS\/amd64_wialx00a.inf_31bf3856ad364e35_6.3.9600.17415_none_b2e446f1befcebe5\/LXAA2DeviceDescription.xml:\/20150924104532\/LXAA2DeviceDescription.xml\/head\/\/33",
>> >                     "objects": [
>> >                         {
>> >                             "object": "391926c1\/default.9468.416_0080a34a\/head\/\/33",
>> >                             "version": "5356'86333"
>> >                         },
>> >                         ...
>> >
>> > Trying to understand what was going on, I shut off client traffic to the cluster and set debug_osd 20 debug_ms 1 on osd.580.
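>> >
>> > For anyone following along, those debug levels can be raised at runtime with something like the following (the exact invocation I used may have differed slightly):
>> >
>> >     # bump logging on the backfill target osd at runtime
>> >     ceph tell osd.580 injectargs '--debug_osd 20 --debug_ms 1'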
>> >
>> > I see the following section repeated ad infinitum:
>> >
>> > === BEGIN LOG ===
>> >
>> > 2016-07-25 15:56:12.682241 7f262e8ed700  1 -- 10.54.10.27:6818/913781 <== osd.453 10.54.10.28:7010/1375782 236358 ==== pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 ==== 309+0+0 (3953350617 0 0) 0x3a11d700 con 0x3656c420
>> > 2016-07-25 15:56:12.682273 7f262e8ed700 10 osd.580 25226 handle_replica_op pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 epoch 25226
>> > 2016-07-25 15:56:12.682278 7f262e8ed700 20 osd.580 25226 should_share_map osd.453 10.54.10.28:7010/1375782 25226
>> > 2016-07-25 15:56:12.682284 7f262e8ed700 15 osd.580 25226 enqueue_op 0x3d503600 prio 127 cost 0 latency 0.000051 pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
>> > 2016-07-25 15:56:12.682325 7f26724d1700 10 osd.580 25226 dequeue_op 0x3d503600 prio 127 cost 0 latency 0.000092 pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 pg pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped]
>> > 2016-07-25 15:56:12.682353 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] handle_message: pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
>> > 2016-07-25 15:56:12.682366 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] do_scan pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
>> > 2016-07-25 15:56:12.682377 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] scan_range from 981926c1/default.421929.15_MY_OBJECT
>> > 2016-07-25 15:56:12.694086 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] got 0 items, next 981926c1/default.421929.15_MY_OBJECT
>> > 2016-07-25 15:56:12.694113 7f26724d1700 20 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] []
>> > 2016-07-25 15:56:12.694129 7f26724d1700  1 -- 10.54.10.27:6818/913781 --> 10.54.10.28:7010/1375782 -- pg_scan(digest 33.6c1s0 981926c1/default.421929.15_MY_OBJECT-981926c1/default.421929.15_MY_OBJECT e 25226/25226) v2 -- ?+4 0x3a7b7200 con 0x3656c420
>> > 2016-07-25 15:56:12.694233 7f26724d1700 10 osd.580 25226 dequeue_op 0x3d503600 finish
>> >
>> > === END LOG ===
>> >
>> > I'm in the process of digging through the OSD code to understand what's going on here, but I figured I would reach out to the community in the hopes that someone could point me in the right direction. If anyone has seen this before and can recommend a course of action, I'm all ears. And if there's any other information I can provide, please let me know what else would be helpful.
>> >
>> > Many thanks to any who can lend a hand or teach a man to fish.
>> >
>> > Brian Felton

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com