I just updated the bug with several questions.
-Sam

On Thu, Aug 11, 2016 at 6:56 AM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
> Sam,
>
> I very much appreciate the assistance. I have opened
> http://tracker.ceph.com/issues/16997 to track this (potential) issue.
>
> Brian
>
> On Wed, Aug 10, 2016 at 1:53 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>
>> Ok, can you
>> 1) Open a bug
>> 2) Identify all osds involved in the 5 problem pgs
>> 3) enable debug osd = 20, debug filestore = 20, debug ms = 1 on all of them
>> 4) mark the primary for each pg down (should cause peering and backfill to restart)
>> 5) link all logs to the bug
>>
>> Thanks!
>> -Sam
>>
>> On Tue, Jul 26, 2016 at 9:11 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> > Hmm, nvm, it's not an lfn object anyway.
>> > -Sam
>> >
>> > On Tue, Jul 26, 2016 at 7:07 AM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
>> >> If I search on osd.580, I find
>> >>
>> >> default.421929.15\uTEPP\s84316222-6ddd-4ac9-8283-6fa1cdcf9b88\sbackups\s20160630091353\sp1\s\sShares\sWarehouse\sLondonWarehouse\sLondon\sRon picture's\sMISCELLANEOUS\s2014\sOct., 2014\sOct. 1\sDSC04329.JPG__head_981926C1__21_ffffffffffffffff_5,
>> >> which has a non-zero size and a hash (981926C1) that matches that of the same file found on the other OSDs in the pg.
>> >>
>> >> If I'm misunderstanding what you're asking about a dangling link, please point me in the right direction.
>> >>
>> >> Brian
>> >>
>> >> On Tue, Jul 26, 2016 at 8:59 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> >>>
>> >>> Did you also confirm that the backfill target does not have any of those dangling links? I'd be looking for a dangling link for
>> >>> 981926c1/default.421929.15_TEPP/84316222-6ddd-4ac9-8283-6fa1cdcf9b88/backups/20160630091353/p1//Shares/Warehouse/LondonWarehouse/London/Ron picture's/MISCELLANEOUS/2014/Oct., 2014/Oct. 1/DSC04329.JPG/head//33
>> >>> on osd.580.
>> >>> -Sam
>> >>>
>> >>> On Mon, Jul 25, 2016 at 9:04 PM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
>> >>> > Sam,
>> >>> >
>> >>> > I cranked up the logging on the backfill target (osd 580 on node 07) and the acting primary for the pg (453 on node 08, for what it's worth). The logs from the primary are very large, so pardon the tarballs.
>> >>> >
>> >>> > PG Primary Logs:
>> >>> > https://www.dropbox.com/s/ipjobn2i5ban9km/backfill-primary-log.tgz?dl=0
>> >>> > PG Backfill Target Logs:
>> >>> > https://www.dropbox.com/s/9qpiqsnahx0qc5k/backfill-target-log.tgz?dl=0
>> >>> >
>> >>> > I'll be reviewing them with my team tomorrow morning to see if we can find anything. Thanks for your assistance.
>> >>> >
>> >>> > Brian
>> >>> >
>> >>> > On Mon, Jul 25, 2016 at 3:33 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> >>> >>
>> >>> >> The next thing I'd want is for you to reproduce with
>> >>> >>
>> >>> >> debug osd = 20
>> >>> >> debug filestore = 20
>> >>> >> debug ms = 1
>> >>> >>
>> >>> >> and post the file somewhere.
>> >>> >> -Sam
>> >>> >>
>> >>> >> On Mon, Jul 25, 2016 at 1:33 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> >>> >> > If you don't have the orphaned file link, it's not the same bug.
>> >>> >> > -Sam
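(For anyone following along: a quick way to look for the kind of duplicate or orphaned entry being discussed here is to search the pg's filestore directory on the backfill target for the object's name and hash fragment. A rough sketch, assuming the default /var/lib/ceph layout and that osd.580 holds shard 5 of pg 33.6c1:)

    # on the host carrying osd.580
    find /var/lib/ceph/osd/ceph-580/current/33.6c1s5_head -name '*DSC04329.JPG*' -ls
    find /var/lib/ceph/osd/ceph-580/current/33.6c1s5_head -name '*981926C1*' -ls
    # more than one entry for the same object (or a zero-length stray) would
    # point at the orphaned-link situation described in issue 14766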
>> >>> >> >
>> >>> >> > On Mon, Jul 25, 2016 at 12:55 PM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
>> >>> >> >> Sam,
>> >>> >> >>
>> >>> >> >> I'm reviewing that thread now, but I'm not seeing a lot of overlap with my cluster's situation. For one, I am unable to start either a repair or a deep scrub on any of the affected pgs. I've instructed all six of the pgs to scrub, deep-scrub, and repair, and the cluster has been gleefully ignoring these requests (it has been several hours since I first tried, and the logs indicate none of the pgs ever scrubbed). Second, none of my OSDs is crashing. Third, none of my pgs or objects has ever been marked inconsistent (or unfound, for that matter) -- I'm only seeing the standard mix of degraded/misplaced objects that are common during a recovery. What I'm not seeing is any further progress on the number of misplaced objects -- the number has remained effectively unchanged for the past several days.
>> >>> >> >>
>> >>> >> >> To be sure, though, I tracked down the file that the backfill operation seems to be hung on, and I can find it in both the backfill target osd (580) and a few other osds in the pg. In all cases, I was able to find the file with an identical hash value on all nodes, and I didn't find any duplicates or potential orphans. Also, none of the objects involved have long names, so they're not using the special ceph long filename handling.
>> >>> >> >>
>> >>> >> >> Also, we are not using XFS on our OSDs; we are using ZFS instead.
>> >>> >> >>
>> >>> >> >> If I'm misunderstanding the issue linked above and the corresponding thread, please let me know.
>> >>> >> >>
>> >>> >> >> Brian
>> >>> >> >>
>> >>> >> >> On Mon, Jul 25, 2016 at 1:32 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> >>> >> >>>
>> >>> >> >>> You may have hit http://tracker.ceph.com/issues/14766. There was a thread on the list a while back about diagnosing and fixing it.
>> >>> >> >>> -Sam
>> >>> >> >>>
>> >>> >> >>> On Mon, Jul 25, 2016 at 10:45 AM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
>> >>> >> >>> > Greetings,
>> >>> >> >>> >
>> >>> >> >>> > Problem: After removing (out + crush remove + auth del + osd rm) three osds on a single host, I have six pgs that, after 10 days of recovery, are stuck in a state of active+undersized+degraded+remapped+backfilling.
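(For reference, the removal sequence mentioned above -- out + crush remove + auth del + osd rm -- corresponds roughly to the commands below; osd id 123 is illustrative, standing in for each of the three osds that were pulled:)

    ceph osd out 123                 # stop placing data on the osd; triggers remapping
    ceph osd crush remove osd.123    # drop it from the CRUSH map
    ceph auth del osd.123            # remove its cephx key
    ceph osd rm 123                  # remove it from the osdmap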
>> >>> >> >>> >
>> >>> >> >>> > Cluster details:
>> >>> >> >>> > - 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04, 72 6TB SAS2 drives per host, collocated journals) -- one host now has 69 drives
>> >>> >> >>> > - Hammer 0.94.6
>> >>> >> >>> > - object storage use only
>> >>> >> >>> > - erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs)
>> >>> >> >>> > - failure domain of host
>> >>> >> >>> > - cluster is currently storing 178TB over 260 MObjects (5-6% utilization per OSD)
>> >>> >> >>> > - all 6 stuck pgs belong to .rgw.buckets
>> >>> >> >>> >
>> >>> >> >>> > The relevant section of our crushmap:
>> >>> >> >>> >
>> >>> >> >>> > rule .rgw.buckets {
>> >>> >> >>> >     ruleset 1
>> >>> >> >>> >     type erasure
>> >>> >> >>> >     min_size 7
>> >>> >> >>> >     max_size 9
>> >>> >> >>> >     step set_chooseleaf_tries 5
>> >>> >> >>> >     step set_choose_tries 250
>> >>> >> >>> >     step take default
>> >>> >> >>> >     step chooseleaf indep 0 type host
>> >>> >> >>> >     step emit
>> >>> >> >>> > }
>> >>> >> >>> >
>> >>> >> >>> > This isn't the first time we've lost a disk (not even the first time we've lost multiple disks on a host in a single event), so we're used to the extended recovery times and understand this is going to be A Thing until we can introduce SSD journals. This is, however, the first time we've had pgs not return to an active+clean state after a couple days. As far as I can tell, our cluster is no longer making progress on the backfill operations, and I'm looking for advice on how to get things moving again.
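(For reference: the remedy Sam suggests in the replies quoted above -- raise debug logging on the osds involved, then mark the acting primary down so the pg re-peers and backfill restarts -- translates roughly into the commands below. Osd ids 453 and 580 are the ones from pg 33.6c1; the injectargs flags can also be set in ceph.conf and applied with a restart, so treat this as a sketch rather than the exact procedure used.)

    ceph tell osd.453 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
    ceph tell osd.580 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
    ceph osd down 453    # the osd rejoins on its own; peering and backfill restart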
>> >>> >> >>> >
>> >>> >> >>> > Here's a dump of the stuck pgs:
>> >>> >> >>> >
>> >>> >> >>> > ceph pg dump_stuck
>> >>> >> >>> > ok
>> >>> >> >>> > pg_stat state up up_primary acting acting_primary
>> >>> >> >>> > 33.151d active+undersized+degraded+remapped+backfilling [424,546,273,167,471,631,155,38,47] 424 [424,546,273,167,471,631,155,38,2147483647] 424
>> >>> >> >>> > 33.6c1 active+undersized+degraded+remapped+backfilling [453,86,565,266,338,580,297,577,404] 453 [453,86,565,266,338,2147483647,297,577,404] 453
>> >>> >> >>> > 33.17b7 active+undersized+degraded+remapped+backfilling [399,432,437,541,547,219,229,104,47] 399 [399,432,437,541,547,219,229,104,2147483647] 399
>> >>> >> >>> > 33.150d active+undersized+degraded+remapped+backfilling [555,452,511,550,643,431,141,329,486] 555 [555,2147483647,511,550,643,431,141,329,486] 555
>> >>> >> >>> > 33.13a8 active+undersized+degraded+remapped+backfilling [507,317,276,617,565,28,471,200,382] 507 [507,2147483647,276,617,565,28,471,200,382] 507
>> >>> >> >>> > 33.4c1 active+undersized+degraded+remapped+backfilling [413,440,464,129,641,416,295,266,431] 413 [413,440,2147483647,129,641,416,295,266,431] 413
>> >>> >> >>> >
>> >>> >> >>> > Based on a review of previous postings about this issue, I initially suspected that crush couldn't map the pg to an OSD (based on MAX_INT in the acting list), so I increased set_choose_tries from 50 to 200, and then again to 250 just to see if it would do anything. These changes had no effect that I could discern.
>> >>> >> >>> >
>> >>> >> >>> > I next reviewed the output of ceph pg <pgid> query, and I see something similar to the following for each of my stuck pgs:
>> >>> >> >>> >
>> >>> >> >>> > {
>> >>> >> >>> >     "state": "active+undersized+degraded+remapped+backfilling",
>> >>> >> >>> >     "snap_trimq": "[]",
>> >>> >> >>> >     "epoch": 25211,
>> >>> >> >>> >     "up": [ 453, 86, 565, 266, 338, 580, 297, 577, 404 ],
>> >>> >> >>> >     "acting": [ 453, 86, 565, 266, 338, 2147483647, 297, 577, 404 ],
>> >>> >> >>> >     "backfill_targets": [ "580(5)" ],
>> >>> >> >>> >     "actingbackfill": [ "86(1)", "266(3)", "297(6)", "338(4)", "404(8)", "453(0)", "565(2)", "577(7)", "580(5)" ]
>> >>> >> >>> >
>> >>> >> >>> > In this case, 580 is a valid OSD on the node that lost the 3 OSDs (node 7). For the other five pgs, the situation is the same -- the backfill target is a valid OSD on node 7.
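(One way to double-check that the 2147483647 entries reflect the shard still awaiting backfill rather than a CRUSH mapping failure -- i.e. that the rule shown earlier can still map a full set of 9 hosts -- is to test the live map offline with crushtool. A sketch, assuming ruleset 1 as in the rule above:)

    ceph osd getcrushmap -o crushmap.bin          # grab the in-use compiled crush map
    crushtool -d crushmap.bin -o crushmap.txt     # decompile it for inspection
    crushtool -i crushmap.bin --test --rule 1 --num-rep 9 --show-bad-mappings
    # no output from --show-bad-mappings means every input mapped to a full set of 9 osds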
>> >>> >> >>> >
>> >>> >> >>> > If I dig further into the 'query' output, I encounter the following:
>> >>> >> >>> >
>> >>> >> >>> > "recovery_state": [
>> >>> >> >>> >     {
>> >>> >> >>> >         "name": "Started\/Primary\/Active",
>> >>> >> >>> >         "enter_time": "2016-07-24 18:52:51.653375",
>> >>> >> >>> >         "might_have_unfound": [],
>> >>> >> >>> >         "recovery_progress": {
>> >>> >> >>> >             "backfill_targets": [ "580(5)" ],
>> >>> >> >>> >             "waiting_on_backfill": [ "580(5)" ],
>> >>> >> >>> >             "last_backfill_started": "981926c1\/default.421929.15_MY_OBJECT",
>> >>> >> >>> >             "backfill_info": {
>> >>> >> >>> >                 "begin": "391926c1\/default.9468.416_0080a34a\/head\/\/33",
>> >>> >> >>> >                 "end": "464b26c1\/default.284327.111_MBS-b965c481-4320-439b-ad56-9e4212c2fe8f\/CBB_WWTXPVDHCP\/C:\/Windows\/WinSxS\/amd64_wialx00a.inf_31bf3856ad364e35_6.3.9600.17415_none_b2e446f1befcebe5\/LXAA2DeviceDescription.xml:\/20150924104532\/LXAA2DeviceDescription.xml\/head\/\/33",
>> >>> >> >>> >                 "objects": [
>> >>> >> >>> >                     {
>> >>> >> >>> >                         "object": "391926c1\/default.9468.416_0080a34a\/head\/\/33",
>> >>> >> >>> >                         "version": "5356'86333"
>> >>> >> >>> >                     },
>> >>> >> >>> >                     ...
>> >>> >> >>> >
>> >>> >> >>> > Trying to understand what was going on, I shut off client traffic to the cluster and set debug_osd 20 debug_ms 1 on osd.580. I see the following section repeated ad infinitum:
>> >>> >> >>> >
>> >>> >> >>> > === BEGIN LOG ===
>> >>> >> >>> >
>> >>> >> >>> > 2016-07-25 15:56:12.682241 7f262e8ed700 1 -- 10.54.10.27:6818/913781 <== osd.453 10.54.10.28:7010/1375782 236358 ==== pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 ==== 309+0+0 (3953350617 0 0) 0x3a11d700 con 0x3656c420
>> >>> >> >>> > 2016-07-25 15:56:12.682273 7f262e8ed700 10 osd.580 25226 handle_replica_op pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 epoch 25226
>> >>> >> >>> > 2016-07-25 15:56:12.682278 7f262e8ed700 20 osd.580 25226 should_share_map osd.453 10.54.10.28:7010/1375782 25226
>> >>> >> >>> > 2016-07-25 15:56:12.682284 7f262e8ed700 15 osd.580 25226 enqueue_op 0x3d503600 prio 127 cost 0 latency 0.000051 pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
>> >>> >> >>> > 2016-07-25 15:56:12.682325 7f26724d1700 10 osd.580 25226 dequeue_op 0x3d503600 prio 127 cost 0 latency 0.000092 pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2 pg pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001)
>> >>> >> >>> > [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped]
>> >>> >> >>> > 2016-07-25 15:56:12.682353 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] handle_message: pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
>> >>> >> >>> > 2016-07-25 15:56:12.682366 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] do_scan pg_scan(get_digest 33.6c1s5 981926c1/default.421929.15_MY_OBJECT-0//0//33 e 25226/25226) v2
>> >>> >> >>> > 2016-07-25 15:56:12.682377 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] scan_range from 981926c1/default.421929.15_MY_OBJECT
>> >>> >> >>> > 2016-07-25 15:56:12.694086 7f26724d1700 10 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001) [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] got 0 items, next 981926c1/default.421929.15_MY_OBJECT
>> >>> >> >>> > 2016-07-25 15:56:12.694113 7f26724d1700 20 osd.580 pg_epoch: 25226 pg[33.6c1s5( v 25226'149584 (5459'139410,25226'149584] lb 981926c1/default.421929.15_MY_OBJECT local-les=5635 n=33203 ec=390 les/c 5635/25223 25224/25225/25001)
>> >>> >> >>> > [453,86,565,266,338,580,297,577,404]/[453,86,565,266,338,2147483647,297,577,404] r=-1 lpr=25225 pi=5460-25224/117 luod=0'0 crt=25226'149584 active+remapped] []
>> >>> >> >>> > 2016-07-25 15:56:12.694129 7f26724d1700 1 -- 10.54.10.27:6818/913781 --> 10.54.10.28:7010/1375782 -- pg_scan(digest 33.6c1s0 981926c1/default.421929.15_MY_OBJECT-981926c1/default.421929.15_MY_OBJECT e 25226/25226) v2 -- ?+4 0x3a7b7200 con 0x3656c420
>> >>> >> >>> > 2016-07-25 15:56:12.694233 7f26724d1700 10 osd.580 25226 dequeue_op 0x3d503600 finish
>> >>> >> >>> >
>> >>> >> >>> > === END LOG ===
>> >>> >> >>> >
>> >>> >> >>> > I'm in the process of digging through the OSD code to understand what's going on here, but I figured I would reach out to the community in the hopes that someone could point me in the right direction. If anyone has seen this before and can recommend a course of action, I'm all ears. And if there's any other information I can provide, please let me know what else would be helpful.
>> >>> >> >>> >
>> >>> >> >>> > Many thanks to any who can lend a hand or teach a man to fish.
>> >>> >> >>> >
>> >>> >> >>> > Brian Felton
>> >>> >> >>> >
>> >>> >> >>> > _______________________________________________
>> >>> >> >>> > ceph-users mailing list
>> >>> >> >>> > ceph-users@xxxxxxxxxxxxxx
>> >>> >> >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com