On 04/17/17 15:18, Sage Weil wrote:
> On Sat, 15 Apr 2017, Peter Maloney wrote:
>> Is this another scrub bug? Something just like this (1 or 2 requests
>> blocked forever until osd restart) has happened about 5 times so far,
>> each time during recovery or some other operation I triggered myself,
>> probably involving snapshots. This time I noticed that it says scrub
>> in the log. One other time it made a client block, but it didn't seem
>> to this time. I didn't have the same issue in 10.2.3, but I don't know
>> whether I was generating the same load (or whatever it is that causes
>> it) back then.
>>
>> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>>
>> If you want me to try 10.2.6 or 7 instead, I can do that, but no
>> guarantee I can reproduce it any time soon.
>
> There have been lots of scrub-related patches since 10.2.5, but I don't
> see one that would explain this. I'm guessing there is a scrub waitlist
> bug that we aren't turning up in qa because our thrashing tests are
> triggering lots of other actions in sequence (peering from up/down osds
> and balancing) and those probably have the effect of clearing the issue.
>
> Next time you see it, can you capture the output of 'ceph daemon osd.NNN
> ops' so we can see what steps the request went through? Also, any
> additional or more specific clues as to what might have triggered it
> would help.
>
> Thanks!
> sage

Here we go...

2017-04-20 05:03:58.522624 osd.20 10.3.0.132:6824/1088346 2078 :
cluster [WRN] slow request 20480.953923 seconds old, received at
2017-04-19 23:22:37.568657: replica scrub(pg:
4.2d,from:0'0,to:79873'5433886,epoch:79873,start:4:b46a9ab4:::rbd_data.43e4a9238e1f29.00000000000019fe:e438,end:4:b46a9ab4:::rbd_data.43e4a9238e1f29.00000000000019fe:e43a,chunky:1,deep:1,seed:4294967295,version:6)
currently reached_pg

root@ceph3:/var/log/ceph # ceph pg dump | grep -E "^pg_stat|^4.2d"
dumped all in format plain
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
4.2d 3494 0 0 0 0 13061467136 3012 3012 active+clean+scrubbing+deep 2017-04-19 23:11:22.763008 80142'5447712 80142:3027638 [14,7,20] 14 [14,7,20] 14 76449'5396453 2017-04-18 17:30:01.419498 73551'5298960 2017-04-15 03:23:48.821848

ceph daemon osd.14 ops > longscrubbug.14.log    # ceph1
ceph daemon osd.7 ops > longscrubbug.7.log      # ceph3
ceph daemon osd.20 ops > longscrubbug.20.log    # ceph2


peter@peter:~/tmp/longscrubbug $ cat longscrubbug.14.log
{
    "ops": [],
    "num_ops": 0
}

peter@peter:~/tmp/longscrubbug $ cat longscrubbug.7.log
{
    "ops": [],
    "num_ops": 0
}

peter@peter:~/tmp/longscrubbug $ cat longscrubbug.20.log
{
    "ops": [
        {
            "description": "replica scrub(pg: 4.2d,from:0'0,to:79873'5433886,epoch:79873,start:4:b46a9ab4:::rbd_data.43e4a9238e1f29.00000000000019fe:e438,end:4:b46a9ab4:::rbd_data.43e4a9238e1f29.00000000000019fe:e43a,chunky:1,deep:1,seed:4294967295,version:6)",
            "initiated_at": "2017-04-19 23:22:37.568657",
            "age": 30563.826540,
            "duration": 30563.826803,
            "type_data": [
                "reached pg",
                [
                    {
                        "time": "2017-04-19 23:22:37.568657",
                        "event": "initiated"
                    },
                    {
                        "time": "2017-04-19 23:22:37.568769",
                        "event": "queued_for_pg"
                    },
                    {
                        "time": "2017-04-19 23:22:37.568814",
                        "event": "reached_pg"
                    }
                ]
            ]
        }
    ],
    "num_ops": 1
}

To me that looks kinda non-useful... anything else you need?
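In case it helps, here is roughly what I plan to collect the next time it
happens (just a sketch; the output filenames are made up, and I'm assuming
the default log path and that these commands behave the same on 10.2.5):

    # the primary's full view of the stuck PG
    ceph pg 4.2d query > longscrubbug.pg-query.json

    # recent op history on the replica holding the stuck scrub op
    # (run on its own host, ceph2 in this case)
    ceph daemon osd.20 dump_historic_ops > longscrubbug.20.historic.json

    # anything scrub-related in the primary's log around the time it stuck
    grep -i scrub /var/log/ceph/ceph-osd.14.log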
(I think I'll leave this one alone for a while in case you reply soon and want something.)