On Fri, Oct 20, 2017 at 1:09 PM, J David <j.david.lists@xxxxxxxxx> wrote:
> On Thu, Oct 19, 2017 at 9:42 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>> I guess you have both read and followed
>> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/?highlight=backfill#debugging-slow-requests
>>
>> What was the result?
>
> Not sure if you’re asking Ольга or myself, but in my case, yes we have

Both.

> done exactly that and the result was described in the Oct 16th post on
> the (now poorly-named) “Re: osd max scrubs not honored?” thread.
> ( http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021588.html )

Well OK then, that explains why I know nothing about it :D

>
> The short version is that even while “ceph -w” is in the midst of
> spitting out “Health check update: XX slow requests are blocked >30
> seconds” messages, running “ceph daemon osd.# ops” for every OSD
> simultaneously reports no ops over about 21 seconds in age, and no
> operations that aren’t either currently in queued_for_pg (most of
> them) or previously in that state for a long time and now rapidly
> completing (a few of them).
>
> The three possible (blindly) speculated causes were:
> - Bad luck/timing due to the in-flight operation capture being a
>   manually initiated process. (Although the results are surprisingly
>   consistent.)
> - A locking issue that prevents “ceph daemon osd.# ops” from
>   reporting until the problem has gone away.
> - A priority queuing issue causing some requests to get starved out
>   by a series of higher priority requests, rather than a single slow
>   “smoking gun” request.
>
> Before that, we started with “ceph daemon osd.# dump_historic_ops”,
> but it showed roughly the same results. Without exception, any
> request displayed there that took more than 1 second spent almost
> its whole life in queued_for_pg.
>
> No further information has been gathered since then, as we have no
> idea where to go from here.
>
> Thanks!

Can you determine any pattern at all? Is every OSD affected, or only
some? Usually, in cases such as this, there is a discernible pattern
to be seen.

I would suggest turning up debugging on a range of OSDs that you know
will reproduce the problem (I'd try debug_osd 20 and debug_ms 5 if
disk space allows) and then following a single slow request through
the logs to see what can be gleaned from that.

I would also focus on gathering low-level statistics (iostat,
blktrace, etc.) from the suspect disks you mentioned in the other
thread to see if there is some sort of performance issue there.

As an aside, if you look at src/common/TrackedOp.cc and search for
"slow_op" you can see how the list is generated, so I suspect it's
some sort of timing issue that is stopping you from seeing a more
representative picture in dump_historic_ops. The trick is to get the
identifier for the slow operation and use that to track it in both
the dump_historic_ops output and the logs. It may take a while to
show up in the dump_historic_ops output, as it is only updated when
the event is "put" and marked "done", IIRC.

--
Cheers,
Brad
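
A minimal sketch of the debug-and-trace workflow suggested above,
assuming "osd.12" stands in for one of the affected OSDs, that logs
live in the default /var/log/ceph location, and that
"client.4123.0:12345" is a placeholder for whatever request id you
actually pull out of the ops output:

    # Bump debugging on one suspect OSD ("osd.12" is only an example id).
    ceph tell osd.12 injectargs '--debug_osd 20 --debug_ms 5'

    # While "ceph -w" is reporting blocked requests, capture in-flight
    # and recent ops via the admin socket on the node hosting osd.12.
    ceph daemon osd.12 ops > /tmp/osd.12-ops.json
    ceph daemon osd.12 dump_historic_ops > /tmp/osd.12-historic.json

    # Take a request identifier from that output (something like
    # "client.4123.0:12345") and follow it through the OSD log to see
    # where the time is being spent.
    grep 'client.4123.0:12345' /var/log/ceph/ceph-osd.12.log

    # Drop the debug levels back down when done (1/5 and 0/5 are the
    # usual defaults for debug_osd and debug_ms).
    ceph tell osd.12 injectargs '--debug_osd 1/5 --debug_ms 0/5'

Note that injectargs only changes the running daemon, so the settings
revert on the next OSD restart anyway; the explicit revert just avoids
filling the disk with debug logs in the meantime.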