On Thu, Oct 19, 2017 at 9:42 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
> I guess you have both read and followed
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/?highlight=backfill#debugging-slow-requests
>
> What was the result?

Not sure if you’re asking Ольга or myself, but in my case, yes, we have done exactly that, and the result was described in the Oct 16th post on the (now poorly named) “Re: osd max scrubs not honored?” thread.
( http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021588.html )

The short version is that even while “ceph -w” is in the midst of spitting out “Health check update: XX slow requests are blocked >30 seconds” messages, running “ceph daemon osd.# ops” for every OSD simultaneously (roughly as sketched in the P.S. below) reports no ops older than about 21 seconds, and no operations that aren’t either currently in queued_for_pg (most of them) or were previously in that state for a long time and are now rapidly completing (a few of them).

The three possible (blindly) speculated causes were:

- Bad luck/timing, since the in-flight operation capture is a manually initiated process. (Although the results are surprisingly consistent.)
- A locking issue that prevents “ceph daemon osd.# ops” from reporting until the problem has gone away.
- A priority-queuing issue causing some requests to get starved out by a series of higher-priority requests, rather than a single slow “smoking gun” request.

Before that, we started with “ceph daemon osd.# dump_historic_ops”, but it showed roughly the same results: without exception, any request displayed there that took more than 1 second spent almost its whole life in queued_for_pg.

No further information has been gathered since then, as we have no idea where to go from here.

Thanks!
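
P.S. In case it helps anyone reproduce the capture: the kind of thing we mean by polling every OSD at once is sketched below. This is illustrative rather than the exact script we ran; it assumes jq is installed, the default admin-socket path, and the Luminous JSON layout for the ops output (older releases structure type_data differently).

  #!/bin/bash
  # Rough sketch only -- run on each OSD host while "ceph -w" is reporting
  # slow requests. Polls the admin socket of every local OSD and prints any
  # in-flight op older than AGE_LIMIT seconds.
  AGE_LIMIT=${1:-20}

  for sock in /var/run/ceph/ceph-osd.*.asok; do
      [ -e "$sock" ] || continue            # no local OSDs; glob did not expand
      echo "=== $sock ==="
      ceph daemon "$sock" ops |
          jq --argjson limit "$AGE_LIMIT" \
             '.ops[] | select(.age > $limit)
              | {age, description, flag_point: (.type_data.flag_point? // "n/a")}'
  done

Fanning it out to every OSD host at the same time (pdsh, ansible, or similar) while the slow-request messages are scrolling is the idea.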