On Thu, Oct 19, 2017 at 9:42 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
> I guess you have both read and followed
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/?highlight=backfill#debugging-slow-requests
>
> What was the result?

Not sure if you’re asking Ольга or myself, but in my case, yes, we have done exactly that, and the result was described in the Oct 16th post on the (now poorly named) “Re: osd max scrubs not honored?” thread.
( http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021588.html )

The short version is that even while “ceph -w” is in the midst of spitting out “Health check update: XX slow requests are blocked >30 seconds” messages, running “ceph daemon osd.# ops” for every OSD simultaneously (roughly as sketched in the P.S. below) reports no ops older than about 21 seconds, and no operations that aren’t either currently in queued_for_pg (most of them) or were previously in that state for a long time and are now rapidly completing (a few of them).

The three possible (blindly) speculated causes were:

- Bad luck/timing, since the in-flight operation capture is a manually initiated process. (Although the results are surprisingly consistent.)
- A locking issue that prevents “ceph daemon osd.# ops” from reporting until the problem has gone away.
- A priority-queuing issue causing some requests to get starved out by a series of higher-priority requests, rather than a single slow “smoking gun” request.

Before that, we started with “ceph daemon osd.# dump_historic_ops”, but it showed roughly the same results: without exception, any request displayed there that took more than 1 second spent almost its whole life in queued_for_pg.

No further information has been gathered since then, as we have no idea where to go from here.

Thanks!
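
P.S. In case it helps anyone reproduce the capture: the kind of thing we mean by polling every OSD at once is sketched below. This is illustrative rather than the exact script we ran; it assumes jq is installed, the default admin-socket path, and the Luminous JSON layout for the ops output (older releases structure type_data differently).

  #!/bin/bash
  # Rough sketch only -- run on each OSD host while "ceph -w" is reporting
  # slow requests. Polls the admin socket of every local OSD and prints any
  # in-flight op older than AGE_LIMIT seconds.
  AGE_LIMIT=${1:-20}

  for sock in /var/run/ceph/ceph-osd.*.asok; do
      [ -e "$sock" ] || continue            # no local OSDs; glob did not expand
      echo "=== $sock ==="
      ceph daemon "$sock" ops |
          jq --argjson limit "$AGE_LIMIT" \
             '.ops[] | select(.age > $limit)
              | {age, description, flag_point: (.type_data.flag_point? // "n/a")}'
  done

Fanning it out to every OSD host at the same time (pdsh, ansible, or similar) while the slow-request messages are scrolling is the idea.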