On Fri, Oct 20, 2017 at 1:09 PM, J David <j.david.lists@xxxxxxxxx> wrote:
> On Thu, Oct 19, 2017 at 9:42 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>> I guess you have both read and followed
>> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/?highlight=backfill#debugging-slow-requests
>>
>> What was the result?
>
> Not sure if you’re asking Ольга or myself, but in my case, yes we have

Both.

> done exactly that and the result was described in the Oct 16th post on
> the (now poorly-named) “Re: osd max scrubs not honored?” thread.
> ( http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021588.html )

Well OK then, that explains why I know nothing about it :D

>
> The short version is that even while “ceph -w” is in the midst of
> spitting out “Health check update: XX slow requests are blocked >30
> seconds” messages, running “ceph daemon osd.# ops” for every OSD
> simultaneously reports no ops over about 21 seconds in age, and no
> operations that aren’t either currently in queued_for_pg (most of
> them) or previously in that state for a long time and now rapidly
> completing (a few of them).
>
> The three possible (blindly) speculated causes were:
> - Bad luck/timing due to the in-flight operation capture being a
>   manually initiated process. (Although the results are surprisingly
>   consistent.)
> - A locking issue that prevents “ceph daemon osd.# ops” from
>   reporting until the problem has gone away.
> - A priority queuing issue causing some requests to get starved out
>   by a series of higher priority requests, rather than a single slow
>   “smoking gun” request.
>
> Before that, we started with “ceph daemon osd.# dump_historic_ops”,
> but it showed roughly the same results. Without exception, any
> request displayed there that took more than 1 second spent almost
> its whole life in queued_for_pg.
>
> No further information has been gathered since then, as we have no
> idea where to go from here.
>
> Thanks!

Can you determine any pattern at all? Is every OSD affected, or only
some? Usually, in cases such as this, there is a discernible pattern
to be seen.

I would suggest turning up debugging on a range of OSDs that you know
will reproduce the problem (I'd try debug_osd 20 and debug_ms 5 if
disk space allows) and then following a single slow request through
the logs to see what can be gleaned from that.

I would also focus on gathering low-level statistics (iostat,
blktrace, etc.) from the suspect disks you mentioned in the other
thread to see if there is some sort of performance issue there.

As an aside, if you look at src/common/TrackedOp.cc and search for
"slow_op" you can see how the list is generated, so I suspect it's
some sort of timing issue that is stopping you from seeing a more
representative picture in dump_historic_ops. The trick is to get the
identifier for the slow operation and use that to track it in both
the dump_historic_ops output and the logs. It may take a while to
show up in the dump_historic_ops output, as it is only updated when
the event is "put" and marked "done", IIRC.

--
Cheers,
Brad
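
A minimal sketch of the debug-and-trace workflow suggested above,
assuming "osd.12" stands in for one of the affected OSDs, that logs
live in the default /var/log/ceph location, and that
"client.4123.0:12345" is a placeholder for whatever request id you
actually pull out of the ops output:

    # Bump debugging on one suspect OSD ("osd.12" is only an example id).
    ceph tell osd.12 injectargs '--debug_osd 20 --debug_ms 5'

    # While "ceph -w" is reporting blocked requests, capture in-flight
    # and recent ops via the admin socket on the node hosting osd.12.
    ceph daemon osd.12 ops > /tmp/osd.12-ops.json
    ceph daemon osd.12 dump_historic_ops > /tmp/osd.12-historic.json

    # Take a request identifier from that output (something like
    # "client.4123.0:12345") and follow it through the OSD log to see
    # where the time is being spent.
    grep 'client.4123.0:12345' /var/log/ceph/ceph-osd.12.log

    # Drop the debug levels back down when done (1/5 and 0/5 are the
    # usual defaults for debug_osd and debug_ms).
    ceph tell osd.12 injectargs '--debug_osd 1/5 --debug_ms 0/5'

Note that injectargs only changes the running daemon, so the settings
revert on the next OSD restart anyway; the explicit revert just avoids
filling the disk with debug logs in the meantime.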