Re: Slow requests

On Fri, Oct 20, 2017 at 8:23 PM, Ольга Ухина <olga.uhina@xxxxxxxxx> wrote:
> I was able to collect dump data during a slow request, but this time I saw
> that it was related to high load average and iowait, so I'll keep watching.
> This time it was on two particular OSDs, but yesterday it was on other OSDs.
> In the dump of these two OSDs I see that operations are stuck on
> queued_for_pg, for example:
>
>             "description": "osd_op(client.13057605.0:51528 17.15
> 17:a93a5511:::notify.2:head [watch ping cookie 94259433737472] snapc 0=[]
> ondisk+write+known_if_redirected e10936)",
>             "initiated_at": "2017-10-20 12:34:29.134946",
>             "age": 484.314936,
>             "duration": 55.421058,
>             "type_data": {
>                 "flag_point": "started",
>                 "client_info": {
>                     "client": "client.13057605",
>                     "client_addr": "10.192.1.78:0/3748652520",
>                     "tid": 51528
>                 },
>                 "events": [
>                     {
>                         "time": "2017-10-20 12:34:29.134946",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2017-10-20 12:34:29.135075",
>                         "event": "queued_for_pg"

This is set in OSD::enqueue_op
https://github.com/ceph/ceph/blob/34951266fe2ccc14ee1503d62e15a9dffad31a5f/src/osd/OSD.cc#L9050

>                     },
>                     {
>                         "time": "2017-10-20 12:35:24.555957",
>                         "event": "reached_pg"

This is set in OSD::dequeue_op
https://github.com/ceph/ceph/blob/34951266fe2ccc14ee1503d62e15a9dffad31a5f/src/osd/OSD.cc#L9093
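
So in your example roughly 55 seconds passed between the op being queued and
the PG actually picking it up. If you want to scan the whole dump for such
gaps, something like this works (only a sketch; it assumes jq is available,
that the ops appear under .ops as they do in recent releases, and that you
substitute one of the affected OSD ids for <id>):

  ceph daemon osd.<id> dump_historic_ops | jq -r '
    .ops[] | [
      (.type_data.events[] | select(.event == "queued_for_pg") | .time),
      (.type_data.events[] | select(.event == "reached_pg")    | .time),
      .description
    ] | @tsv'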

Looking at some debug logs for all of the OSDs involved (as previously
mentioned) may help us to work out what was happening during these 55-odd
seconds. Looking at fine-grained system performance statistics for the same
period may also shed some light on what is happening; something like sar might
help you to identify a problem area, and you can then use other tools to
investigate further.
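
For example (just a sketch -- adjust the ids, debug levels and sampling to
your environment, and remember to drop the log levels back down afterwards):

  ceph tell osd.* injectargs '--debug_osd 20 --debug_ms 1'
  sar -d -p 1 120 > /tmp/sar-disk.txt &   # per-device I/O stats, 1s samples
  sar -q 1 120 > /tmp/sar-load.txt &      # run queue length / load average
  # ...wait for or reproduce a slow request, then restore the defaults:
  ceph tell osd.* injectargs '--debug_osd 1/5 --debug_ms 0/5'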

>                     },
>                     {
>                         "time": "2017-10-20 12:35:24.555978",
>                         "event": "started"
>                     },
>                     {
>                         "time": "2017-10-20 12:35:24.556004",
>                         "event": "done"
>                     }
>                 ]
>             }
>         },
>
>
> I've read the thread
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021588.html.
> It is a very similar problem; could it be connected to Proxmox? I have a
> quite old version of proxmox-ve (4.4-80) and ceph jewel clients on the pve
> nodes.

Anything's possible.

>
> Best regards,
> Olga Ukhina
>
> Mobile: 8(905)-566-46-62
>
> 2017-10-20 11:05 GMT+03:00 Ольга Ухина <olga.uhina@xxxxxxxxx>:
>>
>> Hi! Thanks for your help.
>> How can I increase the history interval for the command ceph daemon osd.<id>
>> dump_historic_ops? It only shows several minutes.
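
By default the op tracker only keeps the last 20 completed ops for up to 600
seconds; both limits are tunable via osd_op_history_size and
osd_op_history_duration. Something like this should widen the window (the
values here are just examples):

  ceph tell osd.* injectargs '--osd_op_history_size 200 --osd_op_history_duration 3600'
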
>> I see slow requests on random OSDs each time and on different hosts (there
>> are three). As far as I can see in the logs, the problem is not related to
>> scrubbing.
>>
>> Regards,
>> Olga Ukhina
>>
>>
>> 2017-10-20 4:42 GMT+03:00 Brad Hubbard <bhubbard@xxxxxxxxxx>:
>>>
>>> I guess you have both read and followed
>>>
>>> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/?highlight=backfill#debugging-slow-requests
>>>
>>> What was the result?
>>>
>>> On Fri, Oct 20, 2017 at 2:50 AM, J David <j.david.lists@xxxxxxxxx> wrote:
>>> > On Wed, Oct 18, 2017 at 8:12 AM, Ольга Ухина <olga.uhina@xxxxxxxxx>
>>> > wrote:
>>> >> I have a problem with ceph luminous 12.2.1.
>>> >> […]
>>> >> I have slow requests on different OSDs at random times (for example at
>>> >> night), but I don’t see any other problems at the time they occur
>>> >> […]
>>> >> 2017-10-18 01:20:38.187326 mon.st3 mon.0 10.192.1.78:6789/0 22689 :
>>> >> cluster
>>> >> [WRN] Health check update: 49 slow requests are blocked > 32 sec
>>> >> (REQUEST_SLOW)
>>> >
>>> > This looks almost exactly like what we have been experiencing, and
>>> > your use-case (Proxmox client using rbd) is the same as ours as well.
>>> >
>>> > Unfortunately we were not able to find the source of the issue so far,
>>> > and haven’t gotten much feedback from the list.  Extensive testing of
>>> > every component has ruled out any hardware issue we can think of.
>>> >
>>> > Originally we thought our issue was related to deep-scrub, but that
>>> > now appears not to be the case, as it happens even when nothing is
>>> > being deep-scrubbed.  Nonetheless, although they aren’t the cause,
>>> > they definitely make the problem much worse.  So you may want to check
>>> > to see if deep-scrub operations are happening at the times where you
>>> > see issues and (if so) whether the OSDs participating in the
>>> > deep-scrub are the same ones reporting slow requests.
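
One quick way to cross-check that (a rough sketch -- the log path and the PG
id are just examples taken from this thread) is to grep the cluster log for
deep-scrub activity around the time of the slow requests and then map any PGs
involved back to their OSDs:

  grep deep-scrub /var/log/ceph/ceph.log | grep '2017-10-20 12:3'
  ceph pg map 17.15    # shows the up/acting OSD set for that PG
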
>>> >
>>> > Hopefully you have better luck finding/fixing this than we have!  It’s
>>> > definitely been a very frustrating issue for us.
>>> >
>>> > Thanks!
>>> > _______________________________________________
>>> > ceph-users mailing list
>>> > ceph-users@xxxxxxxxxxxxxx
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>> --
>>> Cheers,
>>> Brad
>>
>>
>



-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



