I was able to collect dump data during a slow request, and this time I saw that it coincided with high load average and iowait, so I am keeping watch.
This time it was two particular OSDs, but yesterday it was other OSDs.
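For watching the iowait side of this while it happens, something like the following can be left running on the affected hosts (a minimal sketch, assuming sysstat is installed; the 2-second interval is arbitrary):

    # extended per-device stats every 2 s: watch await and %util
    # on the disks backing the affected OSDs
    iostat -x 2

    # load average and the %wa (iowait) figure in one shot
    top -bn1 | head -n 5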
In the dumps from these two OSDs I see that operations are stuck in queued_for_pg, for example:
"description": "osd_op(client.13057605.0:51528 17.15 17:a93a5511:::notify.2:head [watch ping cookie 94259433737472] snapc 0=[] ondisk+write+known_if_redirected e10936)", "initiated_at": "2017-10-20 12:34:29.134946", "age": 484.314936, "duration": 55.421058, "type_data": { "flag_point": "started", "client_info": { "client": "client.13057605", "client_addr": "10.192.1.78:0/3748652520", "tid": 51528 }, "events": [ { "time": "2017-10-20 12:34:29.134946", "event": "initiated" }, { "time": "2017-10-20 12:34:29.135075", "event": "queued_for_pg" }, { "time": "2017-10-20 12:35:24.555957", "event": "reached_pg" }, { "time": "2017-10-20 12:35:24.555978", "event": "started" }, { "time": "2017-10-20 12:35:24.556004", "event": "done" } ] } },
This is a very similar problem; could it be connected to Proxmox? I have a fairly old proxmox-ve version (4.4-80) and Ceph jewel clients on the PVE nodes.
Best regards,
Olga Ukhina
Mobile: 8(905)-566-46-62
2017-10-20 11:05 GMT+03:00 Ольга Ухина <olga.uhina@xxxxxxxxx>:
Hi! Thanks for your help.

How can I increase the history interval for the command ceph daemon osd.<id> dump_historic_ops? It only shows the last several minutes. I see slow requests on random OSDs each time, and on different hosts (there are three). As far as I can see in the logs, the problem is not related to scrubbing.
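For reference, that window is controlled by two OSD options, osd_op_history_size (number of ops kept, default 20) and osd_op_history_duration (seconds kept, default 600), and both can be raised at runtime. A sketch, assuming the admin socket is reachable on the OSD host; 200 and 3600 are arbitrary example values:

    ceph daemon osd.<id> config set osd_op_history_size 200
    ceph daemon osd.<id> config set osd_op_history_duration 3600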
Regards,
Olga Ukhina

2017-10-20 4:42 GMT+03:00 Brad Hubbard <bhubbard@xxxxxxxxxx>:

I guess you have both read and followed
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/?highlight=backfill#debugging-slow-requests
What was the result?
On Fri, Oct 20, 2017 at 2:50 AM, J David <j.david.lists@xxxxxxxxx> wrote:
> On Wed, Oct 18, 2017 at 8:12 AM, Ольга Ухина <olga.uhina@xxxxxxxxx> wrote:
>> I have a problem with ceph luminous 12.2.1.
>> […]
>> I have slow requests on different OSDs at random times (for example at night),
>> but I don’t see any other problems at the time they occur.
>> […]
>> 2017-10-18 01:20:38.187326 mon.st3 mon.0 10.192.1.78:6789/0 22689 : cluster
>> [WRN] Health check update: 49 slow requests are blocked > 32 sec
>> (REQUEST_SLOW)
>
> This looks almost exactly like what we have been experiencing, and
> your use-case (Proxmox client using rbd) is the same as ours as well.
>
> Unfortunately we were not able to find the source of the issue so far,
> and haven’t gotten much feedback from the list. Extensive testing of
> every component has ruled out any hardware issue we can think of.
>
> Originally we thought our issue was related to deep-scrub, but that
> now appears not to be the case, as it happens even when nothing is
> being deep-scrubbed. Nonetheless, although they aren’t the cause,
> they definitely make the problem much worse. So you may want to check
> to see if deep-scrub operations are happening at the times where you
> see issues and (if so) whether the OSDs participating in the
> deep-scrub are the same ones reporting slow requests.
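>
> One way to check that correlation (a sketch, assuming the cluster log is
> at its default location on a mon host; per-PG "deep-scrub starts" /
> "deep-scrub ok" lines end up there):
>
>     # when deep-scrubs started/finished, and on which PGs
>     grep 'deep-scrub' /var/log/ceph/ceph.log
>
>     # PGs scrubbing right now, with their acting OSD sets
>     ceph pg dump pgs_brief | grep -i scrub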
>
> Hopefully you have better luck finding/fixing this than we have! It’s
> definitely been a very frustrating issue for us.
>
> Thanks!
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com