On Wed, Feb 24, 2016 at 4:57 PM, <aaaler@xxxxxxxxx> wrote:
>So you don't find any slow requests
Yes, exactly
>we may have some problems with the poll call. The only potentially related PR is https://github.com/ceph/ceph/pull/6971
How can we verify this hypothesis?
Obviously, we can't install an experimental OSD version on our production cluster.
BTW, there's no unusual CPU usage on the OSD host at that moment.
I don't remember anyone reporting a similar problem, so raising debug_ms to 10/10 may help you find more. The other difference in your cluster is that Ceph runs in Docker with --net=host; I'm not sure whether that introduces any problems, but it's obviously not a very common deployment.
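For reference, a minimal sketch of bumping the messenger debug level on one
suspect OSD at runtime (osd.139 is used here only as an example), so it does
not have to be raised cluster-wide:

    # raise messenger debugging to 10/10 without restarting the daemon
    ceph tell osd.139 injectargs '--debug_ms 10/10'
    # lower it again afterwards to keep the log volume down
    ceph tell osd.139 injectargs '--debug_ms 0/5'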
2016-02-20 9:01 GMT+03:00 Haomai Wang <haomaiwang@xxxxxxxxx>:
>
>
> On Sat, Feb 20, 2016 at 2:26 AM, <aaaler@xxxxxxxxx> wrote:
>>
>> Hi All.
>>
>> We're running a 180-node cluster in docker containers -- the official
>> ceph:hammer image.
>> Recently, we've found a rarely reproducible problem on it: sometimes
>> data transfer freezes for a significant time (5-15 minutes). The issue
>> occurs while using radosgw & librados apps (docker-distribution).
>> This problem can be worked around by decreasing the "ms tcp read
>> timeout" parameter to 2-3 seconds on the client side, but that does
>> not seem like a good solution.
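>> (For clarity, the workaround amounts to something like the following
>> in the clients' ceph.conf, with 2 seconds shown only as an example:)
>>
>>     [client]
>>         ms tcp read timeout = 2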
>> To reproduce the problem, I've written a bash script that fetches
>> every object (and its omap/xattrs) from the data pool with the 'rados'
>> CLI utility in an infinite loop. Running it on 3 hosts simultaneously
>> against docker-distribution's pool (4 MB objects) for 8 hours resulted
>> in 25 reads, each of which took more than 60 seconds.
>> Script results here (hostnames substituted):
>>
>> https://gist.github.com/aaaler/cb190c1eb636564519a5#file-distribution-pool-err-sorted
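>> (Roughly, the loop amounts to the sketch below; the logging is
>> simplified and only the 'distribution' pool is shown:)
>>
>>     #!/bin/bash
>>     # walk every object in the pool, read its data, xattrs and omap,
>>     # and log any object that takes longer than 60 seconds to read
>>     pool=distribution
>>     while true; do
>>         for obj in $(rados -p "$pool" ls); do
>>             start=$(date +%s)
>>             rados -p "$pool" get "$obj" /dev/null
>>             rados -p "$pool" listxattr "$obj" >/dev/null
>>             rados -p "$pool" listomapvals "$obj" >/dev/null
>>             took=$(( $(date +%s) - start ))
>>             if [ "$took" -gt 60 ]; then
>>                 echo "$(date +%T) consumed $took seconds reading $obj"
>>             fi
>>         done
>>     done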
>> But there's nothing suspicious in the corresponding OSD logs.
>> For example, take a look at one of these faulty reads:
>> 21:44:32 consumed 1891 seconds reading
>> blob:daa46e8d-170e-43ab-8c00-526782f95e02-0 on host1(192.168.1.133)
>> osdmap e80485 pool 'distribution' (17) object
>> 'blob:daa46e8d-170e-43ab-8c00-526782f95e02-0' -> pg 17.97f485f (17.5f)
>> -> up ([139,149,167], p139) acting ([139,149,167], p139)
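>> (The mapping above is the output of an osd map lookup, roughly:)
>>
>>     ceph osd map distribution blob:daa46e8d-170e-43ab-8c00-526782f95e02-0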
>>
>> Thus, we've got 1891 seconds of waiting, after which the client just
>> proceeded without any errors, so I tried to find something useful in
>> the osd.139 logs
>> (https://gist.github.com/aaaler/cb190c1eb636564519a5#file-osd-139-log),
>> but could not find anything interesting.
>>
>> Another example (the next line in the script output) showed 2983
>> seconds spent reading blob:f5c22093-6e6d-41a6-be36-462330b36c67-71
>> from osd.56. Again, nothing in the osd.56 logs during that time:
>> https://gist.github.com/aaaler/cb190c1eb636564519a5#file-osd-56-log
>>
>> How can I troubleshoot this? Overly verbose logging on a 180-node
>> cluster would generate a lot of traffic and make it hard to find the
>> right host's log to check :(
>
>
> So you don't find any slow requests with ceph -s or in
> /var/log/ceph/ceph.log on the monitor side?
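> (i.e. something along these lines on a monitor node:)
>
>     # cluster-wide health summary; blocked/slow requests are reported here
>     ceph -s
>     ceph health detail
>     # check the cluster log on the monitor host as well
>     grep -i 'slow request' /var/log/ceph/ceph.log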
>
> You mentioned "ms tcp read timeout" has some effect on your case, so I
> guess we may have some problems with the poll call.
>
> The only potentially related PR is https://github.com/ceph/ceph/pull/6971
>
>>
>> A few words about the underlying configuration:
>> - ceph:hammer containers in docker 1.9.1 (--net=host)
>> - Gentoo with 3.14.18/3.18.10 kernels
>> - 1 Gbps LAN
>> - OSDs using a directory in /var
>> - hosts share the OSD workload with some php-fpm processes
>>
>> The configuration is pretty much the default, except for some OSD
>> parameters configured to reduce the scrubbing workload:
>> [osd]
>> osd disk thread ioprio class = idle
>> osd disk thread ioprio priority = 5
>> osd recovery max active = 1
>> osd max backfills = 2
>>
>> --
>> Sincerely, Alexey Griazin
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
>
> Best Regards,
>
> Wheat
Best Regards,
Wheat
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com