Hi all,

We're running a 180-node cluster in Docker containers (the official ceph:hammer image). Recently we've found a rarely reproducible problem on it: sometimes data transfer freezes for a significant time (5-15 minutes). The issue occurs with both radosgw and librados applications (docker-distribution). It can be worked around by decreasing "ms tcp read timeout" to 2-3 seconds on the client side, but that does not seem like a good solution.

To reproduce the problem, I wrote a bash script that fetches every object (and its omap/xattrs) from the data pool with the 'rados' CLI utility in an infinite loop. Running it on 3 hosts simultaneously against docker-distribution's pool (4 MB objects) for 8 hours produced 25 reads that each took more than 60 seconds. The script results are here (hostnames substituted):
https://gist.github.com/aaaler/cb190c1eb636564519a5#file-distribution-pool-err-sorted

But there is nothing suspicious in the corresponding OSD logs. For example, take a look at one of these faulty reads:

21:44:32 consumed 1891 seconds reading blob:daa46e8d-170e-43ab-8c00-526782f95e02-0 on host1 (192.168.1.133)
osdmap e80485 pool 'distribution' (17) object 'blob:daa46e8d-170e-43ab-8c00-526782f95e02-0' -> pg 17.97f485f (17.5f) -> up ([139,149,167], p139) acting ([139,149,167], p139)

So we got 1891 seconds of waiting, after which the client simply proceeded without any errors. I tried to find something useful in the osd.139 log (https://gist.github.com/aaaler/cb190c1eb636564519a5#file-osd-139-log), but couldn't find anything interesting.

Another example (the next line in the script output) showed 2983 seconds spent reading blob:f5c22093-6e6d-41a6-be36-462330b36c67-71 from osd.56. Again, nothing in the osd.56 log during that time:
https://gist.github.com/aaaler/cb190c1eb636564519a5#file-osd-56-log

How can I troubleshoot this? Turning up logging across a 180-node cluster would generate a lot of traffic and make it hard to find the right host whose log to check :(

A few words about the underlying configuration:
- ceph:hammer containers in Docker 1.9.1 (--net=host)
- Gentoo with 3.14.18/3.18.10 kernels
- 1 Gbps LAN
- OSDs use a directory in /var
- the hosts share the OSD workload with some php-fpm instances

The configuration is pretty much default, except for a few OSD parameters tuned to reduce the scrubbing workload:

[osd]
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 5
osd recovery max active = 1
osd max backfills = 2

--
Sincerely,
Alexey Griazin
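
P.S. A few concrete snippets, in case they help. The client-side workaround mentioned above is just this in the clients' ceph.conf:

[client]
ms tcp read timeout = 3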
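
The reproduction loop is essentially the following (a simplified sketch: the pool name is hardcoded, and the real script also dumps each object's omap and xattrs):

#!/bin/bash
# Read every object in the pool forever; report any read slower than 60 s.
POOL=distribution

while true; do
    rados -p "$POOL" ls | while read -r obj; do
        start=$(date +%s)
        rados -p "$POOL" get "$obj" /dev/null
        elapsed=$(( $(date +%s) - start ))
        if [ "$elapsed" -ge 60 ]; then
            echo "$(date +%H:%M:%S) consumed $elapsed seconds reading $obj on $(hostname)"
        fi
    done
done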
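
The pg mapping quoted above can be reproduced for any object with:

ceph osd map distribution blob:daa46e8d-170e-43ab-8c00-526782f95e02-0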
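
One idea I had was to raise debug levels only on the implicated OSD once a slow read is detected, rather than cluster-wide, e.g.:

ceph tell osd.139 injectargs '--debug-ms 1 --debug-osd 10'

but suggestions for a better approach are welcome.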