I am trying to understand these drive throttle markers that were mentioned, to get an idea of why these drives are marked as slow.

Here is the iostat for the drive /dev/sdbm: http://paste.ubuntu.com/9607168/
An iowait of 0.79 doesn't seem bad, but a write wait of 21.52 seems really high.

Looking at the ops in flight: http://paste.ubuntu.com/9607253/

If we check against all of the osds on this node, this seems strange: http://paste.ubuntu.com/9607331/
I do not understand why this node has ops in flight while the remainder seem to be performing without issue. The load on the node is pretty light as well, with an average CPU at 16 and an average iowait of 0.79:

-----------------------------------------------------------------------
/var/run/ceph# iostat -xm /dev/sdbm
Linux 3.13.0-40-generic (kh10-4)   12/23/2014   _x86_64_   (40 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.94    0.00   23.30    0.79    0.00   71.97

Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sdbm       0.09    0.25  5.03  3.42   0.55   0.63   288.02     0.09  10.56    2.55   22.32   2.54   2.15
-----------------------------------------------------------------------

I am still trying to understand the osd throttle perf dump, so if anyone can help shed some light on this, that would be rad. From what I can tell from the perf dump, it points at 4 osds (the last one, .228, being the slow one currently).

I ended up pulling .228 from the cluster, and I have yet to see another slow/blocked osd in the output of ceph -s. The cluster is still rebuilding since I just pulled .228 out, but I am still getting at least 200MB/s via bonnie while the rebuild is occurring.

Finally, in case this helps anyone: a single 1 x 1GB upload takes around 2.0 - 2.5 minutes, but if we split a 10GB file into 100 x 100MB parts, the whole thing completes in about 1 minute. That works out to a 10GB file in about 1-1.5 minutes, or roughly 166.66MB/s, versus the 8MB/s I was getting before with sequential uploads. All of these uploads are coming from a single client via boto, which leads me to think this is a radosgw issue specifically.

This again makes me think that this is not a slow-disk issue but an overall radosgw issue. If it were structural in any way, I would expect all of rados/ceph's faculties to be hit, and the 8MB/s-per-client limit would be explained by client throttling once some ceiling is reached. As it turns out, I am not hitting a ceiling; some other aspect of radosgw or boto is limiting my throughput. Is this logic not correct? I feel like I am missing something.

Thanks for the help everyone!
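
P.S. In case anyone wants to reproduce the sequential-vs-parallel comparison above, here is a rough sketch of the kind of parallel multipart upload I mean, using boto's S3 multipart API against radosgw. This is not my exact script: the endpoint, credentials, bucket name, part size, and thread count below are all placeholders/assumptions, and a more careful version might use one connection per worker instead of sharing one.

-----------------------------------------------------------------------
import math
import os
from io import BytesIO
from multiprocessing.pool import ThreadPool

import boto
import boto.s3.connection

PART_SIZE = 100 * 1024 * 1024   # 100MB parts, as in the test above
PARALLEL = 10                   # parts in flight at once (assumption)


def connect():
    # Placeholder radosgw endpoint and credentials.
    return boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='rgw.example.com',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )


def upload_parallel(bucket_name, key_name, path):
    conn = connect()
    bucket = conn.get_bucket(bucket_name)
    mp = bucket.initiate_multipart_upload(key_name)

    size = os.path.getsize(path)
    nparts = int(math.ceil(size / float(PART_SIZE)))

    def send_part(part_num):
        # Read this part's byte range and push it as one multipart part.
        # Note: each in-flight part is held in memory (PART_SIZE bytes).
        with open(path, 'rb') as f:
            f.seek((part_num - 1) * PART_SIZE)
            data = f.read(PART_SIZE)
        mp.upload_part_from_file(BytesIO(data), part_num)

    try:
        ThreadPool(PARALLEL).map(send_part, range(1, nparts + 1))
        return mp.complete_upload()
    except Exception:
        mp.cancel_upload()
        raise

# Example: upload_parallel('testbucket', 'bigfile', '/tmp/10G.bin')
-----------------------------------------------------------------------

With something along these lines, the 100 x 100MB parts go up concurrently from the single client, which is roughly how I got the ~166MB/s figure instead of ~8MB/s with one sequential stream.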