Thanks for the information! Based on my reading of
http://ceph.com/docs/next/rbd/rbd-config-ref I was under the impression
that the rbd cache options wouldn't apply, since presumably the kernel is
handling the caching. I'll have to toggle some of those values and see if
they make a difference in my setup.

I did some additional testing today. If I limit the write benchmark to 1
concurrent operation I see a lower bandwidth number, as I expected.
However, when writing to the XFS filesystem on an rbd image I see
transfer rates closer to 400MB/s.

# rados -p bench bench 300 write --no-cleanup -t 1
Total time run:         300.105945
Total writes made:      1992
Write size:             4194304
Bandwidth (MB/sec):     26.551
Stddev Bandwidth:       5.69114
Max bandwidth (MB/sec): 40
Min bandwidth (MB/sec): 0
Average Latency:        0.15065
Stddev Latency:         0.0732024
Max latency:            0.617945
Min latency:            0.097339

# time cp -a /mnt/local/climate /mnt/ceph_test1
real    2m11.083s
user    0m0.440s
sys     1m11.632s

# du -h --max-depth=1 /mnt/local
53G     /mnt/local/climate

That's roughly 53GB in a little over two minutes, or about 400MB/s,
versus the 26.5MB/s reported by the single-threaded rados bench. This
seems to imply that there is more than one concurrent operation when
writing into the filesystem on top of the rbd image. However, given that
the filesystem read speeds and the rados benchmark read speeds are much
closer in reported bandwidth, it's as if reads are occurring as a single
operation.

# time cp -a /mnt/ceph_test2/isos /mnt/local/
real    36m2.129s
user    0m1.572s
sys     3m23.404s

# du -h --max-depth=1 /mnt/ceph_test2/
68G     /mnt/ceph_test2/isos

That's about 68GB in 36 minutes, or around 32MB/s, in the same range as
the single-threaded rados read benchmarks quoted below.

Is this apparent single-threaded read and multi-threaded write behavior
with the rbd kernel module the expected mode of operation? If so, could
someone explain the reason for this limitation? Based on the information
on data striping in http://ceph.com/docs/next/architecture/#data-striping
I would assume that a format 1 image would stripe a file larger than the
4MB object size over multiple objects, and that those objects would be
distributed over multiple OSDs. This would seem to indicate that reading
a file back would be much faster, since even though Ceph only reads the
primary replica, the read is still distributed over multiple OSDs. At
worst I would expect something near the read bandwidth of a single OSD,
which would still be much higher than 30-40MB/s.
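For what it's worth, here's roughly how I plan to sanity-check that
striping assumption on one of my images; the pool and image names below
are just placeholders for my setup:

# rbd info bench/test-image
# rados -p bench ls | grep <block_name_prefix from rbd info> | head
# ceph osd map bench <one of the object names from the previous command>

If the image really is striped across the cluster, rbd info should report
the 4MB object size, and ceph osd map should show the individual objects
mapping to different OSDs.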
-Steve

On 07/24/2014 04:07 PM, Udo Lembke wrote:
> Hi Steve,
> I'm also looking for improvements of single-thread reads.
>
> A little bit higher values (twice?) should be possible with your config.
> I have 5 nodes with 60 4-TB hdds and got the following:
>
> rados -p test bench -b 4194304 60 seq -t 1 --no-cleanup
> Total time run:        60.066934
> Total reads made:      863
> Read size:             4194304
> Bandwidth (MB/sec):    57.469
> Average Latency:       0.0695964
> Max latency:           0.434677
> Min latency:           0.016444
>
> In my case I had some osds (xfs) with a high fragmentation (20%).
> Changing the mount options and defragmentation helped slightly.
> Performance changes:
>
> [client]
> rbd cache = true
> rbd cache writethrough until flush = true
>
> [osd]
> osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
> osd_op_threads = 4
> osd_disk_threads = 4
>
> But I expect much more speed for a single thread...
>
> Udo
>
> On 23.07.2014 22:13, Steve Anthony wrote:
>> Ah, ok. That makes sense. With one concurrent operation I see numbers
>> more in line with the read speeds I'm seeing from the filesystems on
>> the rbd images.
>>
>> # rados -p bench bench 300 seq --no-cleanup -t 1
>> Total time run:        300.114589
>> Total reads made:      2795
>> Read size:             4194304
>> Bandwidth (MB/sec):    37.252
>>
>> Average Latency:       0.10737
>> Max latency:           0.968115
>> Min latency:           0.039754
>>
>> # rados -p bench bench 300 rand --no-cleanup -t 1
>> Total time run:        300.164208
>> Total reads made:      2996
>> Read size:             4194304
>> Bandwidth (MB/sec):    39.925
>>
>> Average Latency:       0.100183
>> Max latency:           1.04772
>> Min latency:           0.039584
>>
>> I really wish I could find my data on read speeds from a couple weeks
>> ago. It's possible that they've always been in this range, but I
>> remember one of my test users saturating his 1GbE link over NFS while
>> copying from the rbd client to his workstation. Of course, it's also
>> possible that the data set he was using was cached in RAM when he was
>> testing, masking the lower rbd speeds.
>>
>> It just seems counterintuitive to me that read speeds would be so much
>> slower than writes at the filesystem layer in practice. With images in
>> the 10-100TB range, reading data at 20-60MB/s isn't going to be
>> pleasant. Can you suggest any tunables or other approaches to
>> investigate to improve these speeds, or are they in line with what
>> you'd expect? Thanks for your help!
>>
>> -Steve
>>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma310 at lehigh.edu